GPCRTree: online hierarchical classification of GPCR function
© Davies et al; licensee BioMed Central Ltd. 2008
Received: 08 August 2008
Accepted: 21 August 2008
Published: 21 August 2008
G protein-coupled receptors (GPCRs) play important physiological roles by transducing extracellular signals into intracellular responses. Approximately 50% of all marketed drugs target a GPCR. There remains considerable interest in effectively predicting the function of a GPCR from its primary sequence.
Using techniques drawn from data mining and proteochemometrics, an alignment-free approach to GPCR classification has been devised. It uses a simple representation of a protein's physical properties. GPCRTree, a publicly-available internet server, implements an algorithm that classifies GPCRs at the class, sub-family and sub-subfamily level.
A selective top-down classifier was developed which assigns sequences within a GPCR hierarchy. Compared to other publicly available GPCR prediction servers, GPCRTree is considerably more accurate at every level of classification. The server has been available online since March 2008 at URL: http://igrid-ext.cryst.bbk.ac.uk/gpcrtree/.
The G protein-coupled receptors (GPCRs) comprise a diverse range of integral membrane proteins regulating many important physiological functions [1–3]. Ligand binding to a GPCR on the cell surface initiates cell signaling. An extremely heterogeneous set of molecules acts as GPCR ligands. The GPCRs are a common target for therapeutic drugs, and approximately 50% of all marketed drugs target GPCRs [4, 5]. In spite of their functional and sequence diversity, GPCRs share certain common structural features and show far greater conservation of three-dimensional structure than of primary sequence. This makes it difficult to develop a comprehensive sequence-based classification system for GPCR subtypes. The most commonly-used system of classification is that implemented in the GPCRDB database, which divides the GPCRs into six classes (Class A: Rhodopsin-like, with over 80% of all GPCRs in humans; Class B: Secretin-like; Class C: Metabotropic glutamate receptors; Class D: Pheromone receptors; Class E: cAMP receptors; and the much smaller Class F: Frizzled/smoothened family). Classes A, B, C and F are found in mammalian species, while Class D proteins are found only in fungi and Class E proteins are exclusive to Dictyostelium. The six classes are further divided into sub-divisions and sub-sub-divisions based on the function of a GPCR and its specific ligand.
Previous attempts at classifying GPCRs from their primary sequences have included motif-based classification tools [9, 10] and machine learning methods such as Hidden Markov Models (HMMs) [11, 12] and Support Vector Machines (SVMs). Several publicly-available SVM-based GPCR classifiers exist: PRED-GPCR [14, 15], GPCRpred and GPCRsclass. Some predictive techniques have used a combination of SVMs and HMMs. Other approaches towards GPCR classification have included Self-Organising Maps, Quasi-predictor Feature Classifiers and Decision Trees. GPCRTree is a new publicly-available server based on the idea of selecting the best classifier (from a set of candidate classifiers) at each node of the GPCR class tree.
A previously-constructed comprehensive dataset of GPCR sequences was used to train and test the classifier. Proteins shorter than 280 amino acids were removed, eliminating incomplete protein sequences. All identical sequences were removed to avoid redundancy, and classes with fewer than 10 examples were also removed. The dataset used to train the server contains 8222 protein sequences in 5 classes at the family level (A-E), 38 classes at the sub-family level, and 87 classes at the sub-subfamily level. Class F was not considered since it contains too few sequences to develop an accurate classification model. The system uses an alignment-independent classification system based on the physical properties of amino acids. Proteochemometrics uses 5 "z-values" (z1–z5) derived from 26 real physicochemical properties using principal component analysis [23, 24]. These five values are calculated for each amino acid in the sequence and are used to generate the 15 attribute values described previously, giving a purely numerical description of the protein.
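The three filtering rules above can be illustrated with a minimal sketch. It is not the authors' code: the function name, the `(sequence, label)` record layout, and the order in which the rules are applied are assumptions made for illustration, and the text does not state at which level of the hierarchy the class-size rule was applied.

```python
from collections import Counter

MIN_LENGTH = 280      # from the text: shorter sequences are treated as incomplete
MIN_CLASS_SIZE = 10   # from the text: classes with fewer examples are removed


def filter_dataset(records, min_length=MIN_LENGTH, min_class_size=MIN_CLASS_SIZE):
    """Apply the three filters to a list of (sequence, label) pairs.

    The record layout and the order of the filters are assumptions.
    """
    seen = set()
    kept = []
    for sequence, label in records:
        if len(sequence) < min_length:
            continue              # drop incomplete (short) sequences
        if sequence in seen:
            continue              # drop identical sequences
        seen.add(sequence)
        kept.append((sequence, label))

    # Count class sizes only after short and duplicate sequences are gone.
    counts = Counter(label for _, label in kept)
    return [(s, lab) for s, lab in kept if counts[lab] >= min_class_size]
```

Counting class sizes after the first two filters means a class can fall below the threshold because of removed duplicates; counting before would be an equally plausible reading of the text.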
The GPCRTree server classifies at the GPCR Class, Subfamily and Sub-Subfamily level. Hierarchical classification of a sequence is performed using a selective top-down approach, whereby each group of sibling nodes in the GPCR class tree becomes a flat classification problem solved using a standard classifier [25, 26], obviating the need to devise a novel classifier. The full dataset trains the root classifier, while only relevant subsets of the data are used to train classifiers at the subfamily and sub-subfamily levels. When an unclassified sequence is presented to the algorithm, the root level classifier assigns it to a class, which is then passed down to an appropriate classifier at the next level until it is assigned to a subfamily and a sub-subfamily. Instead of a single classification algorithm being used at each node of the class tree, many classifiers are trained using a subset of the training set called the sub-training set, and then tested using a separate part of the training set called the validation set. The classifier with the highest classification accuracy on the validation set is selected for that node. Eight standard classification algorithms were used as candidate classifiers at each node of the GPCR tree. All code was written using the open source WEKA data mining package [28, 29] and the default parameters were used for each algorithm.
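The selective top-down procedure can be sketched as follows. This is an illustrative Python outline, not the WEKA-based implementation: the two toy classifiers stand in for the eight candidate algorithms, and the every-third-example validation split and the refit on all node data after selection are assumptions, as are all function and class names.

```python
from collections import Counter, defaultdict


class MajorityClassifier:
    """Toy candidate: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class NearestNeighbour:
    """Toy candidate: 1-nearest-neighbour on squared Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]


def accuracy(model, X, y):
    return sum(model.predict(x) == t for x, t in zip(X, y)) / len(y)


def select_classifier(candidates, X, y):
    """Train each candidate on a sub-training set; keep the best on validation."""
    pairs = list(zip(X, y))
    train = [p for i, p in enumerate(pairs) if i % 3 != 2]
    val = [p for i, p in enumerate(pairs) if i % 3 == 2] or train
    tx, ty = zip(*train)
    vx, vy = zip(*val)
    best = max(candidates, key=lambda make: accuracy(make().fit(tx, ty), vx, vy))
    return best().fit(X, y)  # assumption: refit the winner on all node data


def train_tree(X, paths, candidates, depth=0):
    """paths[i] is a label tuple such as ("A", "Amine"), one entry per level."""
    labels = [p[depth] for p in paths]
    node = {"model": select_classifier(candidates, X, labels), "children": {}}
    if depth + 1 < len(paths[0]):
        groups = defaultdict(list)
        for x, p in zip(X, paths):
            groups[p[depth]].append((x, p))
        # Each child is trained only on the examples belonging to its parent label.
        for label, members in groups.items():
            xs, ps = zip(*members)
            node["children"][label] = train_tree(xs, ps, candidates, depth + 1)
    return node


def classify(node, x):
    """Pass x down the tree, one locally selected classifier per level."""
    path = []
    while node is not None:
        label = node["model"].predict(x)
        path.append(label)
        node = node["children"].get(label)
    return tuple(path)
```

Because selection happens independently at each node, different nodes may end up with different algorithms, which is the point of the selective approach. A misclassification at a higher level cannot be corrected lower down, a known property of top-down schemes.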
The GPCRTree server has been validated against three other predictive GPCR servers. The GPCRTree server was trained using the full GPCRTree dataset and then tested on the dataset of each of the other servers. GPCRTree produced accuracies of 97% at the Class level, 84% at the Sub-family level and 75% at the Sub-Subfamily level. This exceeds the accuracy of the PRED-GPCR server at the Class level and is comparable to it at the Sub-family level. It exceeds the GPCRpred server at all levels of the hierarchy. The GPCRsclass server was the most successful classifier at the most specific (sub-subfamily) level; this may be because the classifier is overly specialised, being applicable only to the Class A Amine sub-subfamily level. Of servers applicable to all GPCR classes, GPCRTree is the most accurate GPCR prediction server currently available.
GPCRTree is available through a web interface at http://igrid-ext.cryst.bbk.ac.uk/gpcrtree/. It was implemented using PHP, dHTML and a Java client. The PHP interface affords a simple and straightforward method to submit a protein sequence for evaluation. The code for the selective top-down approach, as previously published, required several changes to facilitate its effective integration into the server environment. Training was modified such that all GPCR proteins belonging to a class with 10 or more examples (protein sequences) were used. After training, the algorithm pauses and waits for input, which arrives when an auxiliary program makes a TCP socket connection to the selective top-down classifier. Upon connection, the auxiliary program sends the protein sequence to be classified and then pauses. The classifier makes a prediction and returns the result. A TCP connection has been used for several reasons. It allows multiple users to access the classifier: separate users can run separate auxiliary programs, and the classifier queues their requests, ensuring that only one invokes the classifier at any given time. The remainder are queued and serviced in the order of submission. Moreover, this architecture promotes portability. It may be necessary, for resource or security reasons, to run the classifier on different hardware. In this case, the server can invoke the auxiliary program, which connects over the network to the separate machine running the classifier.
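The request flow between the auxiliary program and the classifier can be sketched as below. The actual server uses a Java client and PHP front end; this Python outline only illustrates the pattern, and the newline-terminated text protocol, the function names and the backlog size are assumptions rather than details taken from GPCRTree.

```python
import socket
import threading


def start_classifier_server(classify, host="127.0.0.1", port=0):
    """Listen for sequences and answer one connection at a time.

    `classify` is any callable mapping a sequence string to a result string.
    Returns the listening socket and the port it was bound to.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))   # port 0 lets the OS choose a free port
    server.listen(16)           # waiting connections queue in the OS backlog

    def loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:     # listening socket was closed
                return
            # A single accept loop means only one request runs at a time.
            with conn, conn.makefile("r", encoding="ascii") as reader:
                sequence = reader.readline().strip()
                conn.sendall((classify(sequence) + "\n").encode("ascii"))

    threading.Thread(target=loop, daemon=True).start()
    return server, server.getsockname()[1]


def request_classification(sequence, host, port):
    """Auxiliary program: send one sequence, then wait for the prediction."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall((sequence + "\n").encode("ascii"))
        with conn.makefile("r", encoding="ascii") as reader:
            return reader.readline().strip()
```

Because the client only needs a host and port, the classifier process can run on a different machine from the web front end, which matches the portability argument made above.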
Where non-standard residues are included within the sequences, substitutions are made: a 'B' (asparagine or aspartic acid) is treated as asparagine 'N'; a 'Z' (glutamine or glutamic acid) is treated as glutamine 'Q'; and a 'U' (selenocysteine) is treated as cysteine 'C'. All unknown residues 'X' are treated as alanine 'A'.
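These substitutions amount to a fixed character mapping. A minimal sketch follows; the function name and the upper-casing of the input are assumptions for illustration.

```python
# Substitution table taken from the text: B->N, Z->Q, U->C, X->A.
SUBSTITUTIONS = str.maketrans({"B": "N", "Z": "Q", "U": "C", "X": "A"})


def normalise_sequence(sequence):
    """Replace non-standard residue codes before computing descriptors."""
    # Upper-casing is an assumption; the text does not mention letter case.
    return sequence.upper().translate(SUBSTITUTIONS)
```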
GPCR classification is among the most challenging problems in bioinformatics due to the sequence diversity of the GPCR superfamily and the uneven distribution of its various family subgroups. GPCRTree is the first server to implement an alignment-independent representation of protein sequences and is also the first to classify sequences using a classifier specifically selected for each group of sibling nodes in the GPCR functional classification tree. By selecting the best classifier (from a set of candidate classifiers) at each GPCR class tree node, the selective top-down method effectively exploits the fact that different classifiers have different biases that are more suitable for different classification problems. GPCRTree is currently the most accurate publicly-available server for classifying GPCR sequences, and it utilises a simple yet robust interface that can accept multiple classification requests simultaneously.
Availability and requirements
Project name: GPCRTree
Project home page: http://igrid-ext.cryst.bbk.ac.uk/gpcrtree/
Operating system(s): Platform independent
Programming language: PHP, dHTML, Java
Other requirements: None
Any restrictions to use by non-academics: None
GPCR: G protein-coupled receptor
TCP: Transmission Control Protocol
WEKA: Waikato Environment for Knowledge Analysis
SVM: Support Vector Machine
The authors should like to gratefully acknowledge funding under the EPSRC grant EP/D501377/1 and the European Union ImmunoGrid project FP6-2004-IST-4 (contract no. 028069).
- Christopoulos A, Kenakin T: G protein-coupled receptor allosterism and complexing. Pharmacol Rev. 2002, 54: 323-374. 10.1124/pr.54.2.323.
- Gether U, Asmar F, Meinild AK, Rasmussen SG: Structural basis for activation of G-protein-coupled receptors. Pharmacol Toxicol. 2002, 91: 304-312. 10.1034/j.1600-0773.2002.910607.x.
- Bissantz C: Conformational changes of G protein-coupled receptors during their activation by agonist binding. J Recept Signal Transduct Res. 2003, 23: 123-153. 10.1081/RRS-120025192.
- Flower DR: Modelling G-protein-coupled receptors for drug design. Biochim Biophys Acta. 1999, 1422: 207-234.
- Klabunde T, Hessler G: Drug design strategies for targeting G-protein-coupled receptors. ChemBioChem. 2002, 3: 928-944. 10.1002/1439-7633(20021004)3:10<928::AID-CBIC928>3.0.CO;2-5.
- Milligan G: G-protein-coupled receptor heterodimers: pharmacology, function and relevance to drug discovery. Drug Discov Today. 2006, 11: 541-549. 10.1016/j.drudis.2006.04.007.
- Davies MN, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR: Proteomic applications of automated GPCR classification. Proteomics. 2007, 7: 2800-14. 10.1002/pmic.200700093.
- Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G: GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res. 2003, 31: 294-7. 10.1093/nar/gkg103.
- Attwood TK: A compendium of specific motifs for diagnosing GPCR subtypes. Trends Pharmacol Sci. 2001, 22 (4): 162-165. 10.1016/S0165-6147(00)01658-8.
- Flower DR, Attwood TK: Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors. Semin Cell Dev Biol. 2004, 15: 693-701.
- Wistrand M, Kall L, Sonnhammer EL: A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci. 2006, 15: 509-21. 10.1110/ps.051745906.
- Sgourakis NG, Bagos PG, Papasaikas PK, Hamodrakas SJ: A method for GPCRs coupling specificity to G-proteins using refined profile Hidden Markov Models. BMC Bioinformatics. 2006, 6: 104. 10.1186/1471-2105-6-104.
- Karchin R, Karplus K, Haussler D: Classifying G-protein coupled receptors with support vector machines. Bioinformatics. 2002, 18: 147-159. 10.1093/bioinformatics/18.1.147.
- Papasaikas PK, Bagos PG, Litou ZI, Promponas VJ, Hamodrakas SJ: PRED-GPCR: GPCR recognition and family classification server. Nucleic Acids Res. 2004, 32: W380-382. 10.1093/nar/gkh431.
- Guo YZ, Li ML, Wang KL, Wen ZN, Lu MC, Liu LX, Lin J: Fast fourier transform-based support vector machine for prediction of G-protein coupled receptor subfamilies. Acta Biochim Biophys Sin (Shanghai). 2005, 37: 759-66. 10.1111/j.1745-7270.2005.00110.x.
- Bhasin M, Raghava GP: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res. 2004, 32: W383-9. 10.1093/nar/gkh416.
- Bhasin M, Raghava GP: GPCRsclass: a web tool for the classification of amine type of G protein-coupled receptors. Nucleic Acids Res. 2005, 33: W143-7. 10.1093/nar/gki351.
- Yabuki Y, Muramatsu T, Hirokawa T, Mukai H, Suwa M: GRIFFIN: a system for predicting GPCR-G-protein coupling selectivity using a support vector machine and a hidden Markov model. Nucleic Acids Res. 2005, 33: W148-53. 10.1093/nar/gki495.
- Vilo J, Kapushesky M, Kemmeren P, Sarkans U, Brazma A: Expression Profiler. The Analysis of Gene Expression Data: Methods and Software. Edited by: Parmigiani G, Garret ES, Irizarry R, Zeger SL. 2003, Springer Verlag, New York
- Kim J, Moriyama EN, Warr CG, Clyne PJ, Carlson JR: Identification of novel multi-transmembrane proteins from genomic databases using quasi-periodic structural properties. Bioinformatics. 2000, 16: 767-75. 10.1093/bioinformatics/16.9.767.
- Huang Y, Cai J, Li L, Yanda L: Classifying G-protein coupled receptors with bagging classification tree. Computational Biology and Chemistry. 2004, 28: 275-280. 10.1016/j.compbiolchem.2004.08.001.
- Davies MN, Gloriam DE, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR: On the hierarchical classification of G Protein Coupled Receptors. Bioinformatics. 2007, 23: 3113-3118. 10.1093/bioinformatics/btm506.
- Lapinsh M, Prusis P, Lundstedt T, Wikberg JE: Proteochemometrics modeling of the interaction of amine G-protein coupled receptors with a diverse set of ligands. Mol Pharmacol. 2002, 61: 1465-75. 10.1124/mol.61.6.1465.
- Freyhult E, Prusis P, Lapinsh M, Wikberg JE, Moulton V, Gustafsson MG: Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling. BMC Bioinformatics. 2005, 6: 50. 10.1186/1471-2105-6-50.
- Freitas AA, de Carvalho ACPLF: A Tutorial on Hierarchical Classification with Applications in Bioinformatics. Research and Trends in Data Mining Technologies and Applications. Edited by: Taniar D. 2007, Idea Group, 175-208.
- Costa EP, Lorena AC, Carvalho ACPLF, Freitas AA, Holden N: Comparing several approaches for hierarchical classification of proteins with decision trees. Proc. of the 2007 Brazilian Symposium on Bioinformatics (BSB-2007). 2007, Brazilian Symposium on Bioinformatics (BSB-2007)
- Davies MN, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR: On the hierarchical classification of G protein-coupled receptors. Bioinformatics. 2007, 23: 3113-8. 10.1093/bioinformatics/btm506.
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2005, Morgan Kaufmann, San Francisco
- Brownlee J: WEKA Classification Algorithms, Version 1.6. [http://sourceforge.net/projects/wekaclassalgos]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.