HMM-ModE: implementation, benchmarking and validation with HMMER3
© Sinha and Lynn; licensee BioMed Central Ltd. 2014
Received: 12 December 2013
Accepted: 21 July 2014
Published: 30 July 2014
HMM-ModE is a computational method that generates family specific profile HMMs using negative training sequences. The method optimizes the discrimination threshold using 10 fold cross validation and modifies the emission probabilities of profiles to reduce common fold based signals shared with other sub-families. The protocol depends on the program HMMER for HMM profile building and sequence database searching. The recent release of HMMER3 has improved database search speed by several orders of magnitude, allowing for the large scale deployment of the method in sequence annotation projects. We have rewritten our existing scripts both at the level of parsing the HMM profiles and modifying emission probabilities to upgrade HMM-ModE using HMMER3 that takes advantage of its probabilistic inference with high computational speed. The method is benchmarked and tested on GPCR dataset as an accurate and fast method for functional annotation.
The implementation of this method, which now works with HMMER3, is benchmarked with the earlier version of HMMER, to show that the effect of local-local alignments is marked only in the case of profiles containing a large number of discontinuous match states. The method is tested on a gold standard set of families and we have reported a significant reduction in the number of false positive hits over the default HMM profiles. When implemented on GPCR sequences, the results showed an improvement in the accuracy of classification compared with other methods used to classify the familyat different levels of their classification hierarchy.
The present findings show that the new version of HMM-ModE is a highly specific method used to differentiate between fold (superfamily) and function (family) specific signals, which helps in the functional annotation of protein sequences. The use of modified profile HMMs of GPCR sequences provides a simple yet highly specific method for classification of the family, being able to predict the sub-family specific sequences with high accuracy even though sequences share common physicochemical characteristics between sub-families.
Heuristic methods like BLAST  and FASTA  are commonly employed for the task of assigning function to a protein sequence on the basis of sequence similarity. In many cases, where a close homolog with known sequence and structure is not known, the sequence-sequence comparison methods show poor sensitivity. Profile Hidden Markov Model (HMM) like HMMER  and SAM  provide increased sensitivity in detecting remote homologs as sequence profiles are a better representation of a set of homologous sequences than a single sequence. Profile HMMs however have poor specificity in case of protein families with closely related function because of the high probability of selecting sequences from other sub families based on fold signals common to the family. The Pfam database  uses curated thresholds as an additional aid to the E-value: a Trusted Cutoff (TC), a Noise Cutoff (NC) and a Gathering threshold (GA), where TC > GA > NC. These criteria do not hold uniformity when applied to pre-classified positive and negative training sequence data because there may be cases in which negative sequences have higher scores than positive sequences. Earlier work, though not in wide spread use, have attempted to minimize the effect of non- discriminating residues for fold and function level classification through varied concepts like using negative examples for training HMMs, through the modification of emission  and transition probabilities  and through using positional entropy  to classify sequences both at fold and function level. HMM-ModE  is a method that generates family specific profile HMMs, through HMMER, by optimizing the discrimination threshold using the mode of average MCC (Mathews correlation coefficient) distribution from 10-fold cross validation and modifying the emission probabilities using negative training sequences. The protocol is much faster in training because only the sequences selected as false positives by the subfamily HMM, are used to modify model parameters and optimize the discrimination threshold. It provides a significant improvement over the existing methods for classification of fold and function specific signals. Another reason for the limited use of profile HMMs in primary annotation of functions is that the algorithms for database searching were significantly slower than heuristic local alignment methods. Recently, HMMER has been upgraded to a new version which improves database search speed by several orders of magnitude . We have implemented our method to use the recent release of HMMER. This new version of HMM-ModE with HMMER3 will be useful for the large scale deployment in sequence annotation projects. In this work, the performance of the method is benchmarked using both HMMER2 and HMMER3, and validated on a set of pre-classified enzyme superfamilies which are clustered according to specific sequence, structure and functional criteria to be used as a gold standard in family and superfamily clustering methods . In addition, we have also compared all the results reported in this manuscript with the earlier version of the method.
The separation of the fold and function specific signals is important and is a powerful way for classification when the sequences are classified into families and subfamilies as evident from GPCRs, protein kinase and Enzyme classification. Another similar method and resource for automated sub-family classification and identification was developed earlier through a three stage process, estimating a functional hierarchy for each protein family and sub families; using Hidden Markov Models to model both the family-defining and sub-family defining signatures; and using sub-family HMMs to assign novel sequences to functional subtypes . Our focus has been to improve the specificity for any dataset which is classified in a hierarchical fashion. One of the examples for such dataset is G protein coupled receptors (GPCRs).
GPCRs constitute a large family of integral membrane proteins. They form a unique modular system for signal transduction thereby allowing transmission of various signals across and between cells. The name GPCR reflects its involvement in the process of receptor signalling in the cellular environment via GTP binding proteins. They are known to mediate a variety of cellular and physiological signals and are also known as seven transmembrane (7TM) receptors . Found only in eukaryotes, they are an important family of proteins both at the physiological level, where they mediate functions like signal transduction, and at the pharmacological level, serving as important drug targets. Therefore, much of the effort at the research level is now focused on the development of methods for accurate classification of GPCRs. However, since GPCRs are known to have functional as well as sequential diversity hence their classification is a daunting task.
In recent times, one of the most comprehensive and widely used classifications for the GPCR families is provided by GPCRDB . The resource classifies GPCRs into five classes: Class A is the Rhodopsin like, Class B is Secretin like, and Class C is the Metabotropic glutamate/pheromone, Vomeronasal receptors (V1R and V3R) and Taste receptors (T2R). These classes are further classified into sub-families and sub- subfamilies based on the function of GPCR protein and the substrate specificity.
As expected from the growing interest of both academic and industrial researchers, several methods have been proposed for the prediction and classification of GPCRs. Earlier methods made use of covariant discriminant algorithm  and bagging classification tree  to classify GPCR sequences based on their amino acid composition. The discrimination power of machine learning techniques, like support vector machines, had also been used to classify GPCRs at different levels . GPCRPred  implemented SVM to classify GPCR sequences based on their dipeptide composition. There are other methods that are implemented on the transmembrane pattern analysis as these regions are known to be conserved across GPCRs of similar functionality . Hidden Markov models have also been utilized to predict and classify GPCR sequences [20, 21]. A majority of these methods use the amino acid composition or the dipeptide composition for classification, or both . A recent method PCA GPCR , has used a comprehensive set of 1497 sequence derived features to classify GPCRs into different levels.
In the present study, we have also made an attempt to compare the accuracy of this new version of HMM-ModE to the existing methods available for the classification of GPCRs. We have used the dataset used by the method PCA-GPCR  for the comparative analysis.
Results and discussion
Benchmarking of the method using HMMER2 and HMMER3
An immediate concern in the implementation of the HMM-ModE protocol with HMMER3 is that this version has only local-local alignments. HMM-ModE can improve signals normally associated with substrate specificity which are differentially conserved in protein superfamilies, and should implicitly benefit from global or “glocal” (align a complete model to a subsequence of the target) alignments.
The migration of the method to newer version of HMMER has the advantage of higher search speed of execution which is a driving feature in large scale functional annotation. The speed improvements using HMMER3 is evident from a case study, where the HMM-ModE profiles of AGC kinase protein sub families were scanned against the Uniprot database using 'hmmsearch’ from HMMER2 and HMMER3 as shown in the Additional file 1. However, besides conviction in the accuracy and the size of the training dataset, which was the criteria employed with the HMM-ModE protocol used with HMMER2 to assign confidence in the profile, it is recommended that the specificity values reported during profile building be used as an additional criterion in the use of HMM-ModE protocol with HMMER3. Low specificity during profile building will point to cross-specificity with other families in the training set, where the threshold identified using the protocol may not be sufficient to properly discriminate the families during a classification exercise.
Case study for performance evaluation and validation
Performance evaluation of the new version of our method HMM-ModE on ‘gold standard’ dataset and comparison with HMM-ModE/HMMER2 
HMM-ModE with HMMER3 (HMMER2)
Vicinal Oxygen chelate
It is observed, from Table 1, that the specificity increased in four of the five superfamilies tested - the Amino Hydrolase, Enolase, Crotonase and Haloacid Dehydrogenase superfamilies. In most cases there was sufficient difference in the scores of the positive training sequences and the sequences from other classes to calculate a discriminating threshold which provided a specificity of 1.0 with the test data. In the case of Enoyl-CoA hydratase, there were sufficient common match states between related sub-families to provide a case study where the HMM-ModE protocol could improve discrimination by separating positive and negative sequences through dampening from match states differentially conserved between sub-families. The case of the Vicinal Oxygen chelate superfamily provides two cases, 2,3-Dihydroxybiphenyl dioxygenase and Catechol 2,3-dioxygenase where a larger number of negative training sequences score higher than members of the sub-family used as positive training sequences. In these cases, there will be a trade-off between the sensitivity and specificity, as no perfect discrimination can be made in classifying these sequences. However, in general, the present findings lead us to conclude that HMM-ModE tends to reduce the number of false positives without significantly affecting the true positive sequences for their classification into fold (superfamily) and function (family) respectively. These results are significant on this new dataset and complement our previous findings on AGC kinase and GPCR datasets .
Application on GPCR datasets
The GPCR profile HMMs generated, as discussed in Methods, are combined to make a profile database that serve as a resource for classifying novel GPCR sequences at sub family level. Each of these profiles is provided with a discrimination threshold generated during cross validation (See Methods). We have used different datasets (available as Additional file 2), to evaluate and compare our method with the existing ones as described in the following subsections. In addition we have also used the previous version of the method in order to show that both the versions perform equally well but the newer version offers more speed due to the inclusion of very fast and accurate hmm searches.
The table shows the comparison of our method with other methods and with HMM-ModE/HMMER2  to classify D167 dataset
The D167 dataset have sequences belonging to Acetylcholine, Adrenoceptors, dopamine and Serotonin sub families from Amine family of Class A Rhodopsin like GPCR. Except for Acetylcholine, the performance of our method is better or equally good for each of these subfamilies. However it is pertinent to note that this is further improved after verification of the sequences in the dataset. Two out of the 31 Acetylcholine sequences not identified as true positive by our method, were found to belong to the histamine subfamiliy from GPCRDB. As the dataset was not annotated with the original accession number of the sequences, we retrieved this information from UniprotKB. The results infer these to have the same Q9QYN8 accession (HRH3_RAT) which is a reviewed entry in UniprotKB annotated as Histamine H3 receptor (Rattus norvegicus) and are the same sequence. The effective number of sequences in the Acetylcholine set, therefore, would be 29 all of which are classified accurately by our method giving 100% accuracy for this sub family as well. This also indicates that the use of modified profile HMMs for the classification of GPCR sequences aid in identification of misclassified sequences as well.
The table shows the comparison of our method with PCA-GPCR and with HMM-ModE/HMMER2  to classify D566 dataset
D1238 and D365 dataset
Comparison with PCA-GPCR and HMM-ModE/HMMER2  to classify D1238 and D365 dataset
This table shows the performance of our method, HMM-ModE/HMMER2  and PCA_GPCR on GPCR_human dataset
We have upgraded the HMM-ModE method to use HMMER3 and shown its importance to functionally annotate sequences belonging to hierarchically classified data. In general, the use of only two classes, positive and negative in training, reduces all sequences not belonging to the positive class into the negative class. With large datasets, the negative training probabilities would tend to be the same as the null probabilities. As negative training data is significantly larger in size than positive training data, the speed of implementation of the method HMM-ModE improves by only selecting false positives from the negative training data, thus limiting its size to those sequences that significantly influence discrimination. The protocol has now been implemented for use with HMMER3, which will permit large scale sequence classification projects through its improved speed. However, besides curated and sufficiently large training sequence datasets, it is recommended that the specificity reported during training be used as a caution in assigning confidence to a profile in such an exercise.
Dataset for benchmarking the method HMM-ModE with HMMER3
The hierarchically classified dataset of ENZYME database is used for the comparison of the present version of HMM-ModE. We have used 19Jan2010 and 19Feb2014 release of ENZYME database. The HMM-ModE profiles of 19Jan2010 were built using the in-house method ModEnzA , which is an implementation of HMM-ModE with HMMER2 for accurate identification of enzymes. In order to benchmark the method using both versions of HMMER, the dataset was filtered to only include enzyme classes which (i) did not have any change in their size between the two ENZYME database releases (ii) did not contain any fragment sequences and (iii) where the sensitivity was 1 using both default HMMER2 and HMMER3. The 416 enzymes that meet with the above criteria is listed in Additional file 4. Profiles for these enzyme sequences were rebuilt with HMMER3 using the 19Jan2010 ENZYME dataset for direct comparison.
To further validate the discriminating power of the HMM-ModE protocol, a gold standard dataset previously described  was pruned to include only families with more than 10 sequences and remove silver standard sequences which were also included in the file. This modified dataset is provided as Additional file 5.
Dataset to construct GPCR profiles using HMM-ModE
A set of labelled GPCR sequences were downloaded from http://www.uniprot.org/docs/7tmrlist that are classified based on receptor ligand relationship. The total number of sequences in the release 2012_05 of 16-May-2012 is 3071. In order to maintain only a set of well characterized sequences for constructing the profiles, we have reduced the dataset by applying a couple of filters. Firstly, the sequences that are putative, hypothetical or predictive are removed from the dataset. Secondly, only the sequences with seven transmembranes were kept for further analysis and the presence of these seven-transmembrane helices in the reduced dataset was confirmed by GPCR-HMM . Thereafter, we have used 2110 sequences, which belong to different GPCR subfamilies and each of these sub families contain at least three sequences. A table having a list of subfamilies taken for training purpose along with number of sequences in each of these subfamilies is provided as Additional file 6. In cases, where the number of sequences in a subfamily is small (~10 or less), it is advised to mine similar sequences using BLAST  and then create the HMM profile using our method for annotation tasks.
Dataset for comparative analysis
We have used D167 and D566 dataset from one of the recent method, PCA-GPCR, for classification of GPCR sequences which consist of subfamilies classified on the basis of substrate specificity. We have also used two other datasets D1238 and D365, from the same resource which contain sequences at the class level having a broader functional classification. A new dataset, GPCR_human, have been created from the GPCRDB database  to test the respective methods. This dataset contains sequences belonging to Homo sapiens and includes Muscarinic Acetylcholine, Adrenoceptors, Dopamine, Histamine, Serotonin and Trace amine subfamilies having 11, 24, 17, 16, 26 and 23 sequences respectively.
Procedure to construct HMM-ModE profiles of GPCR proteins
In order to generate highly specific profile HMMs from the curated GPCR dataset, the sequences of each subfamily were aligned separately using Praline-TM . The subsequent alignments of each subfamily were combined to create a master alignment using MAFFT  profile-profile alignment. For each sub-family, a HMM profile is generated from the alignment of its sequences which is then used to identify the false positive sequences from rest of the sub-families. The True Positives (TPs) are defined as the members belonging to a particular subfamily while the sequences belonging to different subfamilies which are picked up by the TP subfamily when scanned across all the sequences, are categorized as False Positives (FPs). Having known the TPs and FPs, the master alignment is then used to retrieve the TP and FP alignments. The purpose of using the master alignment is to ensure that the multiple alignment columns are comparable between the corresponding HMM match states for the true and false positive profiles. This is a critical requirement, as the emission probabilities from the corresponding columns are extracted directly from the HMM profile for calculation of relative entropy, and the resultant modification of emission probability. The TP and the FP alignment are subsequently used to identify the discriminating positions for fold specific signals. The emission probabilities corresponding to these positions are modified using relative entropy as discussed in our earlier work . The profiles modified in terms of changed emission probabilities are used with a discrimination threshold, which is the value for the threshold used in profile HMM built from HMMER3 , generated through tenfold cross validation. The use of this cut off enables the profile to make highly specific classification at a finer functional level like subfamily. These profiles along with the discrimination threshold are made available as Additional file 7.
Availability of supporting data
The data used in the manuscript is provided as following Additional files.
The authors wish to thank, Jianyi Yang, for providing the GPCR dataset used in this manuscript for the comparative study and Dhwani Desai, Soumyadeep Nandi and Prashant Srivastava for providing their original scripts for the method HMM-ModE. The authors also wish to acknowledge the anonymous reviewers for suggesting critical improvements to the manuscript.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.PubMedView ArticleGoogle Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988, 85: 2444-2448. 10.1073/pnas.85.8.2444.PubMedPubMed CentralView ArticleGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.PubMedView ArticleGoogle Scholar
- Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998, 14: 846-856. 10.1093/bioinformatics/14.10.846.PubMedView ArticleGoogle Scholar
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301. 10.1093/nar/gkr1065.PubMedPubMed CentralView ArticleGoogle Scholar
- Mamitsuka H: A learning method of hidden Markov models for sequence discrimination. J Comput Biol. 1996, 3: 361-373. 10.1089/cmb.1996.3.361.PubMedView ArticleGoogle Scholar
- Wistrand M, Sonnhammer ELL: Improving profile HMM discrimination by adapting transition probabilities. J Mol Biol. 2004, 338: 847-854. 10.1016/j.jmb.2004.03.023.PubMedView ArticleGoogle Scholar
- Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol. 2000, 303: 61-76. 10.1006/jmbi.2000.4036.PubMedView ArticleGoogle Scholar
- Srivastava P, Desai D, Nandi S, Lynn A: HMM-ModE–Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics. 2007, 8: 104-10.1186/1471-2105-8-104.PubMedPubMed CentralView ArticleGoogle Scholar
- Eddy SR: Accelerated profile HMM searches. PLoS Comput Biol. 2011, 7: e1002195-10.1371/journal.pcbi.1002195.PubMedPubMed CentralView ArticleGoogle Scholar
- Brown SD, Gerlt JA, Seffernick JL, Babbitt PC: A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biol. 2006, 7: R8-10.1186/gb-2006-7-1-r8.PubMedPubMed CentralView ArticleGoogle Scholar
- Brown DP, Krishnamurthy N, Sjolander K: Automated protein sub family identification and classification. PLoS Comput Biol. 2007, 3 (8): e160-10.1371/journal.pcbi.0030160.PubMedPubMed CentralView ArticleGoogle Scholar
- Venkatakrishnan AJ, Deupi X, Lebon G, Tate CG, Schertler GF, Babu MM: Molecular signatures of G-protein-coupled receptors. Nature. 2013, 494: 185-194. 10.1038/nature11896.PubMedView ArticleGoogle Scholar
- Vroling B, Sanders M, Baakman C, Borrmann A, Verhoeven S, Klomp J, Oliveira L, Vlieg JD, Vriend G: GPCRDB: information system for G protein-coupled receptors. Nucleic Acids Res. 2010, 39: D309-D319.PubMedPubMed CentralView ArticleGoogle Scholar
- Elrod WD, Chou KC: A study on the correlation of G-protein-coupled receptor types with amino acid composition. Protein Eng. 2002, 15: 713-715. 10.1093/protein/15.9.713.PubMedView ArticleGoogle Scholar
- Huang Y, Cai J, Ji L, Li Y: Classifying G-protein coupled receptors with bagging classification tree. Comput Biol Chem. 2004, 28: 275-280. 10.1016/j.compbiolchem.2004.08.001.PubMedView ArticleGoogle Scholar
- Karchin R, Karplus K, Haussler D: Classifying G-protein coupled receptors with support vector machines. Bioinformatics. 2002, 18: 147-159. 10.1093/bioinformatics/18.1.147.PubMedView ArticleGoogle Scholar
- Bhasin M, Raghava GPS: GPCRpred: an SVM-based method for prediction of families and subfamilies of G- protein coupled receptors. Nucleic Acids Res. 2004, 32: W383-W389. 10.1093/nar/gkh416.PubMedPubMed CentralView ArticleGoogle Scholar
- Inoue Y, Ikeda M, Shimizu T: Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. Comput Biol Chem. 2004, 28: 39-49. 10.1016/j.compbiolchem.2003.11.003.PubMedView ArticleGoogle Scholar
- Papasaikas PK, Bagos PG, Litou ZI, Promponas VJ, Hamodrakas SJ: PRED-GPCR: GPCR recognition and family classification server. Nucleic Acids Res. 2004, 32: W380-W382. 10.1093/nar/gkh431.PubMedPubMed CentralView ArticleGoogle Scholar
- Qian B, Soyer OS, Neubig RR, Goldstein RA: Depicting a protein’s two faces: GPCR classification by phylogenetic tree-based HMMs. FEBS Lett. 2003, 554: 95-99. 10.1016/S0014-5793(03)01112-8.PubMedView ArticleGoogle Scholar
- Gao QB, Wang ZZ: Classification of G-protein coupled receptors at four levels. Protein Eng Design Select. 2006, 19: 511-516. 10.1093/protein/gzl038.View ArticleGoogle Scholar
- Peng ZL, Yang JY, Chen X: An improved classification of G-protein-coupled receptors using sequence-derived features. BMC Bioinformatics. 2010, 11: 420-10.1186/1471-2105-11-420.PubMedPubMed CentralView ArticleGoogle Scholar
- Desai DK, Nandi S, Srivastava PK, Lynn AM: ModEnzA: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities. Adv Bioinformatics. 2011, 2011: 743782-PubMedPubMed CentralView ArticleGoogle Scholar
- Gao QB, Wu CC, Ma XQ, Lu J, He J: Classification of amine type G-protein coupled receptors with feature selection. Protein Pept Lett. 2008, 15: 834-842. 10.2174/092986608785203755.PubMedView ArticleGoogle Scholar
- Davies MN, Gloriam DE, Secker A, Freitas AA, Mendao M, Timmis J, Flower DR: Proteomic applications of automated GPCR classification. Proteomics. 2007, 7 (16): 2800-2814. 10.1002/pmic.200700093.PubMedView ArticleGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.PubMedPubMed CentralView ArticleGoogle Scholar
- Eddy SR: A new generation of homology search tools based on probabilistic inference. Genome Inform. 2004, 23: 205-211.Google Scholar
- Wistrand M, Kall L, Sonnhammer ELL: A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci. 2006, 15: 509-521. 10.1110/ps.051745906.PubMedPubMed CentralView ArticleGoogle Scholar
- Pirovano W, Feenstra KA, Heringa J: PRALINE-TM: a strategy for improved multiple alignment of transmembrane proteins. Bioinformatics. 2008, 24: 492-497. 10.1093/bioinformatics/btm636.PubMedView ArticleGoogle Scholar
- Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002, 30: 3059-3066. 10.1093/nar/gkf436.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.