Systematic search for putative new domain families in Mycoplasma gallisepticum genome
© Offmann et al; licensee BioMed Central Ltd. 2010
Received: 3 June 2009
Accepted: 12 April 2010
Published: 12 April 2010
Protein domains are the fundamental units of protein structure, function and evolution, and delineating them is important for protein classification. The delineation of domains within a polypeptide chain, particularly at the genome scale, can be achieved in several ways but remains problematic in many instances. Difficulties in identifying the domain content of a given sequence arise when the query sequence has no homologues of experimentally determined structure and searches against sequence domain databases return only insignificant matches. Identifying domains under conditions of low sequence identity and without structural homologues is therefore of crucial importance, especially at the genomic scale.
We have developed a new method for identifying domains in unassigned regions through indirect connections and scaled up its application to the analysis of 434 unassigned regions in 726 protein sequences of the Mycoplasma gallisepticum genome. Through intermediate sequences in the unassigned regions, we could establish 71 new domain relationships and 63 putative new domain families. Importantly, this represents an overall 10% increase in PfamA domain annotation over direct assignment in this genome.
The systematic analysis of the unassigned regions in the Mycoplasma gallisepticum genome has provided some insight into the possible new domain relationships and putative new domain families. Further investigation of these predicted new domains may prove beneficial in improving the existing domain prediction algorithms.
Domain assignment to protein sequences is of paramount importance in the post-genomic era. Protein domains are the structural, functional and evolutionary units of proteins. Study of proteins at the domain level has had a profound impact on the study of individual proteins. Experimental and/or computational methods can be used to identify domains in a given protein sequence. Classification databases such as the Dali Domain Dictionary, CATH, SCOP and DIAL employ structural data to locate and assign domains. Identification of domains at the sequence level depends on the detection of global and local sequence similarities between a given query sequence and domain sequences found in databases such as Pfam. Due to high evolutionary divergence, it is not always possible to identify distantly related protein domains by sequence search techniques. The realization of additional domains in those circumstances can be tedious, involving manual intervention, but can lead to a better understanding of overall biological function. We have recently introduced an automatic multi-step approach, PURE, for recognizing such connections [6, 7].
Mycoplasma gallisepticum causes chronic respiratory disease in chickens and other avian species. The infections result in considerable economic losses in poultry production. This pathogen has a small genome encoding 726 proteins, but only 498 protein sequences have known Pfam hits, with 46% residue coverage. This gap in the annotation of the genome emphasizes the need to explore other methods for domain assignment from sequence. We have recently shown that it is possible to enhance the prediction of domains in unassigned regions by 25% through indirect connections in proteins containing the class III adenylyl cyclase domain. Here, we demonstrate that this method can be scaled up to whole-genome analysis, taking the Mycoplasma gallisepticum genome as a specific example.
Results and Discussion
Newly predicted domains in the Mycoplasma gallisepticum genome.
ATPase family associated with various cellular activities
NP_853502.1 = 9-182
Anticodon-binding domain. This domain is found in valyl and leucyl tRNA synthetases. It binds to the anticodon of the tRNA.
NP_852939.1 = 397-590
NP_852935.1 = 55-218
NP_853215.1 = 620-850
ATP synthase alpha/beta chain, C terminal domain.
NP_853478.1 = 140-221
ATP synthase alpha/beta family, beta-barrel domain
NP_853438.1 = 4-126
NP_853439.1 = 2-125
Binding-protein-dependent transport system inner membrane component.
NP_853029.1 = 53-260
NP_853249.1 = 59-232
CHASE (Cyclases/Histidine kinases Associated Sensory Extracellular) domain, present in diverse receptor-like proteins with histidine kinase and nucleotide cyclase domains
NP_853387.1 = 55-582
Domain of Unknown Function 30
NP_853479.1 = 370-770
Domain of Unknown Function 31
NP_853440.1 = 220-317
NP_853441.1 = 233-337
NP_853488.1 = 220-340
LMP repeated region. Found in the LMP group of surface-located membrane proteins of Mycoplasma hominis.
NP_853333.1 = 1260-1320, 1420-1580, 1600-1760
Ferritin-like domain. Ferritin is one of the major non-haem iron storage proteins in animals, plants, and microorganisms
NP_852976.1 = 5-143
GMP synthase C terminal domain.
NP_852801.1 = 220-275
Helicase conserved C-terminal domain. Found in a wide variety of helicases and helicase related proteins.
NP_852813.1 = 440-530
NP_853467.1 = 660-730
Anticodon binding domain. tRNA synthetases, or tRNA ligases are involved in protein synthesis. This domain is found in histidyl, glycyl, threonyl and prolyl tRNA synthetases.
NP_852966.1 = 342-423
The helix-hairpin-helix DNA-binding motif is found to be duplicated in the central domain of RuvA.
NP_853482.1 = 589-619
NP_853386.1 = 63-92, 98-127
Helix-turn-helix domain present in a wide variety of proteins.
NP_853136.1 = 28-73
Ribonuclease R winged-helix domain. Found at the amino terminus of Ribonuclease R and a number of presumed transcriptional regulatory proteins from archaea.
NP_853240.1 = 38-89
The S1 domain occurs in a wide range of RNA associated proteins. It is structurally similar to cold shock protein which binds nucleic acids. The S1 domain has an OB-fold structure.
NP_852895.1 = 140-210
The K homology (KH) domain was first identified in the human heterogeneous nuclear ribonucleoprotein (hnRNP) K. It is a domain of around 70 amino acids that is present in a wide variety of quite diverse nucleic acid-binding proteins.
NP_852895.1 = 333-393
NP_852865.1 = 40-248
NP_852865.1 = 320-360
Lipoprotein associated domain.
NP_852799.1 = 49-160
Multi Antimicrobial Extrusion (MATE) family. Members function as drug/sodium antiporters.
NP_853011.1 = 364-530
NP_852906.1 = 7-185
Major Facilitator Superfamily
NP_852970.1 = 480-922
The NusB protein is involved in the regulation of rRNA biosynthesis by transcriptional antitermination.
NP_853291.1 = 13-130
Peptidase family M23
NP_853190.1 = 484-657
Phosphoglucomutase/phosphomannomutase, C-terminal domain
NP_853364.1 = 481-550
phosphotransferase system, EIIB
NP_853326.1 = 47-85
Bacterial extracellular solute-binding protein
NP_852821.1 = 1-385
NP_852814.1 = 6-181
Sigma-70 factor, region 1.1.
NP_853171.1 = 288-342
Sigma-70 factor, region 1.2
NP_853171.1 = 357-398
Sigma-70, region 4
NP_852863.1 = 120-170
ThrRS, GTPase, and SpoT domain.
NP_852968.1 = 417-487
thiouridine synthases, methylases and PSUSs domain.
NP_853282.1 = 78-170
The C-terminal domain of transketolase has been proposed as a regulatory molecule binding site
NP_852812.1 = 530-641
NP_853134.1 = 138-262
OB-fold nucleic acid binding domain
NP_852876.1 = 230-310
Virulence-associated protein D
NP_853458.1 = 7-49
Mycoplasma MG185/MG260 protein.
NP_852988.1 = 247-404
NP_852899.1 = 257-414
Putative mycoplasma lipoprotein, C-terminal region
NP_852988.1 = 444-563
Members of this family include the DEAD and DEAH box helicases. Helicases are involved in unwinding nucleic acids.
NP_852877.1 = 596-722
ABC transporter transmembrane region.
NP_852786.1 = 2-126
NP_853051.1 = 317-467
Domain of Unknown Function 258
NP_853404.1 = 7-104
NP_853200.1 = 68-151
Recombination protein O
NP_853174.1 = 1-74
NP_853298.1 = 461-889
Transposase, Mutator family
NP_852891.1 = 6-108
NP_853257.1 = 16-83
NP_852883.1 = 2-121
NP_853456.1 = 650-708
To validate the newly predicted domains, we generated multiple sequence alignments using the CLUSTALW program. The inputs for each multiple sequence alignment are the unassigned sequence and representative sequences of the newly assigned domain obtained from the Pfam database. In most cases, only a few family-specific signature residues are conserved, suggesting extreme levels of evolutionary divergence from classical members of such Pfam families.
The number of newly predicted domains was substantial, which raises an interesting question: why were these domains not annotated in the initial search? The likely reason is the poor sequence identity between query and hit. Although sequence-based remote-homology detection approaches such as Hidden Markov Models (HMMs) are powerful tools, they often face limitations due to poor sequence similarity and non-uniform dispersion of sequences in protein sequence space. Several approaches have been employed to detect remotely related proteins; one such approach is based on intermediate sequences. The intermediate sequence procedure substantially increases the ability to recognize distant evolutionary relationships.
There is a relatively large number (63) of unassigned regions that picked up at least five homologues but are not associated with any PfamA domain (Additional file 1: Supplemental Table S1). We examine below a few examples of these regions, which can be regarded as putative new domain families. A search against the PfamB profile HMMs of the Pfam 24 database showed that 20 unassigned regions, for which we predicted putative new domain families, were also associated with at least one PfamB domain (Additional file 2: Supplemental Table S2). However, about two-thirds (43 of 63 unassigned regions) have a hit in neither the PfamA nor the PfamB database. These 43 regions may indicate potential new domain families that are yet to be annotated in the Pfam database (Additional file 2: Supplemental Table S2).
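The idea of linking a query to a domain family through intermediate sequences can be illustrated with a minimal sketch. It assumes two hypothetical lookup tables, one listing the homologues found for each query and one listing the Pfam families found directly on each homologue; the function name, the `min_support` parameter and the example identifiers are illustrative assumptions and are not taken from the PURE implementation.

```python
from collections import Counter
from typing import Dict, List


def indirect_domains(
    homologues_of: Dict[str, List[str]],
    pfam_hits_of: Dict[str, List[str]],
    min_support: int = 1,
) -> Dict[str, Dict[str, int]]:
    """For each query, count how many homologues carry each Pfam family.

    A family is reported for a query only if the query has no direct hit
    and at least `min_support` intermediate sequences carry that family.
    """
    result = {}
    for query, homologues in homologues_of.items():
        if pfam_hits_of.get(query):
            continue  # directly annotated, nothing to infer indirectly
        counts = Counter()
        for hom in homologues:
            # Count each family once per intermediate sequence.
            for family in set(pfam_hits_of.get(hom, [])):
                counts[family] += 1
        result[query] = {f: n for f, n in counts.items() if n >= min_support}
    return result


# Hypothetical example: the query has no direct hit, two of its three
# homologues match Helicase_C and one also matches DEAD.
homologues_of = {"query_region": ["hom1", "hom2", "hom3"]}
pfam_hits_of = {"hom1": ["Helicase_C"], "hom2": ["Helicase_C", "DEAD"], "hom3": []}
print(indirect_domains(homologues_of, pfam_hits_of))
```

Raising `min_support` is one simple way to require that several independent intermediates agree before a family is proposed for the query.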
Comparison of PURE predicted domains with CDD predicted domains
PURE: Anticodon_1 (55-218); CDD: MetG (10-199)
PURE: LMP (1260-1320, 1420-1580, 1600-1760); CDD: SbcC (1253-1856)
PURE: Helicase_C (660-730); CDD: Type I site-specific restriction-modification system (2-1018)
PURE: Lactamase_B (40-248), RMMBL (320-360); CDD: mRNA degradation ribonucleases (23-594)
PURE: MatE (364-530); CDD: NorM (127-558)
PURE: MFS_1 (480-992); CDD: SecD (359-574)
PURE: THUMP (78-170); CDD: PseudoU_synth (112-167)
PURE: DUF_258 (7-104); CDD: YlqF (12-174)
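One way to read the comparison above is to ask how much of each PURE region falls inside the corresponding CDD region. The following is a minimal sketch of that calculation, assuming 1-based inclusive residue ranges; the function names are illustrative, and the overlap measure is our assumption rather than a statistic reported in this study.

```python
from typing import Tuple

Region = Tuple[int, int]  # 1-based, inclusive residue range


def overlap_length(a: Region, b: Region) -> int:
    """Number of residues shared by two regions (0 if they are disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def overlap_fraction(pure: Region, cdd: Region) -> float:
    """Fraction of the PURE region that lies inside the CDD region."""
    pure_length = pure[1] - pure[0] + 1
    return overlap_length(pure, cdd) / pure_length


# Region pairs taken from the comparison above.
pairs = {
    "Anticodon_1 vs MetG": ((55, 218), (10, 199)),
    "THUMP vs PseudoU_synth": ((78, 170), (112, 167)),
    "DUF_258 vs YlqF": ((7, 104), (12, 174)),
}
for name, (pure, cdd) in pairs.items():
    print(name, overlap_length(pure, cdd), round(overlap_fraction(pure, cdd), 2))
```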
Sequence and domain databases are updated constantly, so it is inevitable that some of our findings may appear obsolete as robust databases such as Pfam and CDD are revised. Where our predictions concur with a newer version of a database, they serve to validate the approach. Where the PURE approach still yields new information, this suggests the value and novelty of the protocol in realizing additional domains early. A substantial number of such new entries is encouraging for the development of the approach and suggests that it has promise for discovering hitherto unidentified domains.
Here, we present the results of applying a new method for domain identification to the full genome of the avian pathogen Mycoplasma gallisepticum. In spite of filters such as evolutionary conservation and high predicted structural content, about 20% of the orphan proteins in this genome could be annotated with a known functional domain using our approach. Interestingly, our analysis revealed several meaningful alignments that could correspond to an as yet functionally unidentified set of domains. These could be very useful as a starting subset for further functional screening in wet-lab experiments. Several improvements to the methodology will be addressed in the future. Furthermore, cross-genome comparisons of the results of our procedure between Mycoplasma gallisepticum and other Mycoplasma species are currently being investigated.
Complete protein sequences of Mycoplasma gallisepticum (strain R) were obtained from the National Center for Biotechnology Information (NCBI) website.
Extraction of Unassigned regions
Mycoplasma gallisepticum protein sequences were scanned against a dataset of Hidden Markov Models (HMMs) obtained from the PfamA database (Pfam version 21.0), which consists of 8957 families, employing HMMpfam of the HMMER suite with the E-value threshold set to 0.1. Sequences or sequence regions not associated with any domain in this search were considered unassigned regions.
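The extraction step amounts to taking the complement of the domain hits along each protein. A minimal sketch is given below, assuming that hits are available as 1-based inclusive residue ranges already parsed from the HMM search output; the function name and the example coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive residue range


def unassigned_regions(seq_len: int, domain_hits: List[Interval]) -> List[Interval]:
    """Return the stretches of a protein not covered by any domain hit."""
    gaps = []
    next_free = 1  # first residue not yet covered by a hit
    for start, end in sorted(domain_hits):
        if start > next_free:
            gaps.append((next_free, start - 1))
        # Overlapping or nested hits must not move the pointer backwards.
        next_free = max(next_free, end + 1)
    if next_free <= seq_len:
        gaps.append((next_free, seq_len))
    return gaps


# Hypothetical 500-residue protein with two overlapping domain hits.
print(unassigned_regions(500, [(120, 260), (240, 310)]))
```

The regions returned here would still have to pass the filtering criteria before being used as search queries.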
Filtering of Unassigned regions
To avoid false positives in the PSI-BLAST search, we considered only unassigned sequences at least 70 residues long.
Transmembrane regions were excluded from the unassigned sequences using HMMTOP, and coiled-coil regions using COILS. These steps were carried out to avoid non-specific hits in the PSI-BLAST search.
A standalone version of the protein secondary-structure prediction program PSIPRED was used to predict the α-helical, β-strand and coil (loop) content of the unassigned regions. We employed 15% predicted secondary structural content as the minimum value, consistent with earlier work.
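The length and secondary-structure criteria can be sketched as a simple predicate. The sketch assumes that the prediction for a region is available as a string of per-residue states (H for helix, E for strand, C for coil) and interprets "secondary structural content" as the fraction of residues predicted as helix or strand; both the input format and this interpretation are assumptions, and the function names are illustrative. Masking of transmembrane and coiled-coil segments is not shown.

```python
def secondary_structure_fraction(ss_states: str) -> float:
    """Fraction of residues predicted as helix (H) or strand (E)."""
    if not ss_states:
        return 0.0
    structured = sum(1 for state in ss_states if state in "HE")
    return structured / len(ss_states)


def passes_filters(region_seq: str, ss_states: str,
                   min_length: int = 70, min_ss_fraction: float = 0.15) -> bool:
    """Apply the minimum-length and minimum-structure criteria to one region."""
    long_enough = len(region_seq) >= min_length
    structured_enough = secondary_structure_fraction(ss_states) >= min_ss_fraction
    return long_enough and structured_enough


# Hypothetical 80-residue region with 20 residues predicted as helix.
print(passes_filters("A" * 80, "H" * 20 + "C" * 60))
```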
The unassigned sequences that fulfilled the filtering criteria were used to query the non-redundant sequence database employing PSI-BLAST, with the low-complexity filter turned on (E-value cut-off 0.001), to obtain homologues.
Only the regions of homologues that aligned well with the query sequence were taken from the PSI-BLAST output and scanned against a dataset of Hidden Markov Models (HMMs) obtained from the PfamA database (Pfam version 21.0), which consists of 8957 families, employing HMMpfam of the HMMER suite.
The indirect association of a query sequence with HMMs in the Pfam database through homologous sequences gave rise to the definitions of fully and partially associated domains. At least 75% of the HMM had to align indirectly with the query for a domain to be considered fully associated; the rest were considered partially associated domains.
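The distinction between full and partial association reduces to a coverage test on the HMM. A minimal sketch follows, assuming that the first and last HMM positions matched through the homologue are known along with the model length; the function name and the example values are hypothetical, and measuring coverage as the span between those two positions is a simplifying assumption.

```python
def classify_association(hmm_start: int, hmm_end: int, hmm_length: int,
                         threshold: float = 0.75) -> str:
    """Label a domain association as 'full' or 'partial' by HMM coverage."""
    covered = hmm_end - hmm_start + 1  # HMM positions are 1-based, inclusive
    coverage = covered / hmm_length
    return "full" if coverage >= threshold else "partial"


# Hypothetical 200-state HMM matched over positions 1-150 and 30-120.
print(classify_association(1, 150, 200))
print(classify_association(30, 120, 200))
```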
Multiple sequence alignments
Multiple sequence alignments of the query unassigned region and the seed sequences of the predicted domains, obtained from Pfam, were performed using the CLUSTALW program. Multiple sequence alignments of the query unassigned region and its PSI-BLAST hits were also performed for unassigned regions where hits were obtained in the PSI-BLAST search but not in the HMMpfam search. When necessary, alignments were optimized by manual editing. Phylogenetic trees were calculated using the Neighbor-Joining (NJ) method and plotted using the MEGA package.
CSRC is supported by a PhD grant from Conseil Regional de La Reunion. This work is supported in part by PPF FRROI (Bioinflam) from the University of La Reunion and the French Ministry of Research. BO is thankful to Conseil Regional de La Reunion for financial support. RS thanks the University of La Reunion for the visiting professorship position and NCBS for financial and infrastructural support.
- Dietmann S, Holm L: Identification of homology in protein structure classification. Nat Struct Biol. 2001, 8 (11): 953-957. 10.1038/nsb1101-953.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure. 1997, 5 (8): 1093-1108. 10.1016/S0969-2126(97)00260-8.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247 (4): 536-540.
- Sowdhamini R, Blundell TL: An automatic method involving cluster analysis of secondary structures for the identification of domains in proteins. Protein Sci. 1995, 4 (3): 506-520.
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R: Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34 (Database issue): D247-251. 10.1093/nar/gkj149.
- Reddy CC, Shameer K, Offmann BO, Sowdhamini R: PURE: a webserver for the prediction of domains in unassigned regions in proteins. BMC Bioinformatics. 2008, 9: 281. 10.1186/1471-2105-9-281.
- Reddy CC, Shameer K, Offmann BO, Sowdhamini R: PURE: A web server for querying the relationship between Pre-existing domains and Unassigned Regions in proteins. 2007, [http://www.natureprotocols.com/2007/11/01/pure_a_web_server_for_querying.php]
- Papazisi L, Gorton TS, Kutish G, Markham PF, Browning GF, Nguyen DK, Swartzell S, Madan A, Mahairas G, Geary SJ: The complete genome sequence of the avian pathogen Mycoplasma gallisepticum strain R(low). Microbiology. 2003, 149 (Pt 9): 2307-2316. 10.1099/mic.0.26427-0.
- Pfam Genome Distribution website. [http://www.sanger.ac.uk/]
- Reddy CS, Manonmani A, Babu M, Sowdhamini R: Enhanced structure prediction of gene products containing class III adenylyl cyclase domains. In Silico Biol. 2006, 6 (5): 351-362.
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.
- Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R: Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 1998, 26 (1): 320-322. 10.1093/nar/26.1.320.
- Rost B, Sander C, Schneider R: Redefining the goals of protein secondary structure prediction. J Mol Biol. 1994, 235 (1): 13-26. 10.1016/S0022-2836(05)80007-5.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
- Park J, Teichmann SA, Hubbard T, Chothia C: Intermediate sequences increase the detection of homology between sequences. J Mol Biol. 1997, 273 (1): 349-354. 10.1006/jmbi.1997.1288.
- Nakatsu T, Kato H, Oda J: Crystal structure of asparagine synthetase reveals a close evolutionary relationship to class II aminoacyl-tRNA synthetase. Nat Struct Biol. 1998, 5 (1): 15-19. 10.1038/nsb0198-15.
- Kyrpides NC, Woese CR, Ouzounis CA: KOW: a novel motif linking a bacterial transcription factor with ribosomal proteins. Trends Biochem Sci. 1996, 21 (11): 425-426. 10.1016/S0968-0004(96)30036-4.
- Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 2002, 30 (1): 281-283. 10.1093/nar/30.1.281.
- National Center for Biotechnology Information web site. [http://www.ncbi.nlm.nih.gov/]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Tusnady GE, Simon I: The HMMTOP transmembrane topology prediction server. Bioinformatics. 2001, 17 (9): 849-850. 10.1093/bioinformatics/17.9.849.
- Lupas A, Van Dyke M, Stock J: Predicting coiled coils from protein sequences. Science. 1991, 252 (5010): 1162-1164. 10.1126/science.252.5009.1162.
- McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16 (4): 404-405. 10.1093/bioinformatics/16.4.404.
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006, 34 (Database issue): D173-180. 10.1093/nar/gkj158.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24 (8): 1596-1599. 10.1093/molbev/msm092.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.