Systematic search for putative new domain families in Mycoplasma gallisepticum genome

Background Protein domains are the fundamental units of protein structure, function and evolution. The delineation of different domains in proteins is important for classification, understanding of structure, function and evolution. The delineation of protein domains within a polypeptide chain, namely at the genome scale, can be achieved in several ways but may remain problematic in many instances. Difficulties in identifying the domain content of a given sequence arise when the query sequence has no homologues with experimentally determined structure and searching against sequence domain databases also results in insignificant matches. Identification of domains under low sequence identity conditions and lack of structural homologues acquire a crucial importance especially at the genomic scale. Findings We have developed a new method for the identification of domains in unassigned regions through indirect connections and scaled up its application to the analysis of 434 unassigned regions in 726 protein sequences of Mycoplasma gallisepticum genome. We could establish 71 new domain relationships and probable 63 putative new domain families through intermediate sequences in the unassigned regions, which importantly represent an overall 10% increase in PfamA domain annotation over the direct assignment in this genome. Conclusions The systematic analysis of the unassigned regions in the Mycoplasma gallisepticum genome has provided some insight into the possible new domain relationships and putative new domain families. Further investigation of these predicted new domains may prove beneficial in improving the existing domain prediction algorithms.


Background
Domain assignment to the protein sequences has paramount importance in the post genomic era. Protein domains are the structural, functional and evolutionary units of proteins. Study of proteins at the domain level has had a profound impact on the study of individual proteins. Experimental and/or computational methods can be used to identify domains in the given protein sequence. Classification databases such as the Dali Domain Dictionary [1], CATH [2], SCOP [3] and DIAL [4] employ structural data to locate and assign domains. Identification of domains at the sequence level depends on the detection of global and local sequence similarities between a given query sequence and domain sequences found in databases such as Pfam [5]. Due to high evolutionary divergence, it is not always possible to identify distantly related protein domains by sequence search techniques. The realization of additional domains in those circumstances can be tedious, involving manual intervention, but can lead to better understanding of overall biological function. We have recently introduced an automatic multi-step approach, PURE for recognizing such connections [6,7].
Mycoplasma gallisepticum causes chronic respiratory disease in chickens and other avian species. The infections result in considerable economic losses in poultry production. This pathogen has a small genome with 726 proteins [8], but only 498 protein sequences have known Pfam hits with 46% residue coverage [9]. The gap in the annotation of this genome emphasizes the need for further exploration for other methods for domain assignment from sequence. We have recently shown that it is possible to enhance prediction of domains in the unassigned regions by 25% through indirect connections in the class III adenylyl cyclase domain containing proteins [10]. Here, we demonstrate that this method can be scaled up for whole genome analysis, by taking Mycoplasma gallisepticum genome as a specific example.

Results and Discussion
The procedure that was followed in this study and overall results are described in Figure 1. Initially, there are 726 proteins in the Mycoplasma gallisepticum genome; domains were assigned to the genome with HMMpfam program of HMMER package [11] by search against Pfam 21 databases [5,12] with expectation value cut off of 0.01. Only PfamA families are used in HMMpfam search, since PfamB families were generated automatically and have no associated annotation, references and relatively low quality data than PfamA families [12]. The regions in protein sequences, which are associated with domains, are called 'assigned regions'; rest of the regions without PFAM domain assignment are called 'unassigned regions'. There were 620 unassigned regions in the 726 proteins. To avoid false positives in the similarity search, the unassigned regions with at least 70 residues long were considered. In our analysis, 434 unassigned regions have at least 70 residues long. Transmembrane, coiled-coils and low complexity regions were excluded and only 364 sequences passed through this filtering step. A further filter was placed to ensure adequate secondary structural content giving rise to 359 sequences. 15% cut-off on predicted structural content was chosen as the minimum value, consistent with our earlier work [13]. Only 230 out of 359 unassigned regions picked up at least two hits in PSI-BLAST search; full-length sequences for all the non-self hits were obtained and used as query for HMMpfam search to assign domains to the sequences. Only 62 of 230 unassigned regions were associated with PfamA domains through homologous sequences and remaining 168 unassigned sequences failed to pick up domain associations. These 62 unassigned sequences correspond to 71 newly Figure 1 Methodology of the approach and statistics. Initially, 726 protein sequences were considered from Mycoplasma gallisepticum genome, which have 620 unassigned regions of different lengths. 434 unassigned regions are at least 70 residues long. Out of 434, only 364 passed through transmembrane and coiled coil filtering and 359 sequences after secondary structure filtering. The remaining unassigned regions (359) sequences were subject to PSI-BLAST searches, but only 230 unassigned regions picked up at least two hits. We extracted full-length sequences for each hit in PSI-BLAST and used for HMMpfam search. Here again, only 62 unassigned regions were associated indirectly with preexisting domains which correspond to 48 different domain families.

S1
The S1 domain occurs in a wide range of RNA associated proteins. It is structurally similar to cold shock protein which binds nucleic acids. The S1 domain has an OB-fold structure. predicted domains out of which 58 were fully associated and 13 were partially associated. In the fully associated domains, at least 75% of Pfam domain should lie within the unassigned region, otherwise it is called as 'partially associated domain'. These 71 newly predicted domains belong to 48 unique domain families, but among the newly predicted 48 unique domain families, only 22 unique domains were initially not present in the Mycoplasma gallisepticum genome; for the remaining 26 domains, one or more copies already exist in the genome. Among the new predictions, 22 domains appear specific to M. gallisepticum genome, only 15 domains were present in other Mycoplasmataceae members and remaining 7 domains are unique to the entire Mycoplasmataceae family members (Table 1).
To validate the newly predicted domains, we generated multiple sequence alignments using CLUSTALW program [14]. Inputs for multiple sequence alignment are the unassigned sequence and the representative sequences of the newly assigned domain obtained from Pfam database. In most cases, only few family-specific signature residues are conserved, suggesting extreme levels of evolutionary divergence from classical members of such Pfam families.
The number of newly predicted domains was substantial; it raises an interesting question: why were these domains not annotated in the initial search? It is likely because of the poor sequences identities between query and hit. Though sequence analysis-based remote-homology detection approaches, such as Hidden Markov Models (HMMs), are powerful tools, these methods often face limitations due to poor sequence similarities and non-uniform sequence dispersion in protein sequence space. Several interesting approaches have been employed in different ways to detect remotely related proteins; one such approach is based on the intermediate sequences. Intermediate sequence procedure  substantially increases the ability to recognize the distant evolutionary relationships [15]. There is relatively large number (63) of unassigned regions, which has picked up at least five homologues but not associated with any PfamA domain (Additional file 1: Supplemental Table S1). We examined below few examples of these regions, which can be regarded as potentially putative new domain families. Search against PfamB profile HMM of Pfam24 database showed that 20 unassigned regions, where putative new domain families were predicted by us, were also associated with at least one PfamB domain (Additional file 2: Supplemental Table S2). However, about two-third (43 out of 63 unassigned regions) neither have a hit in the PfamA database nor in the PfamB database. These 43 regions may indicate potential new domain families, which are yet to be annotated in the Pfam database (Additional file 2: Supplemental Table S2).
The NP_853398.1 protein (Figure 2, Figure 3) sequence, which is 329 residues long, has a single asparagine synthetase (AsnA) domain from 5 to 241 residues and one C-terminus unassigned region from 242 to 329. When the unassigned region of this protein (242-329) was analyzed, based on the intermediate sequences using the methodology described above, 106 similar sequences were identified in the PSI-BLAST search and these hit sequences were from both prokaryotes and eukaryotes. In the HMMpfam search, however, it was not associated with any PfamA domain, rather it was associated with PfamB_3316 domain. This unassigned region has about 62% predicted secondary structural content with 5 helices and 3 βstrands. More interestingly, the predicted β-3α-β-α-βα secondary structure pattern is conserved in all the homologous sequences. All the homologous sequences have similar domain architecture. The crystal structure of E.coli asparagine synthetase also showed the presence of this small subdomain [16]. Aspartate-ammonia ligase (asparagine synthetase) catalyses the conversion of L-aspartate to L-asparagine in the presence of ATP and ammonia. AsnA structure revealed that AsnA structure is similar to that of the catalytic domain of yeast aspartyl-tRNA synthetase despite low sequence similarity. These enzymes have a common reaction mechanism that implies the formation of an aminoacyl-adenylate intermediate. The cluster of highly conserved residues (GGGIG) motif plays an important role in the formation of a cavity which can accommodate bound ATP in aspartyl-tRNA synthetase [16]. Since this motif is conserved in the newly predicted putative domain, it may play an important role in the ligand binding. In another example, NP_853462.1 protein has an unassigned region  in the N-terminal region followed by a C-terminal existing Peptidase_S8 domain. All the PSI-BLAST hits were from prokaryotes. The alignment ( Figure 4) and conservation of secondary structures in this region suggests the existence of functionally unidentified domain that could augment the basic peptidase activity. It is also interesting to note that all the homologous sequences have similar domain architectures.
In another protein NP_852844.1, we had analysed an unassigned region from residues 38 to 109 and the gene product already was associated with KOW domain at the N-terminus (from 5 to 37 residues). 109 homologues could be identified by PSI-BLAST and all homologues belong to prokaryotic organisms ( Figure 5). All the homologous sequences have similar predicted secondary structure content. Most of the homologues also have similar domain architecture with N-terminal KOW motif and C-terminus as an unassigned region. KOW motif is only about 35 residues long and links a bacterial transcription factor with ribosomal proteins [17]. The presence of conserved residues, with twice the size of KOW motif at the C-terminal region, suggests the functional role of C-domain in additionally stabilizing the oligomeric assemblies and thereby perhaps contributing to improved efficiency of protein expression.
The results presented in this paper, are based on Pfam21 database with 8957 protein domain families; updated version of Pfam database (Pfam23) has 10340 domain families. Comparison of 71 newly predicted domains in the 62 unassigned regions to the updated Pfam23 database revealed that 67 domains, out of 71, still remain unassigned in the Pfam23 database. Among four Pfam23 annotations, three are the same as our earlier prediction and one annotation differs from our prediction. Comparison of our results with conserved domain database (CDD v2.16) [18] revealed that 34 out of 71 unassigned regions remained unassigned in the CDD database also, whereas remaining 37 regions are annotated as domains. Out of these 37 annotated domains in the CDD database, 25 domains are the same as predicted by our method described above. Assignments which differ from the new Pfam database (Pfam23) and CDD (seeming false positives and false negatives) were analyzed further. 12 out of 71 PURE predicted domains differ when compared to CDD database ( Table 2); These differences could arise due to inherent differences in the methodologies of CDD and Pfam database construction. When compared with Pfam23 database, out of 71 PURE predictions, 67 still remain unassigned and three agreed up on the PURE predictions. But, there is one disagreement: 124 residue-long protein NP_853458.1 was full-length unassigned sequence and PURE method assigned 'VapD_N' domain with E value of 0.006. In the Pfam23 database, 'CRISPR_Cas2' domain is assigned to the sequence with E value of 3.4 e -47 . This   could be due to the short length of 'VapD_N' (40 residues long) domain alignment and the borderline Evalue (0.006 and a cut-off 0.01). Moreover, the 'CRISPR_Cas2' (PF09827) domain family is defined only in Pfam23, but not in Pfam21 database. The nature of sequence/domain searches is such that the databases are constantly going through updates and it is inevitable that our new findings might appear obsolete due to the constant updates of robust databases such as Pfam and CDD. Where there is concurrence with the newer version of a database, they serve to validate the approach. Where there is still new information obtained from PURE approach, this clearly suggests the value and novelty of the protocol due to the early realization of additional domains. When the newer entries are substantially high, this is very encouraging for the development of the approach suggesting that this has promise for discovering hitherto unidentified domains.

Conclusions
Here, we present the results of the application of a new method for domain identification to full genome of an avian pathogen Mycoplasma gallisepticum. In spite of filters, such as evolutionary conservation and high predicted structural content, about 20% of orphan proteins contained in this genome could be annotated with a known functional domain using our approach. Interestingly, our analysis revealed several meaningful alignments, which could relate to as yet functionally unidentified set of domains. This could be very useful as a starting subset for further functional screening in wet lab experiments. Several improvements of the methodology will be addressed in future. Furthermore, cross-genome comparisons of the results from our procedure between Mycoplasma gallisepticum and other Mycoplasma species are currently being investigated.

Methodology
Complete protein sequences of Mycoplasma gallisepticum (Strain R) were obtained from National centre for Biotechnology Information website [19].

Extraction of Unassigned regions
Mycoplasma gallisepticum protein sequences were scanned against a dataset of Hidden Markov Models (HMMs) obtained from the PfamA database (Pfam version 21.0) [5] which consists of 8957 families, employing the HMMpfam of the HMMER suite [11], with E-value threshold set to 0.1. Sequences or sequence regions, which are not associated with any domain in the above search, were considered as unassigned regions.

Filtering of Unassigned regions
The unassigned sequences thus obtained were subjected to different filtering steps.
a. To avoid false positives in the PSI-BLAST [20] search, we considered only unassigned sequences with at least 70 residues long. b. Transmembrane regions were excluded by using HMMTOP [21] and coiled-coil regions by using COILS [22] from the unassigned sequences. The above steps carried out to avoid non-specific hits in the PSI-BLAST [20] search. c. Standalone version of protein secondary-structure prediction program PSIPRED [23] was used to predict the α-helical, β-strand and coil (loop) content of different amino acids of unassigned regions. We employ 15% predicted secondary structural content as the minimum value, consistent with earlier work [13].

PSI-BLAST searches
The unassigned sequences, which have fulfilled the filtering criteria, were used to query non-redundant sequence database [24] employing PSI-BLAST [20], with low complexity filter turned on (E value cut off 0.001), to obtain homologues.

HMMpfam search
Only regions of homologues that aligned well with the query sequence were obtained from PSI-BLAST output to scan against a dataset of Hidden Markov Models (HMMs), which were obtained from the PfamA database (Pfam version 21.0) [5] which consists of 8957 families, employing HMMpfam of HMMER suite [11].
The indirect association of query sequence through homologous sequences with HMMs in the Pfam database gave rise to the definitions of full and partially associated domains. At least 75% of HMM should indirectly align with query to be considered as a fully associated domain and rest were considered as partially associated domains.

Multiple sequence alignments
Multiple sequence alignments of the query unassigned region and seed sequences of predicted domains which were obtained from Pfam [9] were performed using CLUS-TALW program [14]. Multiple sequence alignments of the query unassigned region and hits in the PSI-BLAST were also performed when unassigned regions, where hits were obtained by PSI-BLAST search, but not in HMMpfam search. When necessary, alignments were optimized by manual editing. Phylogenetic trees were calculated using Neighbor-Joining (NJ) method [25]. Phylogenetic tree was plotted using MEGA package [26].
Additional file 1: Supplemental Table ST1 Predicted putative new domain families. In our analysis, some of the unassigned regions shared homologues (at least 5 hits in PSI-BLAST search), but failed to associate with any Pfam domain in Hmmpfam search. We performed multiple sequence alignments to assess sequence conservation. 63 unassigned regions showed good conservation which may indicate putative new domain family. In the table, second column indicates the protein-id and third column (UR = Unassigned Region) indicates the start and end points of unassigned regions. Fourth column indicates the percentage secondary structural content in the unassigned region by using PSIPRED program. Fifth column indicates the total number of hits in the PSI-BLAST search and the sixth column indicates the number of PDB hits in PSI-BLAST search. Seventh column indicates the number of different species among the PSI-BLAST hits and eighth column indicates number of transmembrane helices which indicates whether protein is likely to be membrane-bound or not.
Additional file 2: Supplemental Table ST1 PfamB associations in the unassigned regions. 63 unassigned sequences were used to search against PfamB database (Pfam 24), only 20 unassigned sequences were associated with at least one PfamB domain. Unassigned sequence id along with start and end residues are showed in column two, PfamB start and end residue numbers are shown in column three, PfamB ids are shown in column four and the Expectation values with which PfamB domains associated are shown in the last column.