CGUG: in silico proteome and genome parsing tool for the determination of "core" and unique genes in the analysis of genomes up to ca. 1.9 Mb
© Seto et al; licensee BioMed Central Ltd. 2009
Received: 9 March 2009
Accepted: 25 August 2009
Published: 25 August 2009
Viruses and small-genome bacteria (~2 megabases and smaller) comprise a considerable population in the biosphere and are of interest to many researchers. These genomes are now sequenced at an unprecedented rate and require complementary computational tools for their analysis. "CoreGenesUniqueGenes" (CGUG) is an in silico genome data mining tool that determines a "core" set of genes from two to five organisms with genomes in this size range. Core and unique genes may reflect similar niches and needs, and may be used in classifying organisms.
CGUG is available at http://binf.gmu.edu/geneorder.html as a web-based on-the-fly tool that performs iterative BLASTP analyses using a reference genome and up to four query genomes to provide a table of genes common to these genomes. The result is an in silico display of genomes and their proteomes, allowing for further analysis. CGUG can be used for "genome annotation by homology", as demonstrated with Chlamydophila and Francisella genomes.
CGUG is used to reanalyze the ICTV-based classifications of bacteriophages, to reconfirm long-standing relationships and to explore new classifications. These genomes have been problematic in the past, due largely to horizontal gene transfers. CGUG is validated as a tool for reannotating small genome bacteria using more up-to-date annotations by similarity or homology. These serve as an entry point for wet-bench experiments to confirm the functions of these "hypothetical" and "unknown" proteins.
There is a tremendous increase in the number of genomes deposited in databases, with the data stream already a "data tsunami". The universal adoption of the "Next Generation" DNA sequencing technologies will also allow a parallel, expedited sequencing of smaller, but important and relevant, genomes, such as those of viruses and of bacteria with genomes smaller than 2 Mb.
Software tools for taking advantage of these data need to be developed, and also maintained and upgraded with additional and more useful functions. In particular, readily available and "user-friendly" computational tools, preferably platform-independent, are especially needed, as many wet-bench researchers are interested in the informational content, the "biology," of the genomes rather than in their computational aspects.
CGUG is a modification and extension of a web-based tool, CoreGenes, which was limited to small genomes of up to ca. 350 kb, such as those of viruses, chloroplasts and mitochondria. It now determines the "core" set of genes from a set of up to five bacteria with small genomes (~2 Mb). Its usefulness in the small genomes community has attracted researchers with diverse interests and needs. In response to some of these interests and needs, the tool has been upgraded with the input of wet-bench researchers.
While bacteria with larger genomes, ca. 4+ Mb, are of obvious importance, bacteria with genomes of smaller sizes are also of interest to the community; many of these are pathogens. Tools for data mining and analysis of the genomes and proteomes from these and other pathogens are important not only for understanding their basic biology, but also in the applications of these data for molecular surveillance and detection, including molecular diagnostics, as well as in drug design and discovery, including vaccine development.
For understanding the phylogeny of organisms, the determination of a set of common or "core" genes between a set of bacterial genomes provides insight into the particular and specific characteristics of those bacterial species and of their niches in the biosphere. Core genes are being used to reconstruct ancestral genomes, phylogenies and organism classifications, and should provide insight into the common requirements of living in similar niches. The core set of genes has been used to explore the concept of the "pan-genome" of a bacterial species or a group of bacteria. Essential genes comprising the minimal genome and the minimal life form, e.g., Mycoplasma genitalium, may be a subset of this core.
From a survey of the literature, there are relatively few tools for the determination of core genes from genomes. One example is CEGMA, which annotates core genes and is limited to the analysis of eukaryotic genomes. It is neither web-based nor functional across platforms, and must be downloaded and installed. Other tools have similar limitations, are confined to precomputed sets of genomes, or are no longer accessible or supported.
CGUG is a user-friendly "on-the-fly" web-based tool that determines, parses, analyzes and outputs a set of core genes from a set of two to five small bacterial genomes. As a validation of this tool, applications to Chlamydophila and Francisella genomes are presented. These include the reannotation of proteins, especially "hypothetical proteins", by comparing newly determined genomes with older, less well-annotated genomes; that is, by aligning and identifying similar and putatively similar proteins that were previously entered as "unknown" or "hypothetical". The current and future versions of this tool are available at http://binf.gmu.edu/geneorder.html.
In bacteriophage research, to complement the current classification criteria of the International Committee on Taxonomy of Viruses (ICTV) and to understand them better, a proteome tree analysis based on a BLASTP algorithm has been constructed earlier. CGUG provides another independent in silico proteome analysis approach that incorporates suggestions by several ICTV members working on bacteriophages. While these genomes contain horizontal transfers that have made understanding bacteriophage classification very difficult, a proteome-based approach can help to unravel and to understand their classifications.
The algorithm is based on the GeneOrder algorithm for determining gene order and synteny. GenBank accession numbers are entered to select the data files. The corresponding records are retrieved from GenBank, and an iterative protein similarity analysis is performed for each protein from the query genome against the reference genome protein database using BLASTP from WU-BLAST.
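The iterative comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CGUG implementation: the function names, the protein identifiers, the toy scores and the cutoff of 75 are all hypothetical, and a dictionary lookup stands in for the BLASTP score of each query protein against each reference protein.

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: score(query_protein, reference_protein) -> similarity score.
ScoreFn = Callable[[str, str], float]


def hits_in_genome(reference: List[str], proteins: List[str],
                   score: ScoreFn, threshold: float) -> Set[str]:
    """Reference proteins matched by at least one protein of one query genome."""
    return {
        ref for ref in reference
        if any(score(q, ref) >= threshold for q in proteins)
    }


def core_genes(reference: List[str], queries: Dict[str, List[str]],
               score: ScoreFn, threshold: float) -> Set[str]:
    """Reference proteins that have a match in every query genome."""
    core = set(reference)
    for proteins in queries.values():
        core &= hits_in_genome(reference, proteins, score, threshold)
    return core


def unique_genes(reference: List[str], queries: Dict[str, List[str]],
                 score: ScoreFn, threshold: float) -> Set[str]:
    """Reference proteins that have no match in any query genome."""
    unique = set(reference)
    for proteins in queries.values():
        unique -= hits_in_genome(reference, proteins, score, threshold)
    return unique


# Toy data (illustrative only): precomputed scores stand in for BLASTP results.
TOY_SCORES = {
    ("qA1", "r1"): 210.0,
    ("qA2", "r2"): 95.0,
    ("qB1", "r1"): 180.0,
    ("qB2", "r3"): 40.0,  # below the cutoff, so not counted as a match
}


def toy_score(query: str, ref: str) -> float:
    return TOY_SCORES.get((query, ref), 0.0)


REFERENCE = ["r1", "r2", "r3"]
QUERIES = {"genomeA": ["qA1", "qA2"], "genomeB": ["qB1", "qB2"]}
CUTOFF = 75.0  # assumed value for illustration

core = core_genes(REFERENCE, QUERIES, toy_score, CUTOFF)
unique = unique_genes(REFERENCE, QUERIES, toy_score, CUTOFF)
```

With these toy inputs, only r1 is matched in both query genomes, so it is the sole "core" entry, while r3 has no match above the cutoff in either genome and is reported as unique to the reference.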
Currently, CGUG is limited to the analysis of small bacterial genomes (up to 2 Mb) and to five genomes at a time. Both limitations are due to the computational power and allocated memory of our server, which frequently comes under heavy user load; we hope to migrate this tool to a more powerful server. In our test runs, 4 Mb genomes were processed successfully, with the caveat of a significantly longer processing time (> 1 hr; there is a queuing e-mail return option). Despite these limitations, CGUG is a valuable tool for biologists, as illustrated by its use in the classification of bacteriophages.
Chlamydophila analysis of core genes; annotation application
Chlamydophila (1 Mb "small" genomes) are interesting because some are responsible for causing diseases in humans and other mammals: C. pneumoniae is a respiratory pathogen that causes community-acquired pneumonia and bronchitis in humans; C. felis causes conjunctivitis and upper respiratory tract disease in cats; C. abortus causes abortions in ruminants such as sheep and goats; and C. caviae causes conjunctivitis in guinea pigs. Comparative genomics may provide insights into their biology as well as pathogenicity.
Accession numbers and sizes of five analyzed Chlamydophila genomes
Chlamydophila pneumoniae J138
Chlamydophila felis Fe/C-56
Chlamydophila abortus S26/3
Chlamydophila pneumoniae AR39
Chlamydophila caviae GPIC
Francisella tularensis SCHU S4
Francisella tularensis holarctica
Francisella tularensis mediasiatica
Genome annotation and methods for annotation have lagged behind DNA sequencing technology, due in part to how much of the biology and coding potential of organisms remains unknown. Genomes that have been sequenced more recently take full advantage of newly accumulated knowledge, and therefore are annotated more completely and, presumably, with less error. For the non-computational biologist who is interested in the biology of related organisms, inspection and alignments of genomes annotated in different time periods may be problematic. CGUG allows older genomes to be matched with related and recently sequenced genomes.
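As a rough illustration of this "annotation by homology" idea, the Python sketch below proposes a product name for each protein of an older genome that is labeled "hypothetical" or "unknown", taking the name from its matched protein in a newer genome. It is a sketch under stated assumptions rather than the CGUG code: the locus tags, product names and the precomputed best-hit table are hypothetical, and any suggestion produced this way remains putative until confirmed experimentally.

```python
from typing import Dict, Optional

# Product labels treated as uninformative (assumed list, for illustration).
PLACEHOLDERS = {"hypothetical protein", "unknown", "unknown protein"}


def is_placeholder(product: str) -> bool:
    """True if the product name carries no functional information."""
    return product.strip().lower() in PLACEHOLDERS


def suggest_annotations(older: Dict[str, str],
                        newer: Dict[str, str],
                        best_hit: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Map older-genome loci with placeholder names to putative product names.

    older / newer: locus tag -> annotated product name.
    best_hit: older locus tag -> matched newer locus tag (None if no match).
    """
    suggestions = {}
    for locus, product in older.items():
        if not is_placeholder(product):
            continue  # already informatively annotated
        hit = best_hit.get(locus)
        if hit is None:
            continue  # no similar protein found in the newer genome
        candidate = newer.get(hit, "")
        if candidate and not is_placeholder(candidate):
            suggestions[locus] = candidate
    return suggestions


# Toy example with hypothetical locus tags and product names.
OLDER = {
    "OLD_001": "hypothetical protein",
    "OLD_002": "DNA gyrase subunit A",
    "OLD_003": "unknown",
}
NEWER = {
    "NEW_101": "ABC transporter permease",
    "NEW_102": "DNA gyrase subunit A",
    "NEW_103": "hypothetical protein",
}
BEST_HIT = {"OLD_001": "NEW_101", "OLD_002": "NEW_102", "OLD_003": "NEW_103"}

putative = suggest_annotations(OLDER, NEWER, BEST_HIT)
```

In this toy case only OLD_001 receives a suggestion; OLD_003 matches a protein that is itself still "hypothetical", so no name is transferred.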
Application to the larger Francisella genomes
Bacteriophages have been intensely studied in the laboratory, and their classifications have been debated and defined under current ICTV criteria, which include physical, clinical, biochemical and molecular data. Recently, several bacteriophage researchers have undertaken a re-evaluation of the bacteriophages given the availability of genome data and the in silico proteome data. This data analysis included parsing the numbers of shared similar and orthologous proteins, using both CoreGenes and CoreExtractor.vbs. The majority of the accepted relationships and ICTV classifications have been reconfirmed for the Podoviridae, although several new insights appeared. For example, three established genera within the T7-related bacteriophages are reconfirmed, along with five putative novel genera. These proteome-inspired insights offer a refinement to the ICTV phage classification and provide a straightforward algorithm for the classification of new phages based on their genomes and proteomes. The entire set of bacteriophages is being re-examined, beginning with the Podoviridae, above, and continuing with the Myoviridae, with plans for the Siphoviridae and the rest.
Accession numbers and sizes of analyzed bacteriophage genomes
Enterobacteria phage α3
Enterobacteria phage G4
Enterobacteria phage ϕX174
Enterobacteria phage S13
Enterobacteria phage ϕK
Chlamydia phage 1
Chlamydia phage 2
Chlamydia pneumoniae phage CPAR39
Chlamydia phage ϕCPG1
Bdellovibrio phage ϕMH2K
Spiroplasma phage 4
Enterobacteria phage T7
Enterobacteria phage P22
Enterobacteria phage lambda
Percent identities and E-values between shared proteins of Chlamydia phage 1 and Chlamydia phage ϕCPG1
Chlamydia phage 1
Chlamydia phage ϕCPG1
capsid protein VP2-related protein
capsid protein VP3
Bdellovibrio phage ϕMH2K, which belongs to the Bdellomicrovirus genus, shares significantly less than 40% similar proteins with the phages of the Microvirus genus. Specifically, it shares no similar proteins with ϕX174, G4 and ϕK, and shares only one protein with α3 and S13. Bdellovibrio phage ϕMH2K also shares less than 40% similar proteins with a phage of the Spiromicrovirus genus, Spiroplasma phage 4. These results justify the current separation of Bdellovibrio phage ϕMH2K from the Microvirus and Spiromicrovirus genera. In contrast, Bdellovibrio phage ϕMH2K shares approximately 45% similar proteins with the phages of the Chlamydiamicrovirus genus. There are discussions on merging these two genera; these in silico proteome results from CGUG lend more support to this position.
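The percentage comparison used in this paragraph can be written out as a short Python sketch. It rests on assumptions that are not stated in the text: the percentage is taken relative to the size of one chosen proteome, a value equal to the 40% cutoff is counted as passing, and the protein counts in the example are invented for illustration rather than taken from the phage genomes discussed here.

```python
def percent_shared(shared_count: int, proteome_size: int) -> float:
    """Percentage of a proteome that has a similar protein in the other genome."""
    if proteome_size <= 0:
        raise ValueError("proteome_size must be positive")
    if not 0 <= shared_count <= proteome_size:
        raise ValueError("shared_count must lie between 0 and proteome_size")
    return 100.0 * shared_count / proteome_size


def meets_cutoff(shared_count: int, proteome_size: int,
                 cutoff: float = 40.0) -> bool:
    """True if the shared fraction reaches the cutoff (assumed inclusive)."""
    return percent_shared(shared_count, proteome_size) >= cutoff


# Invented counts, for illustration only.
low = percent_shared(1, 11)    # one shared protein out of eleven
high = percent_shared(5, 11)   # five shared proteins out of eleven
```

Here one shared protein out of eleven falls well below the 40% cutoff, whereas five out of eleven (about 45%) exceeds it; which proteome supplies the denominator is a design choice that should be reported alongside the percentage.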
Software development is an ongoing process, in terms of coding, hardware and research needs. CGUG is an example of this, being supported and updated in response to requests from researchers, e.g., the reanalysis of all bacteriophages, and maintained with regard to coding updates. A beta version (CGUG 3.1), at the same site, is an alternative and complementary upgrade that will continue to be improved. It provides a more robust user interface (UI) and aims to improve the user experience, including a time bar to monitor the run length. It provides for better batch analysis, recommended especially for long-running queries, such as those for the 2 Mb genomes, and in preparation for the much larger bacterial genomes in the future, ca. > 4 Mb. Algorithm enhancements are needed and planned, as the current implementation does not handle these long-running queries robustly. The feature list below summarizes current and anticipated work:
Improve user interface (UI)
◦ Show a dynamic status indicator of query progress
◦ Allow user to elect to receive results via email at any time
Review implementation of algorithm for performance
Add persistence (e.g., database) of queries and results by user
CGUG is an in silico genome and proteome data mining tool that is useful in the analysis of core genes from small-genome bacteria (~2 Mb), and in the putative assignments and suggestions of function for genes previously annotated as unknown or hypothetical, taking advantage of the new genomes and annotations as well as the growing databases for protein function assignment.
Another dimension of CGUG is realized in the reanalysis and verification of the current classifications of organisms, for example in the reanalysis of, and new insights into, the classification of bacteriophages.
Availability and requirements
Project name: CGUG
Operating system(s): Platform independent web-based
Programming language: Java, XML
Any restrictions to use by non-academics: License required for commercial usage
We acknowledge gratefully Drs. Andrew Kropinski and Rob Lavigne for suggestions of features, and for their collaboration and validation in applying CGUG to their studies of bacteriophages. We thank Chris Ryan for providing systems administration and server support and Jason Seto for providing support and a critical reading and editorial comments. We are grateful to the Apache Software Foundation (Tomcat), the Regents of the University of California (Ptolemy Plot) and Google (Google Web Toolkit) for allowing community access to their software as open resources.
- Zafar N, Mazumder R, Seto D: CoreGenes: a computational tool for identifying and cataloging "core" genes in a set of small genomes. BMC bioinformatics. 2002, 3: 12. 10.1186/1471-2105-3-12.
- Koonin EV: Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature reviews. 2003, 1 (2): 127-136. 10.1038/nrmicro751.
- Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS biology. 2003, 1 (1): E19. 10.1371/journal.pbio.0000019.
- Lavigne R, Seto D, Mahadevan P, Ackermann HW, Kropinski AM: Unifying classical and molecular taxonomic classification: analysis of the Podoviridae using BLASTP-based tools. Research in microbiology. 2008, 159 (5): 406-414. 10.1016/j.resmic.2008.03.005.
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (39): 13950-13955. 10.1073/pnas.0506758102.
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Current opinion in genetics & development. 2005, 15 (6): 589-594. 10.1016/j.gde.2005.09.006.
- Parra G, Bradnam K, Korf I: CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics (Oxford, England). 2007, 23 (9): 1061-1067. 10.1093/bioinformatics/btm071.
- Fane B: Microviridae. Virus taxonomy: classification and nomenclature of viruses: eighth report of the International Committee on Taxonomy of Viruses. Edited by: Fauquet C, Mayo MA, Maniloff J, Desselberger U, Ball LA. 2005, San Diego; London: Elsevier Academic Press, 288-299.
- Rohwer F, Edwards R: The Phage Proteomic Tree: a genome-based taxonomy for phage. Journal of bacteriology. 2002, 184 (16): 4529-4535. 10.1128/JB.184.16.4529-4535.2002.
- Mazumder R, Kolaskar A, Seto D: GeneOrder: comparing the order of genes in small genomes. Bioinformatics (Oxford, England). 2001, 17 (2): 162-166. 10.1093/bioinformatics/17.2.162.
- Hahn DL, Azenabor AA, Beatty WL, Byrne GI: Chlamydia pneumoniae as a respiratory pathogen. Front Biosci. 2002, 7: e66-76. 10.2741/hahn.
- Cai Y, Fukushi H, Koyasu S, Kuroda E, Yamaguchi T, Hirai K: An etiological investigation of domestic cats with conjunctivitis and upper respiratory tract disease in Japan. The Journal of veterinary medical science/the Japanese Society of Veterinary Science. 2002, 64 (3): 215-219.
- Szeredi L, Janosi S, Tenk M, Tekes L, Bozso M, Deim Z, Molnar T: Epidemiological and pathological study on the causes of abortion in sheep and goats in Hungary (1998–2005). Acta veterinaria Hungarica. 2006, 54 (4): 503-515. 10.1556/AVet.54.2006.4.8.
- Strik NI, Alleman AR, Wellehan JF: Conjunctival swab cytology from a guinea pig: it's elementary!. Veterinary clinical pathology/American Society for Veterinary Clinical Pathology. 2005, 34 (2): 169-171. 10.1111/j.1939-165X.2005.tb00034.x.
- Nigrovic LE, Wingerter SL: Tularemia. Infectious disease clinics of North America. 2008, 22 (3): 489-504. 10.1016/j.idc.2008.03.004.