BMX: a tool for computing bacterial phyletic composition from orthologous maps
© Kulohoma; licensee BioMed Central. 2015
Received: 29 August 2014
Accepted: 13 February 2015
Published: 24 February 2015
New sequencing technologies have made it possible to explore genetic diversity at higher resolution in microbial populations. However, our understanding evolutionary relationships, and comparison of closely and distantly related bacterial genomes from these massive datasets remains a formidable challenge. Numerous clustering algorithms that group genomic data based on homology have been developed, but new tools are still required to analyse the resultant orthologous maps to understand functional genetic similarities and their phyletic patterns (patterns of presence of absence of genes).
Bacterial Makeup eXplorer (BMX) implements an algorithm that swiftly and efficiently facilitates the determination of the number of orthologs in prokaryotic genomes employing a reference free approach, which may be further exploited to transfer of gene annotations. BMX is able to integrate orthologous maps of highly diverse prokaryotic genomes therefore making it possible to perform robust and scalable, multi-platform, high quality annotation transfer and gene-by-gene composition assessment method. In addition results are presented in the form of publication quality figures.
BMX allows extensive data analysis of orthologous map databases to understand underlying biological relationships. Furthermore, BMX is portable across different platforms and can be installed easily. In summary, BMX allows higher resolution analysis of genomes from diverse bacterial populations
KeywordsOrthologous maps Core genome Prokaryotes
The concept of orthology provides an important basis for studying mechanisms of bacterial genome evolution, functional genetics, and biological networks. Orthologs are genes descended from a common ancestor as a result of speciation, which are likely to retain the same function in the course of evolution [1,2]. They contrast with paralogs, which may retain the same function or evolve new functions that are related to or different from the original function due to lack of the original selective pressure on one or more copies of the duplicated genes [1-3]. However, it is important to note that function retention among orthologs has some notable exceptions . Determination of orthology between bacterial genomes provides a detailed understanding of bacterial population genetics and evolutionary biology [5,6]. This has greatly been enhanced by recent rapid and cost-effective advances in sequencing technologies, which have provided highly accurate and reproducible large-scale datasets required to unravel the broad spectrum of genetic diversity from different microbes . These datasets have allowed higher resolution analysis of large bacterial populations, and significantly improved our understanding of diversity between and across species.
Numerous computational methods that accurately and efficiently infer orthology from sequenced genomes have been developed [8,9]. However, effective algorithms are still required to gain functional insight from the extensive generated homologous maps . Recently, more emphasis has been redirected towards identification of orthologs, which are crucial elements of comparative genomics applications [6,11,12]. Ortholog annotation allows easy comparison of genomes and can highlight key conserved antigens and virulence complements found in the genomes of clinically important bacterial strains under study, which could be exploited as novel potential targets for therapeutic approaches [13-15].
We describe Bacterial Makeup eXplorer (BMX), a highly efficient, stand alone tool that can be used to easily and effectively compute the size and composition of the genes common to all strain genomes, within each gene cluster under study (referred to henceforth as the core genome) from OrthoMCL  generated orthologous gene maps. Most genome alignment approaches require high quality reference sequences and are problematic when comparing distantly related species or species with high rates of recombination . BMX is a robust reference-free highly scalable approach capable of integrating datasets containing highly divergent genomes. Moreover, BMX can also be exploited as an annotation transfer tool by including a known well-annotated reference genome in the dataset, thereby allowing inference of a gene cluster function. Its features include: data reorganization, genome clusters quality assessment, iterative random genome sampling, and core and accessory genome distinction, and size and composition computation; generating accurate molecular barcodes from whole genome sequence data.
The number of random sampling events when determining the core genome size given N number of genomes, S N , is established using the least number of genomes under consideration, which is set to 1 genome by default, a 1 , the total number of genomes under consideration (i.e. the dataset size), N, and the common difference of successive genomes, d, (i.e. 1) used during sampling. During random sampling, the total number of genomes, N, is initialized as 1 genome for the first event, and increased by one unit for each of the subsequent events, until it is equivalent to the total number of genomes in the dataset in the final event. The random genomes are only sampled once during each event and the total number of orthologs shared by all genomes counted once for each cluster, thereby excluding paralogous counts. The default setting uses 100 iterations resulting in 100 different input orders per event, and the more advanced users may alter this value depending on their preference, in the BMX script L.pl. However, it’s important to note previous tests show that 100 iterations give the optimal results. The average core genome size per event is computed enabling correlation with the number of genomes sampled. Poor quality genomes can be identified for exclusion, by visual examination of a quality control diagram generated using R scripts. Core and accessory genome size and composition comparisons between different datasets can be achieved by this approach, thereby allowing annotations to be easily transferred to genes of draft genomes. A plot of the core genome size, which only consists of clusters that have a gene in every taxon, is generated using R scripts.
Results and discussion
BMX has already been used on highly divergent bacterial genomes of Streptococcus pneumoniae (pneumococcus) (Additional file 1: Table S1). Pneumococci are clinically important and have highly recombigenic genomes from multiple diverse lineages, which make it challenging to study their evolutionary history . BMX was used to assess the composition of 140 invasive disease pneumococcal genomes. Datasets of 16, 70 and 140 Streptococcus genome maps are included in the software download itself, and can be found at this url: http://sourceforge.net/projects/bmexplorer/files/latest/download. The estimated run times for analysis of these datasets using BMX are: 16 genomes in less than 3 mins, 70 genomes in 15–20 mins, and 140 genomes in 45 mins on a 2.7 GHz Intel Core i7 processor and 8Gb 1333 MHz DDR3 memory. Paralogs, core genes and other accessory genes were clearly distinguished, allowing the subsequent robust reconstruction of phylogenetic trees. BMX can be exploited to identify relevant functional sets of core genes, which encode important virulence complements always present in the genome that can be exploited as therapeutic and diagnostic targets.
BMX is a scalable, reference-free framework for computing gene-by-gene core and accessory genome composition and making comparisons between datasets. We believe that BMX will facilitate hypothesis generation and design of new experiments that explore the genomic diversity of bacteria, and its use can be extended to other prokaryotic organisms.
Availability and requirements
Project name: BMX.
Project home page: http://sourceforge.net/projects/bmexplorer/
Operating system(s): Multiple platforms.
Programming language: Perl, BioPERL and R.
Other requirements: The following modules must be installed in perl: Bio::Perl, IO::String, Bio::SeqIO, List::Util, List::Util ‘max’, Text::CSV, integer.
The following libraries/packages must be installed in R: RColorBrewer, gplots.
License: GNU GPL.
Any restrictions to use by non-academics: none.
Availability of supporting data
The data set of the orthologous map from 140 invasive disease pneumococcal genomes supporting the results of this article is available in the publically accessible source forge repository, http://sourceforge.net/projects/bmexplorer/.
The author thanks Dean Everett, Rob Heyderman, Aras Kadioglu, Martin Maiden and Chisomo Msefula for their constructive comments.
- Jensen RA. Orthologs and paralogs - we need to get it right. Genome Biol. 2001;2:INTERACTIONS1002.1–1002.3.View ArticleGoogle Scholar
- Fitch WM. Homology a personal view on some of the problems. Trends Genet. 2000;16:227–31.View ArticlePubMedGoogle Scholar
- Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970;19:99–113.View ArticlePubMedGoogle Scholar
- Studer RA, Robinson-Rechavi M. How confident can we be that orthologs are similar, but paralogs differ? Trends Genet. 2009;25:210–16.View ArticlePubMedGoogle Scholar
- Fraser C, Alm EJ, Polz MF, Spratt BG, Hanage WP. The bacterial species challenge: making sense of genetic and ecological diversity. Science. 2009;323:741–6.View ArticlePubMedGoogle Scholar
- Maiden MC, van Rensburg MJ, Bray JE, Earle SG, Ford SA, Jolley KA, et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nat Rev Microbiol. 2013;11:728–36.View ArticlePubMed CentralPubMedGoogle Scholar
- Parkhill J. What has high-throughput sequencing ever done for us? Nat Rev Microbiol. 2013;11:664–5.View ArticlePubMedGoogle Scholar
- Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Methods Mol Biol. 2012;855:259–79.View ArticlePubMedGoogle Scholar
- Kuzniar A, van Ham RC, Pongor S, Leunissen JA. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008;24:539–51.View ArticlePubMedGoogle Scholar
- Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, Angiuoli SV, et al. Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 2010;11:R107.View ArticlePubMed CentralPubMedGoogle Scholar
- Dessimoz C, Gabaldon T, Roos DS, Sonnhammer EL, Herrero J. Toward community standards in the quest for orthologs. Bioinformatics. 2012;28:900–4.View ArticlePubMed CentralPubMedGoogle Scholar
- Parsons M, Myler PJ, Berriman M, Roos DS, Stuart KD. Identity crisis? The need for systematic gene IDs. Trends Parasitol. 2011;27:183–4.View ArticlePubMed CentralPubMedGoogle Scholar
- Serruto D, Rappuoli R. Post-genomic vaccine development. FEBS Lett. 2006;580:2985–92.View ArticlePubMedGoogle Scholar
- Barocchi MA, Censini S, Rappuoli R. Vaccines in the era of genomics: the pneumococcal challenge. Vaccine. 2007;25:2963–73.View ArticlePubMedGoogle Scholar
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A. 2005;102:13950–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Li L, Stoeckert Jr CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–89.View ArticlePubMed CentralPubMedGoogle Scholar
- Sonnhammer EL, Gabaldon T, da Silva AW S, Martin M, Robinson-Rechavi M, Boeckmann B, et al. Big data and other challenges in the quest for orthologs. Bioinformatics. 2014;30:2993–8.View ArticlePubMed CentralPubMedGoogle Scholar
- Hanage WP, Fraser C, Tang J, Connor TR, Corander J. Hyper-recombination, diversity, and antibiotic resistance in pneumococcus. Science. 2009;324:1454–7.View ArticlePubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.