BamBam: genome sequence analysis tools for biologists
© Page et al.; licensee BioMed Central Ltd. 2014
Received: 28 August 2014
Accepted: 24 October 2014
Published: 24 November 2014
Massive computational power is needed to analyze the genomic data produced by next-generation sequencing, but extensive computational experience and specific knowledge of algorithms should not be necessary to run genomic analyses or interpret their results.
We present BamBam, a package of tools for genome sequence analysis. BamBam contains tools that facilitate summarizing data from BAM alignment files and identifying features such as SNPs, indels, and haplotypes represented in those alignments.
BamBam provides a powerful and convenient framework to analyze genome sequence data contained in BAM files.
KeywordsSAM BAM Haplotype SNP Indel Consensus Methylation Expression
Massive amounts of data are involved in genome sequence research, requiring researchers to use supercomputing clusters and complex algorithms to analyze their sequence data. Genomic analyses frequently include next-generation sequencing to produce millions of short reads, followed by aligning of reads to a reference genome sequence with software like GSNAP and Bowtie 2 [1, 2]. These programs generate SAM files, the accepted standard for storing short read alignment data, which are subsequently compressed to BAM format via SAMtools . The BAM files must then be analyzed and compared to produce meaningful results. Here we expand on the body of tools for analyzing and comparing BAM files.
The core independent tools of BamBam
Single nucleotide Polymorphisms
Call SNPs between two or more samples
Impute genotypes in output from InterSnp
Phase haplotypes with K-means
Copy number Variants
Identify deletions between two samples
Identify covered regions
Call copy number variants with HMM
Summarize base pair methylation in bisulfite-sequence data
Generate consensus sequences from one or more samples
Extract mapped and unmapped reads from BAM files
Summarize read coverage of sequences or regions
Extract subset of mapped reads
Categorize reads by genome based on similarity to parents
The BamBam package includes several independent programs, briefly described below. Brief tests were carried out to compare InterSnp, GapFall, and HapHunt with similar tools (Additional file 1). The latest version of PolyCat is also included . The README in the download package provides example commands for various common analyses, including phylogeny inference, molecular evolution estimation, methylation analysis, and differential expression analysis. A usage guide (see Additional file 2) provides a more detailed walkthrough of some workflows.
Single nucleotide polymorphisms
InterSnp calls SNPs between samples, represented by separate BAM files. InterSnp examines each position in the genome, assigning consensus alleles to each site for each sample. A SNP is called whenever two samples differ at the same position, producing a table with the genotypes of all samples at all polymorphic sites. The output is a table with the sequence name, position, and genotype for each sample at that site on each row, which can be readily processed by common command-line programs or scripts to calculate statistics or produce marker data for other programs.
Pebbles imputes genotypes using the K-nearest neighbor algorithm [5, 6]. For each unknown genotype, Pebbles finds the samples that are most similar at nearby loci. Then it assigns a genotype to the unknown locus based on the weighted contributions of those neighbors. Pebbles operates on InterSnp output—a table of genotypes—and produces a file of the same form.
HapHunt first selects K reads distant from one another to serve as haplotype seeds. It assigns each other read to the haplotype with the closest consensus sequence. Then it recalculates the consensus sequences based on the reads in each haplotype and repeats the process of assigning each read to the haplotype with the closest consensus sequence. It repeats this process a given number of times, calculating a score at the end of each round based on the difference of the smallest interhaplotype distance and greatest intrahaplotype distance. This score favors clusterings in which haplotypes are individually compact and most distinct from one another. This score can optionally be scaled by the average size ratio for each pair of haplotypes, favoring clusterings that are more evenly divided. The consensus sequences of the final haplotypes are printed as an aligned FASTA file for each sequence in the original reference.
Copy number variants
Gapfall identifies large deletions between samples based on read coverage. It searches the genome for extended regions that have high coverage in one sample but no coverage in the other. A large region with no coverage could indicate a physical deletion (for genomic samples) or a deactivated gene (for RNA-seq). These putative deletions are reported as an annotation file that can be visualized with a genome browser such as IGV .
Eflen identifies and extracts regions in a BAM file that are covered by at least a user-specified number of reads and outputs those regions as a GFF file. Provided with multiple BAMs, Eflen will identify regions that are covered in at least a user-specified fraction of those BAMs. This tool can be especially useful for analyzing GBS or RNA-seq data.
HMMph identifies CNVs between samples based on read coverage. BAM files must be provided for a control and for the sample of interest. The coverage ratio between those two BAM files is normalized by the total read coverage. Then the copy number of each locus in a sliding window is modeled based on a Poisson distribution in an untrained Hidden Markov Model [13, 14].
It is often useful to be able to compute on specific genomic intervals, such as genes. GeneVisitor provides a quick and easy way to do this, using an annotation file (GFF or BED format) to call a function on each indicated region of the genome. This class can be used by C++ programmers to run custom functions. In addition, pre-built tools utilize GeneVisitor without the need for programming.
Bam2Consensus converts one or more BAM files into a series of FASTA-formatted consensus sequences. If desired, multiple sequences—essentially unphased haplotypes—can be produced per BAM file, facilitating analyses of heterozygosity, nucleotide diversity, and molecular evolution. Suppose you have several BAM files representing different accessions of a species, all mapped to a common genome reference sequence. With a single command, Bam2Consensus can produce an aligned FASTA file for each gene, each containing the consensus sequences for each accession.
Bam2Fastq extracts mapped or unmapped reads from a BAM file, or from select regions of the BAM file.
Counter summarizes the number of reads mapped to each annotated region in one or more BAM files. RPKM (Reads Per Kilobase per Million mapped reads) normalization can be applied if desired. The output of Counter is a table of features and read counts, ready to be imported into EdgeR for differential expression analysis .
SubBam extracts a subset of a BAM file. It can optionally modify the BAM file, changing the coordinates of mapped reads to match a new reference that is a subset of the original reference. Suppose you have WGS reads mapped to a reference sequence and are interested in several loci. SubBam can produce BAMs that only contain the loci of interest, with a coordinate system corresponding to the position in the locus, rather than in the genome as a whole.
In addition to the core tools mentioned above, BamBam includes many Perl scripts, many of which use BioPerl modules . Script functions include calculation of nucleotide diversity (π) and molecular evolution rates (Ka and Ks), paralog identification, differential expression with EdgeR , summarization of results from MetHead, and summarization of genotype tables produced by InterSnp and Pebbles.
The BamBam tools form a simple interface between the researching biologist and the wealth of data contained in next-generation sequence alignments. They provide a means to efficiently identify interesting genomic features and summarize data, facilitating many next-generation sequence analysis experiments.
Availability and requirements
Project Name: BamBam
Project Home Page: http://sourceforge.net/projects/bambam/
Operating System: Unix
Dependencies: SamTools, BamTools, BioPerl
Programming Language: C++ and Perl
JP has a B.S. in Computer Science and is currently a graduate student in Biology, focusing on developing tools for polyploid genome analysis. ZL is an undergraduate student in the Udall lab. MH is a graduate student in the Udall lab. JU is an Associate professor at Brigham Young University and is the academic advisor of JP, ZL, and MH.
The authors would like to thank NSF Plant Genome Research Program, Cotton, Inc., and BYU Mentoring Grants for funding. They would also like to thank the Fulton Supercomputing Lab for their invaluable computational resources and technical support.
- Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010, 26: 873-881. 10.1093/bioinformatics/btq057.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with bowtie 2. Nat Meth. 2012, 9: 357-359. 10.1038/nmeth.1923.View ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J: The sequence alignment/map format and SAMtools. 2009, Oxford, UK: BioinformaticsGoogle Scholar
- Page JT, Gingle AR, Udall JA: PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms. G3 (Bethesda). 2013, 3: 517-525. 2013.View ArticleGoogle Scholar
- Rutkoski JE, Poland J, Jannink J-L, Sorrells ME: Imputation of unordered markers and the impact on genomic selection accuracy. G3 (Bethesda). 2013, 3: 427-439. 2013.View ArticleGoogle Scholar
- Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics. 2001, 17: 520-525. 10.1093/bioinformatics/17.6.520.PubMedView ArticleGoogle Scholar
- Bansal V, Bafna V: HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. 2008Google Scholar
- Browning BL, Browning SR: A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009, 84: 210-223. 10.1016/j.ajhg.2009.01.005.PubMedPubMed CentralView ArticleGoogle Scholar
- He D, Eskin E: Hap-seqX: expedite algorithm for haplotype phasing with imputation using sequence data. Gene. 2012, 518: 2-6.PubMedView ArticleGoogle Scholar
- Lloyd S: Least squares quantization in PCM. IEEE Trans Inf Theory. 1982, 28: 129-137. 10.1109/TIT.1982.1056489.View ArticleGoogle Scholar
- Ding C, He X: K-Means Clustering via Principal Component Analysis. 2004, New York, NY, USA: ACM Press, 29-Google Scholar
- Thorvaldsdóttir H, Robinson JT, Mesirov JP: Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013, 14: 178-192. 10.1093/bib/bbs017.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao M, Wang Q, Wang Q, Jia P, Zhao Z: Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013, 14: S1-View ArticleGoogle Scholar
- Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989, 77: 1-30.View ArticleGoogle Scholar
- Robinson MD, Robinson MD, McCarthy DJ, McCarthy DJ, Smyth GK: edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009, 26: 139-140.PubMedPubMed CentralView ArticleGoogle Scholar
- Stajich JE, Block D, Boulez K, Brenner SE: The bioperl toolkit: Perl modules for the life sciences. Genome. 2002, 12: 1611-1618. 10.1101/gr.361602.View ArticleGoogle Scholar
- Barnett DW, Garrison EK, Quinlan AR: BamTools: A C++ API and Toolkit for Analyzing and Managing BAM Files. 2011, Oxford, UK: BioinformaticsGoogle Scholar
- Drummond AJ, Ashton B, Buxton S, Cheung M: Drummond: Geneious v5. 4 - Google Scholar. 2011, Aukland, New Zealand: Biomatters LtdGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.