BarraCUDA - a fast short read sequence aligner using graphics processing units
- Petr Klus†1,
- Simon Lam†2,
- Dag Lyberg3,
- Ming Sin Cheung4,
- Graham Pullan5,
- Ian McFarlane2,
- Giles SH Yeo1 and
- Brian YH Lam1Email author
© Klus et al; licensee BioMed Central Ltd. 2012
Received: 16 November 2011
Accepted: 13 January 2012
Published: 13 January 2012
With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU), extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence.
Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computational-intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput.
BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology.
BarraCUDA is currently available from http://seqbarracuda.sf.net
Next-generation sequencing (NGS) is a technique based on sequencing by synthesis or sequencing by ligation in a massively parallel fashion and generates short sequencing reads ranging from 25 to 400 bp. The first commercially available next-generation sequencer, the Genome Sequencer 20 was released by 454 Life Sciences in 2005  with the hope of enabling the analysis of complete genomes within a short period of time . It produced a throughput of 20 megabases from a 5-hour run, which was 30 fold higher than traditional Sanger capillary electrophoresis systems. Over these years, the output of next-generation sequencers has increased 30, 000 fold to 600 gigabases from a single instrument run (Illumina Hiseq 2000) which is about 200 times the depth of coverage of an entire human genome.
The advancement of the technology has generated an enormous amount of sequence data . Sequence alignment is one of the first steps for downstream data analyses, during which sequencing reads have to be mapped either to other reads to form a genome (also known as de novo sequence assembly) [1, 4]; or on to a reference DNA sequence, usually a genome, for downstream applications such as single-nucleotide polymorphism (SNP) discovery , Chip-Seq  or RNA-Seq . Here we focus on the latter type of alignments.
A handful of software packages have been developed for computing alignments of sequencing reads generated by NGS instruments on to a reference sequence. Early generation aligners such as MAQ , RMAP [8, 9] and Soap  use hash-based algorithms to perform the mapping of sequencing reads on to reference genomes. Even though these tools can be accelerated by several seeding approaches, or parallelized by using multiple cores and cloud computing (e.g. CloudBurst ), the computational cost remains expensive. For example, by extrapolating the result from Schtaz et al. , it would take over 30, 000 CPU hours to align reads generated from a single HiSeq 2000 run.
Later in 2009, a new generation of sequence aligners were released, namely Soap2 , Bowtie  and BWA . These tools use a suffix tree-based algorithm (also known as FM-index) based on Burrows-Wheeler Transform (BWT). BWT is a block-sorting algorithm originally designed for lossless data compression , it can also be used for string matching by a backward search approach . The major advantages of this approach include a low time complexity of O(n) to find an exact match where n is the length of the query  and the performance is independent from the size of the reference sequence. In addition, a high compression ratio of 2.5 bit per base for the reference genome  also means a full human genome can fit into 1.3 GB of space. The new algorithm is a magnitude quicker and much more memory efficient than their hash-based predecessors . In an experiment performed by Li and Durbin  the alignment throughput for 12.2 million 51 bp reads being mapped to the Human genome went down from 94 CPU hours (MAQ) to 4 CPU hours (BWA) on a 2.5 GHz Intel Harpertown-based Xeon E5420 while retaining comparable accuracy in alignment mapping.
Modern graphics cards are designed primarily for rendering real time, high-definition complex 3D graphics features for various visual applications such as gaming and graphics design. Each graphics processing unit, or GPU, consists of many high performance stream processors capable of performing billions of independent calculations per second in order to meet the high visualization demand required for graphics applications. It is this processing capability that can be translated into a general-purpose computation capability equivalent to a small cluster of traditional CPUs. In addition, the lower energy profile and cost means the use of GPU to perform parallel computing tasks has become increasingly attractive. Many modern supercomputers including the Chinese Tianhe-1A, Nebulae and Japanese Tsubame 2.0 (http://www.top500.org/lists/2010/11) also contain multiple GPU nodes on top of traditional nodes with CPUs to take advantage to the parallel computing capability of GPUs.
MUMmerGPU [18, 19] is one of the first GPGPU-based DNA alignment software that utilizes the NVIDIA Compute Unified Device Architecture (CUDA) to perform variable-length maximal exact DNA alignments. Unlike other BWT aligners it uses a different suffix-tree approach, namely Ukkonen's algorithm  to find exact matches in the reference sequence of all the sub-strings in the query DNA sequence. The current version of MUMmer GPU out-performs its CPU counterpart by 13 fold [18, 19]. However, unlike other sequence aligners mentioned in the previous section, MUMmerGPU does not support inexact alignments by itself and has to be used in conjunction with the original MUMmer software package  to perform inexact alignments.
Here we introduce BarraCUDA, a program that takes advantage of GPGPU to perform inexact alignment of sequencing reads on to a reference sequence. BarraCUDA is built on the foundation of BWA and we have rewritten the BWT-based alignment core module to make use of the massive parallelism of GPGPU. It also employs the fast and memory efficient BWT-based algorithm employed in the original software and supports mismatches and full gapped alignment of sequencing reads.
BarraCUDA utilises NVIDIA's GPGPU implementation, namely Compute Unified Device Architecture (CUDA) to parallelise the alignment of sequence reads. Firstly, the program loads the complete BWT-encoded reference sequence and sequence reads from disk to GPU memory; This is followed by launching a GPU alignment kernel, where the alignment task of each of the sequence reads are distributed to hundreds of processors within the GPU and computations are performed in parallel; Once the kernel finishes, the alignment results are transferred from GPU back to disk. The following sections describe the details of each of the steps performed in BarraCUDA. (Please also refer to the Additional file 1 for the pseudo-code algorithm framework)
1. Transferring BWT-encoded reference sequence and sequence reads from disk to GPU
BarraCUDA first loads the full BWT suffix array from disk into cached texture memory in the GPU using a 1-dimensional uint4 array to ensure fast data access. Sequence reads are loaded into GPU memory in batches and packed in a single continuous block to minimise internal fragmentations, and the data is also bound to the texture cache to maximise the data throughput.
2. CUDA thread assignments
Mapping a sequence read to a reference sequence is a data independent process and does not require any information from any of the other reads, thus BarraCUDA employs a straightforward data parallelism by assigning an alignment kernel thread to each of the individual sequencing reads and launching the GPU kernel with tens of thousands of threads at the same time.
3. Inexact sequence alignment--a depth-first search GPU kernel
4. Multiple kernel design
5. Alignment data management
During runtime, the alignment data for each of the sequence reads including BWT suffix array (SA) coordinates and the number of differences is stored temporarily in GPU memory. Once the kernel finishes, the data are copied from the GPU back to the host and subsequently written onto disk storage in a binary file format. Similar to BWA, the SA coordinates can then be translated to linear space using BarraCUDA's 'samse' or 'sampe' cores for single-end or paired-end libraries respectively.
BarraCUDA shares a similar alignment accuracy as BWA
Alignment accuracy compared to BWA using simulated reads
Ungapped alignment has minimal effects on alignment accuracy
Gapped alignment is costly in terms of alignment throughput, due to the much larger search space O(9 n ) compared to O(4 n ) when gap opening is disabled (Figure 1a). In a separate experiment, we performed the alignment of the same set of data above with gap opening disabled (using option '-o 0') and found that the number of confident mappings and the error rates were largely unaffected (Table 1).
The choice between gapped and ungapped alignment is largely dependent on the nature of the sequencing experiments. For re-sequencing studies, gapped alignment is essential to minimise false positive variant calls . On the other hand, we would recommend disabling gap opening using option '-o 0' for experiments such as Chip-seq or RNA-seq for good performance.
Gapped alignment throughput of a GPU is equivalent to that of 6 CPU cores
The alignment throughputs of BarraCUDA and BWA were measured by mapping 2 sets of paired-end whole-genome shotgun libraries containing sequencing reads of 37 bp, and 76 bp in length (1000 Genomes Project, European Nucleotide Archive accession: ERR003014 and SRR032215 respectively) on to the human genome (NCBI36.54).
For BWT-index construction, both BWA and BarraCUDA utilize the same BWT-indexing core  and took about 1.5 h to complete (data not shown). The encoded genome was 2.6 GB including both the forward and reverse BWT indices. It is useful to note that all BWT-related files (.bwt,.rbwt,.ann,.sa,.rsa,.pac,.rpac) only needed to be generated once, and that the files are compatible between the two programs.
For ungapped alignments, while BWA exhibited a speed increase of 14.6% with 6 threads, we observed a 2.1 fold speed-up in BarraCUDA compared to gapped alignment (Figure 3b). In this experiment, BarraCUDA only took 5 m 46 s to align all 11.3 million pairs of read onto the human genome, almost half the time taken by BWA with 6 threads.
The SAM conversion 'samse' and 'sampe' cores in BWA, which convert alignments coordinates from BWT SA intervals back to linear space on the reference genome, are comparative less computational intensive than the alignment core as seen in Figure 3 and thus do not make use of multiple threads. We did not port the 'samse/sampe' to GPU in this version of BarraCUDA, but we improved the conversion speed by an average of 27% through the use of a more efficient memory management strategy in working with BWT indices.
With increased number of reads and a longer read length when aligning the 76 bp library, both software took a longer time to complete the mappings (Figure 3c). BWA with a single thread took almost 4 h to complete the gapped alignment of 14 million pairs of sequencing reads in the library, it was significantly shortened to 46 m 10 s when all 6 cores on the X5670 were used. BarraCUDA took 40 m 1 s to complete the alignment that was again similar to BWA with 6 threads.
When gap opening was disabled, the throughput of BWA was doubled (Figure 3d). Similarly, the time taken for BarraCUDA was also significantly shortened when gapped alignment was disabled, to 16 m 21 s which is 56% faster than BWA with 6 threads.
The 'sampe' core, on the other hand was not affected by the size and the read length of the library and took roughly the same time as the 37 bp library to convert the alignment coordinates. Again, BarraCUDA version of 'sampe' was slightly faster than BWA in this test.
Percentage of mappings
The percentage of confident mappings between BWA and BarraCUDA
37 bp library
76 bp library
Multiple GPU configuration
For computers with multiple CUDA-capable GPUs, BarraCUDA automatically selects the best GPU based on number of stream processors and the amount of graphics memory available to the software. Users can also specify which CUDA device the software is to be executed on by using the '-C' option followed by the device number. In order to take advantage of multiple GPUs in a system, BarraCUDA is accompanied with two scripts, namely 'barracuda-multi-se' and 'barracuda-multi-pe' to align parallel single-end reads and paired-end reads respectively using multiple GPUs. 'barracuda-multi-se' automatically detects the number of CUDA devices in the computer, splits the input.fastq read files according to the number of CUDA devices and calls multiple instances of BarraCUDA to align sequencing reads ('aln' and 'samse') in parallel. Once the alignment finishes, the script joins the files back into one single SAM file. For paired-end reads, 'barracuda-multi-pe' calls two instances of BarraCUDA to align the two paired.fastq read files at the same time and generates a single SAM output using the 'sampe' core. At the time of writing, 'barracuda-multi-pe' does not support more than 2 GPUs while 'barracuda-multi-se' is not bounded by the number of CUDA devices.
Multiple GPUs show a better scalability than CPUs
Here we present BarraCUDA, a next generation sequencing alignment software to perform mapping of sequencing reads to reference genomes using NVIDIA graphics cards. Being based on BWA, BarraCUDA can perform gapped alignment with gap extensions and supports mappings for single- and paired-end reads with comparable alignment accuracies. BarraCUDA also generates alignments in the SAM format for compatibility with downstream data analysis applications.
Due to the limited amount of on-board memory and the tremendous number of threads to handle concurrently, a memory efficient DFS approach was used to perform inexact matches. Although DFS is not as time efficient as BFS utilized in BWA, BarraCUDA still offers a throughput of 6X the speed of a CPU core for gapped alignment and even faster when gap opening is disabled.
We also show here that multiple GPUs scales better than CPUs. A normal computer can easily take up 4 GPUs, meaning that using this test library as an example, a single-end alignment can be done in 5 min, which is twice the speed of a high-end 12-core workstation. Using 8X GPU, we can achieve an alignment speed 3X faster than a traditional computing node with 12 CPU cores, making GPU nodes a more favourable option, in terms of HPC environment, than using those with CPUs.
The software lays an important milestone in low-cost and energy efficient computing in bioinformatics using GPGPU. The software is still under active development and work is underway to further improve the program efficiency.
Availability and requirements
Project Name: BarraCUDA
Project Home Page: http://seqbarracuda.sf.net
Operating System(s): Linux
Programming Language: C/C++, CUDA
Other Requirements: NVIDIA graphics cards with compute capability 1.3 or above, 768 MB VRAM, CUDA toolkit V4.0 or above
Licence: GNU GPL
Any Restrictions to use by non-academics: Nil
Compute unified device architecture
General purpose computing using graphics processing units
We would like to thank NVIDIA for donating their Tesla GPUs and access to their PSG laboratory cluster through the University of Cambridge CUDA Center of Excellence programme. We would also like to thank Thomas Bradley and Timothy Lanfear from NVIDIA for their technical assistances and advices. The work is supported by EurOCHIP FP7 consortium, MRC Centre of Obesity and Related Metabolic Diseases, and a funding from the National Institute for Health Research Cambridge Comprehensive Biomedical Research Centre.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-380.PubMedPubMed CentralGoogle Scholar
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452 (7189): 872-876. 10.1038/nature06884.PubMedView ArticleGoogle Scholar
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, et al: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007, 448 (7153): 553-560. 10.1038/nature06008.PubMedPubMed CentralView ArticleGoogle Scholar
- Green RE, Malaspinas AS, Krause J, Briggs AW, Johnson PL, Uhler C, Meyer M, Good JM, Maricic T, Stenzel U, et al: A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell. 2008, 134 (3): 416-426. 10.1016/j.cell.2008.06.021.PubMedPubMed CentralView ArticleGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev. 2009, 10 (10): 669-680. 10.1038/nrg2641.View ArticleGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev. 2009, 10 (1): 57-63. 10.1038/nrg2484.View ArticleGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Smith AD, Chung WY, Hodges E, Kendall J, Hannon G, Hicks J, Xuan Z, Zhang MQ: Updates to the RMAP short-read mapping software. Bioinformatics. 2009, 25 (21): 2841-2842. 10.1093/bioinformatics/btp533.PubMedPubMed CentralView ArticleGoogle Scholar
- Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC bioinformatics. 2008, 9: 128-10.1186/1471-2105-9-128.PubMedPubMed CentralView ArticleGoogle Scholar
- Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714. 10.1093/bioinformatics/btn025.PubMedView ArticleGoogle Scholar
- Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 2009, 25 (11): 1363-1369. 10.1093/bioinformatics/btp236.PubMedPubMed CentralView ArticleGoogle Scholar
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-1967. 10.1093/bioinformatics/btp336.PubMedView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMedPubMed CentralView ArticleGoogle Scholar
- Burrows M, Wheeler D: A block-sorting lossless data compression algorithm. 1994, California: Digital Equipment CorporationGoogle Scholar
- Ferragina P, Manzini G: Opportunistic data structures with applications. 41st Annual Symposium on Foundations of Computer Science:. 2000, IEEE, 390-398. ; Redondo Beach, CaliforniaView ArticleGoogle Scholar
- Lippert RA, Mobarry CM, Walenz BP: A space-efficient construction of the Burrows-Wheeler transform for genomic data. J Comput Biol. 2005, 12 (7): 943-951. 10.1089/cmb.2005.12.943.PubMedView ArticleGoogle Scholar
- Schatz MC, Trapnell C, Delcher AL, Varshney A: High-throughput sequence alignment using graphics processing units. BMC bioinformatics. 2007, 8: 474-10.1186/1471-2105-8-474.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Schatz MC: Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing. 2009, 35 (8): 429-440. 10.1016/j.parco.2009.05.002.PubMedPubMed CentralView ArticleGoogle Scholar
- Ukkonen E: On-line construction of suffix trees. Algorithmica. 1995, 14 (3): 249-260. 10.1007/BF01206331.View ArticleGoogle Scholar
- Delcher AL, Salzberg SL, Phillippy AM: Using MUMmer to identify similar regions in large sequence sets. Current protocols in bioinformatics/editoral board, Andreas D Baxevanis [et al. 2003, 10: 10 13-Google Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinfomatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.View ArticleGoogle Scholar
- Consortium TGP: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar
- Lam TW, Sung WK, Tam SL, Wong CK, Yiu SM: Compressed indexing and local alignment of DNA. Bioinformatics. 2008, 24 (6): 791-797. 10.1093/bioinformatics/btn032.PubMedView ArticleGoogle Scholar