Next-generation sequencing (NGS) is a technique based on sequencing by synthesis or sequencing by ligation in a massively parallel fashion, generating short sequencing reads ranging from 25 to 400 bp. The first commercially available next-generation sequencer, the Genome Sequencer 20, was released by 454 Life Sciences in 2005 [1] with the hope of enabling the analysis of complete genomes within a short period of time [2]. It produced a throughput of 20 megabases from a 5-hour run, 30-fold higher than traditional Sanger capillary electrophoresis systems. Over the years, the output of next-generation sequencers has increased 30,000-fold to 600 gigabases from a single instrument run (Illumina HiSeq 2000), equivalent to approximately 200-fold coverage of an entire human genome.
The advancement of the technology has generated an enormous amount of sequence data [3]. Sequence alignment is one of the first steps in downstream data analysis, during which sequencing reads have to be mapped either to one another to reconstruct the source genome (also known as de novo sequence assembly) [1, 4], or onto a reference DNA sequence, usually a genome, for downstream applications such as single-nucleotide polymorphism (SNP) discovery [3], ChIP-Seq [5] or RNA-Seq [6]. Here we focus on the latter type of alignment.
A handful of software packages have been developed for aligning the sequencing reads generated by NGS instruments to a reference sequence. Early-generation aligners such as MAQ [7], RMAP [8, 9] and SOAP [10] use hash-based algorithms to map sequencing reads onto reference genomes. Even though these tools can be accelerated by various seeding approaches, or parallelised across multiple cores and cloud computing (e.g. CloudBurst [11]), the computational cost remains high. For example, extrapolating from the results of Schatz et al. [11], it would take over 30,000 CPU hours to align the reads generated from a single HiSeq 2000 run.
Later in 2009, a new generation of sequence aligners was released, namely SOAP2 [12], Bowtie [13] and BWA [14]. These tools use the FM-index, a data structure built on the Burrows-Wheeler transform (BWT). The BWT is a block-sorting transform originally designed for lossless data compression [15], but it can also be used for string matching through a backward search approach [16]. The major advantages of this approach are a low time complexity of O(n) to find an exact match, where n is the length of the query [16], and a matching performance that is independent of the size of the reference sequence. In addition, a high compression ratio of 2.5 bits per base for the reference genome [17] means that a full human genome index fits into 1.3 GB of memory. These tools are an order of magnitude faster and far more memory efficient than their hash-based predecessors [14]. In an experiment performed by Li and Durbin [14], the CPU time required to map 12.2 million 51-bp reads to the human genome dropped from 94 hours (MAQ) to 4 hours (BWA) on a 2.5 GHz Intel Harpertown-based Xeon E5420, while retaining comparable mapping accuracy.
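As a concrete illustration of the backward search, the sketch below performs exact matching against an FM-index; the C[] array (cumulative base counts) and occ() function (occurrence counts in the BWT) are hypothetical names standing in for the precomputed index components, and the sketch is not the implementation used by any of the tools above.

```
/* Minimal sketch of FM-index backward search for an exact match (hypothetical
 * helper names; bases encoded as 0..3).  Each query base shrinks a half-open
 * suffix-array interval [sp, ep) in constant time, giving O(n) matching that
 * is independent of the reference length. */
extern unsigned int C[4];                            /* #reference bases < b   */
unsigned int occ(unsigned char b, unsigned int i);   /* #b in BWT[0 .. i)      */

unsigned int backward_search(const unsigned char *query, int len,
                             unsigned int bwt_len,
                             unsigned int *sp_out, unsigned int *ep_out)
{
    unsigned int sp = 0, ep = bwt_len;               /* interval of all suffixes */
    for (int i = len - 1; i >= 0; --i) {             /* walk query right to left */
        unsigned char b = query[i];
        sp = C[b] + occ(b, sp);
        ep = C[b] + occ(b, ep);
        if (sp >= ep) return 0;                      /* no exact match           */
    }
    *sp_out = sp; *ep_out = ep;
    return ep - sp;                                  /* number of occurrences    */
}
```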
Modern graphics cards are designed primarily for rendering real-time, high-definition 3D graphics for visual applications such as gaming and graphics design. Each graphics processing unit (GPU) consists of many high-performance stream processors capable of performing billions of independent calculations per second in order to meet the visualisation demands of these applications. It is this processing capability that can be translated into general-purpose computing power equivalent to that of a small cluster of traditional CPUs. In addition, the lower energy profile and cost of GPUs make their use for parallel computing tasks increasingly attractive. Many modern supercomputers, including the Chinese Tianhe-1A and Nebulae and the Japanese Tsubame 2.0 (http://www.top500.org/lists/2010/11), also contain multiple GPU nodes alongside traditional CPU nodes to take advantage of the parallel computing capability of GPUs.
MUMmerGPU [18, 19] was one of the first GPGPU-based DNA alignment programs to use the NVIDIA Compute Unified Device Architecture (CUDA), computing variable-length maximal exact matches. Unlike the BWT-based aligners, it takes a suffix-tree approach, using Ukkonen's algorithm [20] to find exact matches in the reference sequence for all substrings of the query DNA sequence. The current version of MUMmerGPU outperforms its CPU counterpart by 13-fold [18, 19]. However, unlike the other sequence aligners mentioned in the previous section, MUMmerGPU does not support inexact alignment by itself and has to be used in conjunction with the original MUMmer software package [21] to perform inexact alignments.
Here we introduce BarraCUDA, a program that takes advantage of GPGPU computing to perform inexact alignment of sequencing reads to a reference sequence. BarraCUDA is built on the foundation of BWA, and we have rewritten the BWT-based alignment core module to exploit the massive parallelism of GPGPU. It retains the fast and memory-efficient BWT-based algorithm of the original software and supports mismatches and fully gapped alignment of sequencing reads.
Software implementation
BarraCUDA utilises NVIDIA's GPGPU implementation, the Compute Unified Device Architecture (CUDA), to parallelise the alignment of sequence reads. Firstly, the program loads the complete BWT-encoded reference sequence and a batch of sequence reads from disk into GPU memory. It then launches a GPU alignment kernel, in which the alignment of each sequence read is distributed to one of the hundreds of processors within the GPU and the computations are performed in parallel. Once the kernel finishes, the alignment results are transferred from the GPU back to disk. The following sections describe each of these steps in detail (please also refer to Additional file 1 for the pseudo-code of the algorithm framework).
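The sketch below illustrates this host-side flow under the simplifying assumption of a single read batch; the names read_t, hit_t, align_kernel and align_batch are illustrative and do not correspond to BarraCUDA's actual identifiers.

```
/* A minimal sketch (not BarraCUDA's actual code) of the host-side flow just
 * described: copy the BWT-encoded reference and a batch of reads to the GPU,
 * launch the alignment kernel, and copy the results back. */
#include <cuda_runtime.h>

struct read_t { unsigned char bases[128]; int len; };   /* one packed read     */
struct hit_t  { unsigned int k, l; int n_diff; };       /* SA interval + diffs */

/* one-thread-per-read alignment kernel (sketched in the sections below) */
__global__ void align_kernel(const uint4 *bwt, const read_t *reads,
                             hit_t *hits, int n_reads);

void align_batch(const uint4 *h_bwt, size_t bwt_bytes,
                 const read_t *h_reads, int n_reads, hit_t *h_hits)
{
    uint4 *d_bwt; read_t *d_reads; hit_t *d_hits;
    cudaMalloc((void **)&d_bwt,   bwt_bytes);
    cudaMalloc((void **)&d_reads, n_reads * sizeof(read_t));
    cudaMalloc((void **)&d_hits,  n_reads * sizeof(hit_t));

    /* step 1: transfer the BWT-encoded reference and the read batch to the GPU */
    cudaMemcpy(d_bwt,   h_bwt,   bwt_bytes,                cudaMemcpyHostToDevice);
    cudaMemcpy(d_reads, h_reads, n_reads * sizeof(read_t), cudaMemcpyHostToDevice);

    /* step 2: launch one thread per read, tens of thousands at a time */
    int threads = 256;
    int blocks  = (n_reads + threads - 1) / threads;
    align_kernel<<<blocks, threads>>>(d_bwt, d_reads, d_hits, n_reads);

    /* step 3: copy the alignment results back to the host for writing to disk */
    cudaMemcpy(h_hits, d_hits, n_reads * sizeof(hit_t), cudaMemcpyDeviceToHost);

    cudaFree(d_bwt); cudaFree(d_reads); cudaFree(d_hits);
}
```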
1. Transferring BWT-encoded reference sequence and sequence reads from disk to GPU
BarraCUDA first loads the full BWT suffix array from disk into cached texture memory on the GPU as a one-dimensional uint4 array to ensure fast data access. Sequence reads are loaded into GPU memory in batches and packed into a single contiguous block to minimise internal fragmentation; these data are also bound to the texture cache to maximise data throughput.
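A minimal sketch of this arrangement, using the legacy CUDA texture-reference API that was current at the time and illustrative names (bwt_tex, bind_bwt, load_bwt_word), is shown below.

```
/* Hedged sketch: binding the BWT array, stored as a linear uint4 array in GPU
 * memory, to the cached texture path so that kernel reads go through the
 * read-only texture cache. */
#include <cuda_runtime.h>

texture<uint4, 1, cudaReadModeElementType> bwt_tex;    /* 1-D uint4 texture   */

void bind_bwt(const uint4 *d_bwt, size_t bwt_bytes)
{
    /* subsequent tex1Dfetch() calls read the BWT through the texture cache */
    cudaBindTexture(0, bwt_tex, d_bwt, bwt_bytes);
}

__device__ uint4 load_bwt_word(int i)
{
    return tex1Dfetch(bwt_tex, i);                     /* cached 16-byte load */
}
```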
2. CUDA thread assignments
Mapping a sequence read to a reference sequence is a data-independent process that does not require information from any of the other reads. BarraCUDA therefore employs straightforward data parallelism, assigning one alignment kernel thread to each sequencing read and launching the GPU kernel with tens of thousands of threads at a time.
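The sketch below shows what this one-thread-per-read mapping looks like in a CUDA kernel; read_t, hit_t and align_one_read() are illustrative placeholders rather than BarraCUDA's own code.

```
/* Minimal sketch of the one-thread-per-read mapping: each CUDA thread aligns
 * exactly one read, so no inter-thread communication is needed. */
struct read_t { unsigned char bases[128]; int len; };
struct hit_t  { unsigned int k, l; int n_diff; };        /* SA interval + diffs */

__device__ hit_t align_one_read(const uint4 *bwt, const read_t &r);  /* DFS core */

__global__ void align_kernel(const uint4 *bwt, const read_t *reads,
                             hit_t *hits, int n_reads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;     /* global thread index */
    if (idx < n_reads)                                   /* guard the tail      */
        hits[idx] = align_one_read(bwt, reads[idx]);     /* one read per thread */
}
```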
3. Inexact sequence alignment: a depth-first search GPU kernel
The alignment kernel in BarraCUDA, like those of BWA and other BWT-based sequence aligners, uses a backward search string-matching algorithm [13, 14] to look for alignments. Inexact alignment requires a search space of O(9^n) in order to generate and evaluate the series of base substitutions that could lead to an exact string match (Figure 1a). BarraCUDA differs from BWA in its strategy for traversing this search space. While BWA utilises a time-efficient breadth-first search (BFS) approach, as shown in Figure 1b, this can require up to 40 MB of memory for each computational thread (please refer to Additional file 1 for explanations). With thousands of concurrent threads on the GPU, the memory available to each thread is very limited and BFS is not a viable option. Therefore, in BarraCUDA we adopted a difference-bound DFS approach (outlined in Figure 1c), which uses roughly 20,000-fold less memory to perform alignments (see Additional file 1). Rather than storing all partial hits while traversing the search space, the DFS algorithm keeps in memory only the branch of the search space with the best local hit score, i.e. (1), then (1, 1), followed by (1, 1, 1) to give a full alignment (Figure 1c). Nonetheless, DFS is not as time efficient as the BFS employed in BWA: BarraCUDA has to re-assess nodes multiple times until all possible hits from each node have been evaluated. For instance, in Figure 1c the program has to return from (1, 1, 1) to (1, 1) in order to reach (1, 1, 3), the next best hit. This is a particular problem when reads are long and the search space is therefore extremely large.
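To make the memory argument concrete, the sketch below shows a difference-bound DFS of this kind, assuming hypothetical FM-index helpers d_C[] and d_occ(); for brevity it handles substitutions only and returns the first full-length hit within the bound, whereas the actual kernel also supports gaps and prioritises the branch with the best local hit score. Only the current branch of the search space is kept in memory, one small frame per base, so memory use is bounded by the read length.

```
/* Hedged sketch of a difference-bound DFS over the substitution search space. */
#define MAX_READ_LEN 128

struct dfs_frame {
    unsigned int sp, ep;    /* half-open SA interval before consuming this base */
    int n_diff;             /* differences accumulated so far                   */
    int next_base;          /* next substitution (0..3) to try at this node     */
};

__device__ unsigned int d_C[4];                                 /* #bases < b     */
__device__ unsigned int d_occ(unsigned char b, unsigned int i); /* #b in BWT[0,i) */

__device__ int dfs_align(const unsigned char *read, int len, unsigned int bwt_len,
                         int max_diff, unsigned int *out_sp, unsigned int *out_ep)
{
    dfs_frame stack[MAX_READ_LEN + 1];
    int depth = 0;                                   /* number of bases consumed */
    stack[0].sp = 0; stack[0].ep = bwt_len;
    stack[0].n_diff = 0; stack[0].next_base = 0;

    while (depth >= 0) {
        dfs_frame *f = &stack[depth];
        if (depth == len) {                          /* full-length alignment    */
            *out_sp = f->sp; *out_ep = f->ep;
            return f->n_diff;
        }
        if (f->next_base > 3) { --depth; continue; } /* node exhausted: backtrack */

        unsigned char b = (unsigned char)f->next_base++;
        int diff = f->n_diff + (b != read[len - 1 - depth]); /* substitution cost */
        if (diff > max_diff) continue;               /* difference bound: prune   */

        unsigned int sp = d_C[b] + d_occ(b, f->sp);  /* one backward-search step  */
        unsigned int ep = d_C[b] + d_occ(b, f->ep);
        if (sp >= ep) continue;                      /* base b yields no hit      */

        dfs_frame *g = &stack[depth + 1];            /* descend: only this branch */
        g->sp = sp; g->ep = ep;                      /* is kept in memory         */
        g->n_diff = diff; g->next_base = 0;
        ++depth;
    }
    return -1;                                       /* no alignment within bound */
}
```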
4. Multiple kernel design
Long sequence reads are divided into short fragments of 32 bp (the default seed length) and alignment is performed by multiple consecutive DFS kernel launches. Figure 2 illustrates this approach with the alignment of a 5-bp read: the first DFS kernel (GPU thread A) maps only up to the third base (nodes 3.1, 5.1 and 6.1), and the partial alignments at this point are returned from the GPU and stored in a temporary memory stack on the host computer (rounded square). After that, similarly to BWA's BFS, all partial hits in the host memory store are ranked by their number of differences by the host code, and the best hits are prioritised for the subsequent kernel launch (i.e. GPU thread B, then GPU thread C, followed by GPU thread D). By doing this, the number of node revisits by the DFS agent is limited to within the length of a fragment (32 bp), reducing the serialisation caused by thread divergence compared with aligning each read in a single kernel. In addition, as in BWA, once a full alignment is found this allows us to discard partial hits on the memory stack that have more than z + 1 differences, further increasing computing efficiency.
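The sketch below outlines such a host-side loop under simplifying assumptions; partial_t, launch_dfs_kernel() and the batch size are illustrative placeholders, and the ranking and pruning details differ from BarraCUDA's actual implementation.

```
/* Hedged sketch of the multi-kernel host loop: partial hits returned by each
 * kernel launch are kept in a host-side store, ranked by number of differences,
 * and the best candidates are extended first by the next launch. */
#include <algorithm>
#include <vector>

struct partial_t {
    unsigned int sp, ep;   /* SA interval reached so far     */
    int pos;               /* number of read bases consumed  */
    int n_diff;            /* differences accumulated so far */
};

/* hypothetical: extend each partial hit by up to one 32 bp fragment on the GPU */
std::vector<partial_t> launch_dfs_kernel(const unsigned char *read, int len,
                                         const std::vector<partial_t> &batch);

static bool fewer_diffs(const partial_t &a, const partial_t &b)
{
    return a.n_diff < b.n_diff;
}

int align_long_read(const unsigned char *read, int len,
                    unsigned int bwt_len, int z /* max differences */)
{
    const size_t BATCH = 65536;                   /* hits extended per launch  */
    std::vector<partial_t> store;                 /* host-side memory stack    */
    partial_t root = { 0u, bwt_len, 0, 0 };       /* nothing aligned yet       */
    store.push_back(root);
    int best = z + 1;                             /* best full alignment so far */

    while (!store.empty()) {
        /* partial hits with the fewest differences are extended first */
        std::sort(store.begin(), store.end(), fewer_diffs);
        size_t n = std::min(store.size(), BATCH);
        std::vector<partial_t> batch(store.begin(), store.begin() + n);
        store.erase(store.begin(), store.begin() + n);

        std::vector<partial_t> extended = launch_dfs_kernel(read, len, batch);
        for (size_t i = 0; i < extended.size(); ++i) {
            if (extended[i].n_diff > z) continue;       /* over the bound       */
            if (extended[i].pos == len) {
                if (extended[i].n_diff < best)
                    best = extended[i].n_diff;          /* full alignment found */
            } else if (extended[i].n_diff < best) {
                store.push_back(extended[i]);           /* still worth extending */
            }
        }
    }
    return best <= z ? best : -1;                 /* -1: no alignment found    */
}
```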
5. Alignment data management
During runtime, the alignment data for each sequence read, including the BWT suffix array (SA) coordinates and the number of differences, are stored temporarily in GPU memory. Once the kernel finishes, these data are copied from the GPU back to the host and subsequently written to disk in a binary file format. As in BWA, the SA coordinates can then be translated into linear genome coordinates using BarraCUDA's 'samse' or 'sampe' cores for single-end or paired-end libraries, respectively.
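A minimal sketch of dumping these per-read records to a binary file is shown below; the record layout is illustrative and does not correspond to BarraCUDA's actual on-disk format.

```
/* Hedged sketch: write the per-read alignment records copied back from the GPU
 * to a binary file; 'samse'/'sampe' later convert the SA coordinates held in
 * these records to linear genome positions. */
#include <cstdio>

struct aln_rec {
    unsigned int sa_k, sa_l;   /* BWT suffix-array (SA) interval of the hit */
    int n_diff;                /* number of differences in the alignment    */
};

int write_alignments(const char *path, const aln_rec *recs, size_t n_reads)
{
    FILE *fp = fopen(path, "wb");
    if (!fp) return -1;
    size_t written = fwrite(recs, sizeof(aln_rec), n_reads, fp);  /* raw dump */
    fclose(fp);
    return written == n_reads ? 0 : -1;
}
```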