CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions
Yongchao Liu, Bertil Schmidt and Douglas L Maskell
DOI: 10.1186/1756-0500-3-93
© Liu et al; licensee BioMed Central Ltd. 2010
Received: 15 January 2010
Accepted: 6 April 2010
Published: 6 April 2010
Abstract
Background
Due to its high sensitivity, the Smith-Waterman algorithm is widely used for biological database searches. Unfortunately, the quadratic time complexity of this algorithm makes it highly time-consuming. The exponential growth of biological databases further deteriorates the situation. To accelerate this algorithm, many efforts have been made to port it to high-performance architectures, especially the recently emerging many-core architectures and their associated programming models.
Findings
This paper describes the latest release of the CUDASW++ software, CUDASW++ 2.0, which makes new contributions to Smith-Waterman protein database searches using compute unified device architecture (CUDA). A parallel Smith-Waterman algorithm is proposed to further optimize the performance of CUDASW++ 1.0 based on the single instruction, multiple thread (SIMT) abstraction. For the first time, we have investigated a partitioned vectorized Smith-Waterman algorithm using CUDA based on the virtualized single instruction, multiple data (SIMD) abstraction. The optimized SIMT and the partitioned vectorized algorithms were benchmarked and, remarkably, have similar performance characteristics. CUDASW++ 2.0 achieves a performance improvement over CUDASW++ 1.0 of as much as 1.74 (1.72) times using the optimized SIMT algorithm and up to 1.77 (1.66) times using the partitioned vectorized algorithm, with a performance of up to 17 (30) billion cell updates per second (GCUPS) on a single-GPU GeForce GTX 280 (dual-GPU GeForce GTX 295) graphics card.
Conclusions
CUDASW++ 2.0 is publicly available open-source software, written in the CUDA and C++ programming languages. It obtains a significant performance improvement over CUDASW++ 1.0 using either the optimized SIMT algorithm or the partitioned vectorized algorithm for Smith-Waterman protein database searches by fully exploiting the compute capability of commonly used, low-cost CUDA-enabled GPUs.
Background
Sequence database searches in the field of bioinformatics are used to identify potential evolutionary relationships by identifying the similarities between query and subject sequences. The similarities between sequences can be determined by computing their optimal local alignments using the dynamic-programming-based Smith-Waterman (SW) algorithm [1, 2]. However, the quadratic time complexity of this algorithm makes it computationally demanding, a problem further compounded by the exponential growth of sequence databases. Heuristic solutions, such as FASTA [3] and BLAST [4, 5], have been devised to reduce the execution time, usually producing good results. However, these heuristics might fail to detect some distantly related sequences due to their loss of sensitivity. Hence, the use of high-performance architectures, especially the emerging accelerator technologies and many-core architectures such as FPGAs, Cell/BEs and GPUs, has become a recent trend for executing the SW algorithm, allowing exact results to be produced in reasonably short time.
For the FPGA technology, linear systolic arrays and massively parallel computing using custom instructions have been used to perform the SW algorithm. Oliver et al. [6, 7] constructed a linear systolic array to perform the SW algorithm on a standard Virtex II FPGA board using affine gap penalties. Li et al. [8] exploited custom instructions to accelerate the SW algorithm for DNA sequences on an Altera Stratix EP1S40 FPGA by dividing the SW matrix into grids of 8 × 8 cells. For SIMD vectorization, particularly the streaming SIMD extensions 2 (SSE2) technology, there are two basic vectorized SW algorithms available: one computes the algorithm using SIMD vectors parallel to the minor diagonal [9], and the other uses SIMD vectors parallel to the query sequence in a sequential layout [10] or a striped layout [11]. The striped SW approach [11] was subsequently optimized for the Cell/BE [12]. SWPS3 [13] extends this work for the Cell/BE and also improves it for x86/SSE2 to support multi-core processors, and CBESW [14] is designed for the Cell/BE-based PlayStation 3. For general-purpose GPUs, Liu et al. [15] developed an implementation of the SW algorithm using OpenGL as a first step. After the advent of the CUDA programming model, SW-CUDA [16] was developed, supporting multiple G80 (and higher) GPUs. However, this algorithm distributes the SW computation between multi-core CPUs and GPUs, which makes it highly CPU-dependent and unable to truly exploit the computational power of GPUs. Different from SW-CUDA, CUDASW++ 1.0 [17], designed for multiple G200 (and higher) GPUs, completes all the SW computation on GPUs, fully exploiting their power. To the best of our knowledge, CUDASW++ 1.0 was the fastest publicly available solution to the exact SW algorithm on commodity hardware before this paper.
In this paper, we present the latest release of the CUDASW++ software, CUDASW++ 2.0, which makes new contributions to SW protein database searches using CUDA by more deeply exploiting the compute power of CUDA-enabled GPUs. An optimized SIMT SW algorithm is suggested to further improve the performance of CUDASW++ 1.0 based on the SIMT abstraction. For the first time, we have investigated a partitioned vectorized SW algorithm using CUDA based on the virtualized SIMD abstraction. CUDASW++ 2.0 obtains a significant performance improvement over CUDASW++ 1.0 using either the optimized SIMT or the partitioned vectorized algorithm on the same platforms, achieving a performance of up to 17 (30) GCUPS on a single-GPU GeForce GTX 280 (dual-GPU GeForce GTX 295) graphics card. In addition, it also outperforms previous GPU implementations of SW database searching as well as implementations based on SSE2, the Cell/BE or heuristics.
The Smith-Waterman algorithm
For two sequences S_1 and S_2 of lengths l_1 and l_2, the SW algorithm computes local alignment scores with affine gap penalties using the recurrences

H(i, j) = max{0, E(i, j), F(i, j), H(i-1, j-1) + sbt(S_1[i], S_2[j])}
E(i, j) = max{E(i, j-1) - σ, H(i, j-1) - ρ - σ}     (1)
F(i, j) = max{F(i-1, j) - σ, H(i-1, j) - ρ - σ}

where sbt is the substitution matrix, ρ is the gap open penalty and σ is the gap extension penalty. A substitution matrix sbt gives the substitution rates of amino acids in proteins, derived from alignments of protein sequences. The recurrences are initialized as H(i, 0) = H(0, j) = E(i, 0) = F(0, j) = 0 for 0 ≤ i ≤ l_1 and 0 ≤ j ≤ l_2. The maximum local alignment score is defined as the maximum score in H. The computation of each cell in H depends on its left, upper and upper-left neighbors, as shown by the three arrows in Additional file 1. This data dependency implies that all cells on the same minor diagonal of the alignment matrix are independent and can be computed in parallel. Thus, the alignment matrix can be computed in minor-diagonal order from the top-left corner to the bottom-right corner, where the computation of minor diagonal i only needs the results of minor diagonals i-1 and i-2.
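As a concrete illustration of these recurrences, the following CPU-only C++ sketch computes the maximum local alignment score with affine gap penalties; the substitution matrix is reduced to a hypothetical match/mismatch score for brevity, and the function name is illustrative rather than part of CUDASW++:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Scalar Smith-Waterman with affine gap penalties, implementing the
// H, E and F recurrences directly. rho is the gap open penalty and
// sigma the gap extension penalty; a real protein search would use a
// substitution matrix such as BLOSUM62 instead of match/mismatch.
int smith_waterman(const std::string& s1, const std::string& s2,
                   int match, int mismatch, int rho, int sigma) {
    const int l1 = s1.size(), l2 = s2.size();
    // Row 0 and column 0 of H, E and F are initialized to 0.
    std::vector<std::vector<int>> H(l1 + 1, std::vector<int>(l2 + 1, 0));
    std::vector<std::vector<int>> E = H, F = H;
    int best = 0;  // maximum score over the whole alignment matrix
    for (int i = 1; i <= l1; ++i) {
        for (int j = 1; j <= l2; ++j) {
            int sbt = (s1[i - 1] == s2[j - 1]) ? match : mismatch;
            E[i][j] = std::max(E[i][j - 1] - sigma, H[i][j - 1] - rho - sigma);
            F[i][j] = std::max(F[i - 1][j] - sigma, H[i - 1][j] - rho - sigma);
            H[i][j] = std::max({0, E[i][j], F[i][j], H[i - 1][j - 1] + sbt});
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

Note that cells with equal i + j lie on the same minor diagonal and depend only on the two preceding diagonals, which is the parallelism the SIMT kernels exploit.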
CUDA programming model
CUDA is an extension of C/C++ with a minimalist set of abstractions for expressing parallelism, enabling users to write scalable multi-threaded parallel code for CUDA-enabled GPUs [18]. A CUDA program consists of two parts: a host program running on the host CPU, and one or more parallel kernels which can be executed on GPUs with NVIDIA's Tesla unified graphics and computing architecture [19].
A kernel is written in conventional scalar C code, which represents the operations to be performed by a single thread and is invoked as a set of concurrently executing threads. These threads are organized into a grid of thread blocks, where a thread block is a set of concurrent threads. This hierarchical organization has implications for thread communication and synchronization. Threads in a thread block are allowed to synchronize with each other using barriers, and can communicate through a per-block shared memory (PBSM). However, threads located in different thread blocks cannot communicate or synchronize directly. To write efficient CUDA programs, besides the PBSM, it is important to understand the other memory spaces in more detail: non-cached global and local memory, cached texture and constant memory, as well as on-chip registers.
The Tesla architecture is built around a fully programmable scalable processor array, organized into a number of streaming multiprocessors (SMs). Each SM contains eight scalar processors (SPs), sharing a PBSM of size 16 KB. All threads of a thread block are executed concurrently on a single SM. The SM executes threads in small groups of 32 threads, called warps, in SIMT fashion. When a thread block is scheduled to execute on an SM, its threads are split into warps that get scheduled by the SIMT unit. A warp executes one common instruction at a time, but allows for instruction divergence. When divergence occurs, the warp serially executes each branch path. Thus, parallel performance is generally penalized by data-dependent conditional branches and improved if all threads in a warp follow the same execution path. Branch divergence occurs only within a warp; different warps run independently regardless of the common or disjoint code paths they are executing.
Virtualized SIMD vector programming model
Because a warp executes one common instruction at a time, all threads in a warp are implicitly synchronized after executing any instruction. This execution manner closely resembles SIMD vector organizations, in which a single instruction controls multiple processing elements. Therefore, it is viable to virtualize a warp as an SIMD vector with each thread as a vector element. An alternative virtualization at the warp level is to divide a warp into several thread groups of equal size and then virtualize each thread group as a vector with each thread in the group as an element. However, for current CUDA-enabled GPU technologies, this warp-level virtualization limits the virtualized vector length to 32. To support longer vector lengths, vectors can be virtualized at the thread-block level, where a thread block is considered as one large vector with each thread in the thread block as an element. In this case, the intrinsic function __syncthreads() has to be used to explicitly synchronize all threads at specific synchronization points in the kernel to preserve the correctness of the virtualized vector computations.
In this paper, we refer to the virtualized vector as a virtualized SIMD vector and its corresponding programming model as the virtualized SIMD model to differentiate them from real SIMD vector organizations. Since this virtualization is based on the SIMT model, the virtualized SIMD model shares all the features of the SIMT model with the additional ability to conduct vector computations. We define VL to denote the length of a virtualized vector, i.e. the number of data lanes of the vector. For convenience of discussion, we assume that the first element (indexed by 0) is the rightmost and the last element (indexed by VL - 1) the leftmost element of the vector. Each thread comprising a virtualized vector is assigned a vector ID vtid that is equal to the position index of its corresponding vector element in the vector of length VL, where 0 ≤ vtid < VL. In this paper, we use warp-level virtualization to implement the vectorized SW algorithms.
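The warp-level virtualization described above amounts to a simple index mapping: thread tid of a block belongs to virtualized vector tid / VL and has lane index vtid = tid mod VL. The following CPU-side C++ sketch makes this explicit (the helper name is ours, not part of CUDASW++):

```cpp
#include <utility>

// Map a thread's flat index within its block to (virtualized vector
// number, lane index vtid). VL must divide the warp size (e.g. 32 or
// 16) so that a virtualized vector never straddles two warps; the
// threads of one warp are then implicitly synchronized after every
// instruction, which is what makes the virtualization safe.
std::pair<int, int> virtual_lane(int thread_idx, int VL) {
    int vector_id = thread_idx / VL;  // which virtualized vector
    int vtid = thread_idx % VL;       // lane index, 0 <= vtid < VL
    return {vector_id, vtid};
}
```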
Methods
Query profile
To calculate H(i, j), the substitution score sbt(S_1[i], S_2[j]), from the substitution matrix, is added to H(i-1, j-1). Due to the huge number of iterations in the SW algorithm calculation, reducing the number of instructions needed to perform one cell calculation has a significant impact on the execution time. In this regard, Rognes et al. [10] and Farrar [11] suggested the use of a query profile parallel to the query sequence for each possible residue. A query profile is precalculated just once before database searches, and can be calculated in two layouts: a sequential layout [10] and a striped layout [11].
Even though the sequential query profile was initially designed for SIMD vector computation of the SW algorithm, it is also suitable for scalar computation of the algorithm. For SIMD vector computation, the query length l is generally aligned to a multiple of the vector length VL for performance reasons, padding Q with dummy residues that have a substitution score of zero against any residue.
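The precomputation of a sequential query profile can be sketched as below; the function name is illustrative and a toy alphabet stands in for the 20 amino acids plus padding symbols. The query length is padded up to a multiple of VL with dummy positions scoring zero:

```cpp
#include <string>
#include <vector>

// Build a sequential query profile: profile[r][j] = sbt(r, Q[j]), so
// the inner loop replaces a two-dimensional substitution-matrix lookup
// by a single indexed load per cell. Positions beyond the query length
// keep the dummy score 0.
std::vector<std::vector<int>> build_profile(
        const std::string& query, const std::string& alphabet,
        const std::vector<std::vector<int>>& sbt, size_t VL) {
    size_t padded = (query.size() + VL - 1) / VL * VL;  // align l to VL
    std::vector<std::vector<int>> profile(
        alphabet.size(), std::vector<int>(padded, 0));
    for (size_t r = 0; r < alphabet.size(); ++r)
        for (size_t j = 0; j < query.size(); ++j)
            profile[r][j] = sbt[r][alphabet.find(query[j])];
    return profile;
}
```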
Optimized SIMT Smith-Waterman algorithm using CUDA
The SIMT SW algorithm used by CUDASW++ 1.0 is designed around the SIMT abstraction of CUDA-enabled GPUs, which enables thread-level parallelism for independent scalar threads as well as data parallelism for coordinated threads. It completes a database search in two stages: the first stage uses inter-task parallelization based on thread-level parallelism, and the second stage uses intra-task parallelization based on data parallelism. Since the first stage dominates the total runtime when searching large databases, the optimizations in CUDASW++ 2.0 focus on this stage. The performance of CUDASW++ 2.0 is significantly improved by two optimizations: introducing a sequential query profile and using a packed data format.
Basic vectorized Smith-Waterman algorithm using CUDA
The basic vectorized SW algorithm is designed by directly mapping the striped SW algorithm [11] onto CUDA-enabled GPUs, based on the virtualized SIMD vector programming model. For the computation of each column of the alignment matrix, the striped SW algorithm consists of two loops: an inner loop that calculates local alignment scores under the assumption that F values do not contribute to the corresponding H values, and a lazy-F loop that corrects any errors introduced by the inner loop. The basic vectorized algorithm uses a striped query profile. In the alignment matrix, for a specific column, the inner loop is completed in T iterations by moving SIMD vectors sequentially through all T vector segments of P_r corresponding to this column. For convenience of discussion, define vecH(i, j), vecE(i, j) and vecF to hold the H, E and F values of the cells corresponding to VSEG_i of P_r, where 1 ≤ i ≤ T, for the j-th column of the alignment matrix. Using virtualized SIMD vectors, several technical issues have to be addressed for this CUDA algorithm, including saturation arithmetic operations, shift operations and predicate operations on virtualized vectors.
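In the striped layout, a query padded to length T·VL is distributed so that lane v of vector segment t holds query position t + v·T; consecutive inner-loop iterations walk through the T segments while each lane advances with stride T. This index mapping can be sketched as a one-line helper (the name is illustrative):

```cpp
// Striped layout index mapping: returns the query position held by
// lane `lane` of vector segment `segment`, where T is the number of
// vector segments (padded query length divided by VL).
int striped_position(int segment, int lane, int T) {
    return segment + lane * T;
}
```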
Saturation additions and subtractions are required to calculate alignment scores. Since CUDA-enabled graphics hardware lacks direct support for these operations, maximum and minimum operations are used to emulate them. The integer functions max(x, y) and min(x, y) in the CUDA runtime library are used to avoid divergence. Shift operations on vectors are required by both the inner and the lazy-F loops. We implement these operations using shared memory, where all threads comprising a virtualized vector write their original values to a shared memory buffer and then read back their resulting values from the buffer according to the number of shifted elements. Additional file 4 gives the CUDA pseudocode for shifting a virtualized vector n elements to the left. As can be seen from the pseudocode, one shift operation is time-consuming compared with vector register operations in real SIMD vector architectures, even though access to shared memory without bank conflicts has a much lower latency than device memory [20].
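The shift can be modeled on the CPU as a write phase into a buffer (playing the role of the shared memory) followed by a read phase in which every lane fetches the value n positions to its right; recall that lane 0 is the rightmost element, and zeroes are shifted in. A sketch under these assumptions:

```cpp
#include <vector>

// Simulate shifting a virtualized vector n elements to the left via a
// shared buffer. On the GPU, each of the VL threads executes one
// iteration of the read loop after all threads have written, relying
// on the implicit synchronization within a warp.
std::vector<int> shift_left(const std::vector<int>& vec, int n) {
    const int VL = static_cast<int>(vec.size());
    std::vector<int> buf(vec);             // write phase: buf acts as shared memory
    std::vector<int> out(VL);
    for (int vtid = 0; vtid < VL; ++vtid)  // read phase, one "thread" per lane
        out[vtid] = (vtid >= n) ? buf[vtid - n] : 0;
    return out;
}
```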
The lazy-F loop requires predicate operations on virtualized vectors when determining whether to continue or exit the loop by checking vecF against the values of vecH(i, j). One approach is to use shared memory to simulate these operations. Although effective, this approach is inefficient due to the overhead incurred by accesses to shared memory. Fortunately, CUDA-enabled GPU devices with compute capability 1.2 and higher provide support for the two warp vote functions __all(int) and __any(int), offering an indispensable capability to perform fast predicate operations across all threads within a warp. We use the __all(int) warp vote function to implement the predicate operations on virtualized vectors for the lazy-F loop.
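The effect of __all(int) can be modeled as a reduction of a per-lane predicate. The sketch below uses an exit condition in the spirit of the lazy-F loop of the striped algorithm; the exact comparison is illustrative, not the CUDASW++ kernel code:

```cpp
#include <cstddef>
#include <vector>

// CPU model of the __all() warp vote: true only if the predicate holds
// in every lane. Here the (illustrative) predicate asks whether each
// lane's F value can no longer contribute to its H value, in which
// case the lazy-F loop may exit.
bool all_lanes_done(const std::vector<int>& vecF,
                    const std::vector<int>& vecH, int rho, int sigma) {
    for (std::size_t lane = 0; lane < vecF.size(); ++lane)
        if (!(vecF[lane] < vecH[lane] - rho - sigma))
            return false;  // one dissenting lane makes the vote fail
    return true;
}
```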
The striped query profile is stored in texture memory to exploit the texture cache. Subject sequences and the query profile are stored using scalar data types in an unpacked fashion, because the inner loop is a for loop without manual unrolling. The intermediate values of vecH(i, j) and vecE(i, j) are stored in global memory, with vecF kept in registers, to support long query sequences. To improve global memory access efficiency, we use the unsigned halfword data type to store the H and E values in global memory.
Partitioned vectorized Smith-Waterman algorithm using CUDA
To gain higher performance, we have investigated a partitioned vectorized SW algorithm using CUDA. This algorithm first divides a query sequence into a series of non-overlapping, consecutive small partitions according to a specified partition length (PL), and then aligns the query sequence to a subject sequence partition by partition. For the partitioned vectorized algorithm, PL must be a multiple of VL. The alignment between one partition of the query sequence and the subject sequence is performed using the basic vectorized algorithm. Because PL is usually set relatively small, shared memory or registers can be used to store the alignment scores.
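The partitioning step itself is straightforward. A CPU-side sketch (a hypothetical helper returning (start, length) pairs, not CUDASW++ code) might look like:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Divide a query of length `len` into consecutive, non-overlapping
// partitions of length PL; each partition is then aligned against the
// subject sequence with the basic vectorized algorithm. In CUDASW++
// 2.0 the padded query length makes PL a multiple of VL; in this
// sketch the last partition may simply be shorter.
std::vector<std::pair<int, int>> make_partitions(int len, int PL) {
    std::vector<std::pair<int, int>> parts;
    for (int start = 0; start < len; start += PL)
        parts.emplace_back(start, std::min(PL, len - start));
    return parts;
}
```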
Theorem 1. For the partitioned vectorized SW algorithm, the F values of all cells in the first row of the current partition are correctly computed regardless of the correctness of the F values of all cells in the last row of the previous partition.
Proof. Taking cells B and C in Figure 2 as an example, define C_F to denote the F value of C, B_H the H value of B, and B_F the F value of B, where C_F = max(B_H - ρ - σ, B_F - σ) according to equation (1). For the striped SW algorithm, the correctness of the F value of the last cell of a specific column j in the alignment matrix depends on two possible cases.
Case 1: the lazy-F loop does not stop until the F values of all cells in column j have been corrected, because the F value of each cell contributes to its H value. In this case, due to the recalculation of all cells, vecF stores the correct F value of the last cell. Since both B_H and B_F are correct, C_F is definitely correctly calculated.
Case 2: the lazy-F loop stops after some iterations with no need to recalculate all cells. This means that the F values of the remaining cells do not contribute to their corresponding H values, but might not equal the correct F values directly calculated using equation (1). In this case, because B_F - σ ≤ B_H - ρ - σ, C_F is equal to B_H - ρ - σ, so that C_F does not depend on B_F.
From the above discussion, we conclude that C_F is always correctly calculated regardless of whether B_F is correct. Therefore, the theorem is proven.
Results and discussion
We use GCUPS [17] to measure the performance of our algorithms. In this paper, the execution time t includes the transfer time of the query sequences from host to GPU, the calculation time of the SW algorithm, and the time taken to transfer back the scores. In addition, when running on multiple GPUs, t also includes the transfer time of database sequences from host memory to GPU, and the time required for creating and destroying the host threads.
The performance of CUDASW++ 2.0 is benchmarked and analyzed by searching with 20 query sequences of length 144 to 5,478 against Swiss-Prot release 56.6 (released on Dec. 16, 2008, comprising 146,166,984 amino acids in 405,506 sequences, with the longest sequence containing 35,213 amino acids). The tests on a single GPU are carried out on a GeForce GTX 280 (GTX 280) graphics card, with 30 SMs comprising 240 SPs and 1 GB RAM, installed in a PC with an AMD Opteron 248 2.2 GHz processor running Linux. The multi-GPU tests are carried out on a GeForce GTX 295 (GTX 295) graphics card with two G200 GPU chips on a single card, consisting of 480 SPs (240 SPs per GPU) and 1.8 GB RAM, installed in a PC with an Intel i7 quad-core 2.67 GHz processor running Linux. This graphics card has slightly lower clock frequencies than the GTX 280.
The performance of the optimized SIMT algorithm does not depend on the substitution matrix and gap penalties used, whereas the two vectorized algorithms are sensitive to them. Generally, for a specific substitution matrix, the higher the gap open and gap extension penalties, the higher the performance, because fewer iterations are needed to recalculate F in the lazy-F loop. Since the BLOSUM family of substitution matrices, particularly BLOSUM62, is the de facto standard in protein database searches and sequence alignments, all tests in this paper use BLOSUM62 as the substitution matrix unless specified otherwise. The optimized SIMT algorithm uses a gap penalty of 10-2k (gap open penalty 10, gap extension penalty 2), and the two vectorized algorithms use several different gap penalties to examine the runtime characteristics. For the optimized SIMT algorithm, maximal performance is achieved for a thread block size of 256 threads and a grid size equal to 4× the number of SMs; for the basic vectorized algorithm, maximal performance is achieved using VL equal to the warp size (i.e. 32) for a thread block size of 256 threads and a grid size equal to 64× the number of SMs; and for the partitioned vectorized algorithm, maximal performance is achieved using PL equal to 256 and VL equal to the half-warp size (i.e. 16) for a thread block size of 192 threads and a grid size equal to 128× the number of SMs. The basic vectorized algorithm produces much lower performance than the partitioned vectorized algorithm for several different gap penalties. Additional file 5 shows the performance ratio (as a percentage) of the basic vectorized algorithm to the partitioned vectorized one for different gap penalties on a single GPU. Hence, the basic vectorized algorithm is excluded from the release of CUDASW++ 2.0.
Performance evaluation of the optimized SIMT and partitioned vectorized algorithms on GTX 280
Query Sequences  Partitioned  SIMT
  10-2k  20-2k  40-3k  10-2k
Query  Length  Time (s)  GCUPS  Time (s)  GCUPS  Time (s)  GCUPS  Time (s)  GCUPS
P02232  144  1.58  13.3  1.41  14.9  1.40  15.0  1.38  15.2 
P05013  189  1.80  15.4  1.66  16.7  1.65  16.8  1.75  15.8 
P14942  222  2.01  16.1  1.84  17.6  1.82  17.8  2.00  16.2 
P07327  375  3.97  13.8  3.64  15.1  3.51  15.6  3.35  16.4 
P01008  464  4.57  14.8  4.20  16.1  4.03  16.8  4.05  16.7 
P03435  567  5.87  14.1  5.38  15.4  5.28  15.7  4.94  16.4 
P42357  657  6.64  14.5  6.16  15.6  5.97  16.1  5.00  16.6 
P21177  729  6.92  15.4  6.40  16.6  6.24  17.1  5.77  16.6 
Q38941  850  7.98  15.6  7.37  16.9  7.35  16.9  6.35  16.8 
P27895  1000  10.27  14.2  9.29  15.7  8.74  16.7  7.44  16.7 
P07756  1500  15.07  14.5  14.08  15.6  13.43  16.3  8.64  16.9 
P04775  2005  19.30  15.2  18.05  16.2  17.36  16.9  13.04  16.8 
P19096  2504  22.89  16.0  21.49  17.0  21.19  17.3  17.50  16.7 
P28167  3005  28.54  15.4  26.08  16.8  25.53  17.2  21.89  16.7 
P0C6B8  3564  32.44  16.1  30.56  17.0  29.60  17.6  26.41  16.6 
P20930  4061  40.47  14.7  36.07  16.5  34.31  17.3  31.35  16.6 
P08519  4548  42.41  15.7  39.89  16.7  38.86  17.1  35.84  16.6 
Q7TMA5  4743  42.44  16.3  39.36  17.6  39.30  17.6  40.18  16.5 
P33450  5147  50.91  14.8  47.74  15.8  44.20  17.0  41.92  16.5 
Q9UKN1  5478  55.46  14.4  49.49  16.2  46.66  17.2  45.62  16.5 
Performance evaluation of the optimized SIMT and partitioned vectorized algorithms on GTX 295
Query Sequences  Partitioned  SIMT
  10-2k  20-2k  40-3k  10-2k
Query  Length  Time (s)  GCUPS  Time (s)  GCUPS  Time (s)  GCUPS  Time (s)  GCUPS
P02232  144  1.19  17.7  1.13  18.7  1.09  19.4  1.02  20.7 
P05013  189  1.34  20.7  1.30  21.4  1.26  22.1  1.25  22.3 
P14942  222  1.49  22.0  1.41  23.1  1.38  23.7  1.37  23.8 
P07327  375  2.77  19.9  2.58  21.4  2.42  22.8  2.15  25.7 
P01008  464  3.04  22.4  2.82  24.2  2.66  25.6  2.54  26.8 
P03435  567  3.93  21.2  3.61  23.1  3.49  23.9  3.11  26.8 
P42357  657  4.29  22.5  4.02  24.0  3.87  25.0  3.56  27.1 
P21177  729  4.53  23.7  4.22  25.4  4.04  26.5  3.90  27.5 
Q38941  850  5.03  24.9  4.66  26.8  4.63  27.0  4.53  27.6 
P27895  1000  6.58  22.3  5.87  25.1  5.38  27.3  5.21  28.2 
P07756  1500  9.86  22.4  9.19  24.0  8.58  25.7  7.72  28.6 
P04775  2005  12.26  24.1  11.32  26.0  10.79  27.3  10.26  28.7 
P19096  2504  14.32  25.7  13.34  27.6  12.99  28.4  12.79  28.8 
P28167  3005  18.31  24.1  16.46  26.9  15.56  28.4  15.33  28.8 
P0C6B8  3564  21.09  24.9  19.34  27.1  17.99  29.1  18.20  28.8 
P20930  4061  26.75  22.3  23.35  25.6  20.76  28.8  20.77  28.8 
P08519  4548  27.36  24.4  25.11  26.6  23.92  28.0  23.24  28.8 
Q7TMA5  4743  25.86  27.0  23.57  29.6  23.51  29.7  24.24  28.8 
P33450  5147  32.69  23.2  30.57  24.8  27.37  27.7  26.33  28.7 
Q9UKN1  5478  36.61  22.0  32.40  24.9  28.88  27.9  28.05  28.7 
Performance comparison between CUDASW++ 1.0, CUDASW++ 2.0 and NCBI-BLAST
Software  Time (h)  GCUPS
Optimized SIMT (BL62, 10-2k)  8.00  28.8
Partitioned (BL62, 10-2k)  11.15  20.7
Partitioned (BL50, 10-3k)  11.71  19.7
NCBI-BLAST (BL62, 10-2k)  9.56  24.1
NCBI-BLAST (BL50, 10-3k)  51.45  4.5
CUDASW++ 1.0 (BL62, 10-2k)  14.12  16.3
Conclusions
In this paper, we have presented our new contributions to SW database searches using CUDA, through the latest release of the CUDASW++ 2.0 software targeted at CUDA-enabled GPUs with compute capability 1.2 and higher. An optimized SIMT SW algorithm is suggested to further improve the performance of CUDASW++ 1.0 based on the SIMT abstraction of CUDA-enabled GPUs. For the first time, we have investigated a partitioned vectorized SW algorithm using CUDA based on the virtualized SIMD abstraction of CUDA-enabled GPUs. This virtualized SIMD vector programming model also provides guidance for designing other bioinformatics algorithms, such as pairwise distance computation in ClustalW [21, 22], using SIMD vectorization for CUDA-enabled GPUs. The optimized SIMT and the partitioned vectorized algorithms have remarkably similar performance characteristics when benchmarked by searching Swiss-Prot release 56.6 with query sequences of length 144 to 5,478. The optimized SIMT algorithm produces reasonably stable performance, while the partitioned vectorized algorithm shows small fluctuations around the average performance for a specific gap penalty, with performance increasing as the gap open and gap extension penalties increase. CUDASW++ 2.0 provides direct support for multiple GPU devices installed in a single host. It obtains a significant performance improvement over CUDASW++ 1.0 using either the optimized SIMT algorithm or the partitioned vectorized algorithm on the same platform, achieving a peak performance of 17 (30) GCUPS on the GTX 280 (GTX 295).
Even though the optimal alignment scores of the SW algorithm can be used to detect related sequences, the scores are biased by sequence length and composition. The Z-value [23-25] has been proposed to estimate the statistical significance of these scores. However, computing Z-values requires calculating a large set of pairwise alignments between random permutations of the compared sequences, which is highly time-consuming. The acceleration of Z-value computation with CUDA is therefore part of our future work.
Availability and requirements

Project name: CUDASW++
Project home page: http://cudasw.sourceforge.net/
Operating System: Linux
Programming languages: CUDA and C++
Other requirements: CUDA SDK and Toolkit 2.0 or higher; CUDA-enabled GPUs with compute capability 1.2 and higher
License: none
List of abbreviations
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture
Cell/BE: Cell Broadband Engine Architecture
FPGA: Field-Programmable Gate Array
GCUPS: Billion Cell Updates per Second
GPU: Graphics Processing Unit
GTX 280: NVIDIA GeForce GTX 280
GTX 295: NVIDIA GeForce GTX 295
OpenGL: Open Graphics Library
OS: Operating System
PBSM: Per-Block Shared Memory
PC: Personal Computer
RAM: Random Access Memory
SIMD: Single Instruction, Multiple Data
SIMT: Single Instruction, Multiple Thread
SM: Streaming Multiprocessor
SP: Scalar Processor
SSE2: Streaming SIMD Extensions 2
SW: Smith-Waterman.
Declarations
Acknowledgements
The authors would like to thank Dr. Liu Weiguo and Dr. Shi Haixiang for helping to provide the experimental environments for conducting the tests.
References
1. Smith T, Waterman M: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.
2. Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol. 1982, 162: 707-708. 10.1016/0022-2836(82)90398-9.
3. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85 (8): 2444-2448. 10.1073/pnas.85.8.2444.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
5. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
6. Oliver T, Schmidt B, Nathan D, Clemens R, Maskell D: Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW. Bioinformatics. 2005, 21 (16): 3431-3432. 10.1093/bioinformatics/bti508.
7. Oliver T, Schmidt B, Maskell DL: Reconfigurable architectures for bio-sequence database scanning on FPGAs. IEEE Trans Circuits Syst II. 2005, 52: 851-855. 10.1109/TCSII.2005.853340.
8. Li TI, Shum W, Truong K: 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinformatics. 2007, 8: 185. 10.1186/1471-2105-8-185.
9. Wozniak A: Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci. 1997, 13 (2): 145-150.
10. Rognes T, Seeberg E: Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000, 16 (8): 699-706. 10.1093/bioinformatics/16.8.699.
11. Farrar M: Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007, 23 (2): 156-161. 10.1093/bioinformatics/btl582.
12. Farrar MS: Optimizing Smith-Waterman for the Cell broadband engine. [http://farrar.michael.googlepages.com/SWCellBE.pdf]
13. Szalkowski A, Ledergerber C, Krahenbuhl P, Dessimoz C: SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2. BMC Research Notes. 2008, 1: 107. 10.1186/1756-0500-1-107.
14. Wirawan A, Kwoh CK, Hieu NT, Schmidt B: CBESW: sequence alignment on PlayStation 3. BMC Bioinformatics. 2008, 9: 377. 10.1186/1471-2105-9-377.
15. Liu W, Schmidt B, Voss G, Muller-Wittig W: Streaming algorithms for biological sequence alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems. 2007, 18 (9): 1270-1281. 10.1109/TPDS.2007.1059.
16. Manavski SA, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics. 2008, 9 (Suppl 2): S10. 10.1186/1471-2105-9-S2-S10.
17. Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes. 2009, 2: 73. 10.1186/1756-0500-2-73.
18. Nickolls J, Buck I, Garland M, Skadron K: Scalable parallel programming with CUDA. ACM Queue. 2008, 6 (2): 40-53. 10.1145/1365490.1365500.
19. Lindholm E, Nickolls J, Oberman S, Montrym J: NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro. 2008, 28 (2): 39-55. 10.1109/MM.2008.31.
20. NVIDIA CUDA programming guide version 2.0. [http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf]
21. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948.
23. Comet JP, Aude JC, Glémet E, Risler JL, Hénaut A, Slonimski PP, Codani JJ: Significance of Z-value statistics of Smith-Waterman scores for protein alignments. Computers & Chemistry. 1999, 23 (3-4): 317-331.
24. Bastien O, Aude JC, Roy S, Maréchal E: Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics. Bioinformatics. 2004, 20 (4): 534-537. 10.1093/bioinformatics/btg440.
25. Peris G, Marzal A: A screening method for Z-value assessment based on the normalized edit distance. Lecture Notes in Computer Science. 2009, 5518: 1154-1161.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.