### Query profile

To calculate *H*(*i*, *j*), the substitution score *sbt*(*S*_{1}[*i*], *S*_{2}[*j*]) from the substitution matrix is added to *H*(*i*-1, *j*-1). Because of the huge number of iterations in the SW algorithm, reducing the number of instructions needed for one cell calculation has a significant impact on execution time. In this regard, Rognes et al. [10] and Farrar [11] suggested using a query profile, laid out parallel to the query sequence, for each possible residue. A query profile is pre-calculated just once before database searches and can be organized in two layouts: a sequential layout [10] and a striped layout [11].

Given a query sequence *Q* of length *l* defined over an alphabet Σ, a query profile is defined as a set of numerical strings *P* = {*P*_{r} | *r* ∈ Σ}, where *P*_{r} is a numerical string consisting of the substitution scores required to compute a complete column (or row) of the alignment matrix; the values of *P*_{r} depend on whether the layout is sequential or striped. For a sequential query profile (see Additional file 2), *P*_{r}(*i*), the *i*-th value of *P*_{r}, is defined as

*P*_{r}(*i*) = *sbt*(*r*, *Q*[*i*]), 1 ≤ *i* ≤ *l*.

Even though the sequential query profile was originally designed for SIMD vector computation of the SW algorithm, it is also suitable for scalar computation. For SIMD vector computation, *l* is generally aligned to the vector length *VL* for performance reasons, and *Q* is padded with dummy residues whose substitution score against any residue is zero.
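The sequential layout can be sketched in a few lines of host-side code. This is an illustrative sketch, not the paper's implementation; the dictionary-based substitution matrix `sbt`, the dummy residue `'#'`, and the function name are assumptions made for the example.

```python
# Illustrative sketch: building a sequential query profile, where
# P[r][i] = sbt(r, Q[i]). The query is padded to a multiple of the vector
# length VL with a dummy residue '#' that scores 0 against every residue.
def sequential_profile(query, alphabet, sbt, vl):
    pad = (-len(query)) % vl                  # residues needed to reach a multiple of VL
    padded = query + "#" * pad
    score = lambda r, c: 0 if c == "#" else sbt[(r, c)]
    return {r: [score(r, c) for c in padded] for r in alphabet}

# Toy 2-letter alphabet: match = 2, mismatch = -1.
sbt = {(a, b): (2 if a == b else -1) for a in "AC" for b in "AC"}
P = sequential_profile("ACCA", "AC", sbt, vl=4)
# P["A"] == [2, -1, -1, 2]: scores of residue 'A' against each query position
```

A scan of column *j* then reads *P*_{r} for the current subject residue *r* sequentially, instead of indexing the substitution matrix at random.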

A striped query profile (see Additional file 3) is designed for SIMD vector computation. To construct it, given a vector length *VL*, the query sequence *Q* is divided into a set of *VL* equal-length query segments *QSEG* = {*QSEG*_{1}, *QSEG*_{2}, ..., *QSEG*_{VL}}. The length *T* of each query segment equals (*l* + *VL* - 1)/*VL*. If *l* is not a multiple of *VL*, *Q* is first padded with dummy residues; for simplicity, we assume *l* is a multiple of *VL*. Correspondingly, each numerical string *P*_{r} of a striped query profile can be viewed as a set of *T* non-overlapping, consecutive *VL*-length vector segments *VSEG* = {*VSEG*_{1}, *VSEG*_{2}, ..., *VSEG*_{T}}, where the *i*-th element of *VSEG*_{j} maps to the *j*-th element of *QSEG*_{i}. Hence, *P*_{r}(*i*) of a striped query profile is defined as

*P*_{r}(*i*) = *sbt*(*r*, *Q*[((*i* - 1) mod *VL*)·*T* + ⌈*i*/*VL*⌉]), 1 ≤ *i* ≤ *l*.
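The striped index mapping can be made concrete with a short sketch. This is illustrative only (the function name and dictionary-based substitution matrix are assumptions); it shows that lane *k* of vector segment *j* covers query position *k*·*T* + *j*.

```python
# Illustrative sketch: striped query profile layout. Element i of P_r
# (1-based) holds sbt(r, Q[((i-1) % VL) * T + ceil(i / VL)]), so consecutive
# elements of one vector segment sample the query with stride T.
def striped_profile(query, alphabet, sbt, vl):
    assert len(query) % vl == 0, "pad the query to a multiple of VL first"
    t = len(query) // vl                      # segment length T
    return {r: [sbt[(r, query[((i - 1) % vl) * t + (i - 1) // vl])]
                for i in range(1, len(query) + 1)]
            for r in alphabet}

# With VL = 4 and T = 2, the first vector segment covers query
# positions 1, 3, 5, 7 and the second covers positions 2, 4, 6, 8.
match = {(r, c): 1 if r == c else 0 for r in "ACGT" for c in "ACGT"}
prof = striped_profile("ACGTACGT", "ACGT", match, vl=4)
```

For the query `"ACGTACGT"`, the striped residue order is `A G A G C T C T`, so `prof["A"]` marks positions 1 and 3 of the striped string.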

### Optimized SIMT Smith-Waterman algorithm using CUDA

The SIMT SW algorithm used by CUDASW++ 1.0 is designed around the SIMT abstraction of CUDA-enabled GPUs, which enables thread-level parallelism for independent scalar threads as well as data parallelism for coordinated threads. It completes database searches in two stages: the first stage exploits inter-task parallelization based on thread-level parallelism, and the second stage exploits intra-task parallelization based on data parallelism. Since the first stage dominates the total runtime when searching a large database, the optimizations of CUDASW++ 2.0 focus on this stage. The performance of CUDASW++ 2.0 is significantly improved by the following optimizations: introducing a sequential query profile and using a packed data format.

A sequential query profile stored in texture memory is used to replace random access to the substitution matrix in shared memory. Inspired by the fact that texture instructions output filtered samples, typically a four-component (RGBA) color [19], the sequential query profile is re-organized using a packed data format, where each numerical string *P*_{r} is packed and represented using the *char4* vector data type instead of the *char* scalar data type. In this way, four substitution scores are loaded with only one texture fetch, thus significantly improving texture memory throughput. Like the query profile, each subject sequence is also re-organized using a packed data format, where four successive residues of each subject sequence are packed together and represented using the *uchar4* vector data type. In this case, when using the cell block division method, the four residues loaded by one texture fetch are further stored in shared memory for use in the inner loop (see the pseudocode of the CUDA kernel shown in Figure 1). In addition, some redundant operations are removed to improve instruction pipeline throughput.
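The effect of the packed format can be sketched in a few lines. This is an illustrative host-side analogue, not the paper's code: grouping a profile row into 4-component elements so that one fetch (here, one list index) yields four scores, as a single texture fetch of a *char4* does on the GPU.

```python
# Illustrative sketch: pack one numerical string P_r into 4-component
# elements, mimicking CUDA's char4 type. One "fetch" (list index) then
# returns four substitution scores instead of one.
def pack4(row):
    assert len(row) % 4 == 0, "profile length must be a multiple of 4"
    return [tuple(row[i:i + 4]) for i in range(0, len(row), 4)]

packed = pack4([2, -1, -1, 2, 0, 0, 3, 1])
# packed[0] == (2, -1, -1, 2) and packed[1] == (0, 0, 3, 1):
# two fetches cover eight query positions
```

The same idea applies to subject sequences packed as *uchar4*: one fetch supplies four residues, which the kernel then reuses from shared memory in the inner loop.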

### Basic vectorized Smith-Waterman algorithm using CUDA

The basic vectorized SW algorithm is designed by directly mapping the striped SW algorithm [11] onto CUDA-enabled GPUs, based on the virtualized SIMD vector programming model. For the computation of each column of the alignment matrix, the striped SW algorithm consists of two loops: an inner loop that calculates local alignment scores under the assumption that *F* values do not contribute to the corresponding *H* values, and a lazy-F loop that corrects any errors introduced by the inner loop. The basic vectorized algorithm uses a striped query profile. For a specific column of the alignment matrix, the inner loop completes in *T* iterations by moving SIMD vectors sequentially through all vector segments of *P*_{r} corresponding to that column. For convenience of discussion, define *vecH*(*i*, *j*), *vecE*(*i*, *j*) and *vecF* to denote the *H*, *E* and *F* values of the cells corresponding to *VSEG*_{i} of *P*_{r}, where 1 ≤ *i* ≤ *T*, for the *j*-th column of the alignment matrix. Using virtualized SIMD vectors, several technical issues have to be addressed in this CUDA algorithm, including saturated arithmetic operations, shift operations and predicate operations on virtualized vectors.

Saturated additions and subtractions are required to calculate alignment scores. Since CUDA-enabled graphics hardware lacks direct support for these operations, they are emulated using maximum and minimum operations. The integer functions *max(x, y)* and *min(x, y)* in the CUDA runtime library are used to avoid divergence. Shift operations on vectors are required by both the inner and lazy-F loops. We implement these operations using shared memory: all threads comprising a virtualized vector write their original values to a shared memory buffer and then read their resulting values from the buffer according to the number of shift elements. Additional file 4 gives the CUDA pseudocode for shifting a virtualized vector by *n* elements to the left. As the pseudocode shows, one shift operation is time-consuming compared with vector register operations on a real SIMD vector architecture, even though access to shared memory without bank conflicts has a much lower latency than device memory [20].
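Both primitives can be sketched in scalar form. This is an illustrative emulation under stated assumptions: scores are clamped to the unsigned-byte range [0, 255], and "shift left by *n*" is taken to mean lane *i* receives the value of lane *i* - *n*, with the first *n* lanes filled by a boundary value (the exact direction convention is an assumption for illustration).

```python
# Illustrative sketch of the two primitives described above.
# (1) Saturated arithmetic built from max/min, mirroring the max(x, y) /
#     min(x, y) CUDA integer functions:
def sat_add_u8(x, y):
    return min(x + y, 255)        # clamp at the top of the unsigned byte range

def sat_sub_u8(x, y):
    return max(x - y, 0)          # clamp at zero instead of wrapping around

# (2) Shifting a virtualized vector by n elements: each "thread" writes its
#     value to a buffer (shared memory on the GPU), then reads the value n
#     positions earlier; the first n lanes receive the fill value.
def shift_left(vec, n, fill=0):
    buf = list(vec)
    return [fill] * n + buf[:len(vec) - n]
```

On the GPU the buffer round-trip requires a synchronization among the threads of the virtualized vector, which is why one shift is expensive compared with a register shift on real SIMD hardware.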

The lazy-F loop requires predicate operations on virtualized vectors when determining whether to continue or exit the loop by checking *vecF* against the values of *vecH*(*i*, *j*). One approach is to use shared memory to simulate these operations. Although effective, this approach is inefficient due to the overhead incurred by the accesses to shared memory. Fortunately, CUDA-enabled GPU devices with compute capability 1.2 and higher provide two warp vote functions, *__all(int)* and *__any(int)*, offering an indispensable capability to perform fast predicate operations across all threads within a warp. We use the *__all(int)* warp vote function to implement the predicate operations on virtualized vectors for the lazy-F loop.
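The vote-based exit test can be sketched as follows. This is an illustrative analogue, not the kernel's code: `__all(pred)` is true only when the predicate holds in every lane of the warp, which we emulate with Python's `all()`; the function name and the exact exit condition are assumptions for illustration.

```python
# Illustrative sketch of the lazy-F exit test. The loop may stop once no
# lane's F value can still improve its H value, i.e. when the predicate
# holds in every lane -- the role played by __all(int) on the GPU.
def lazy_f_may_exit(vec_f, vec_h, gap_open):
    # Mirrors: if (__all(vecF < vecH - gapOpen)) break;
    return all(f < h - gap_open for f, h in zip(vec_f, vec_h))
```

A single vote instruction replaces a shared-memory reduction across the threads of the virtualized vector, which is why it is markedly cheaper.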

The striped query profile is stored in texture memory to exploit the texture cache. Subject sequences and the query profile are stored using the scalar data type in an unpacked fashion because the inner loop is a *for* loop without manual unrolling. The intermediate element values of *vecH*(*i*, *j*) and *vecE*(*i*, *j*) are stored in global memory, with *vecF* stored in registers, to support long query sequences. To improve global memory access efficiency, we use the unsigned half-word data type to store the *H* and *E* values in global memory.

### Partitioned vectorized Smith-Waterman algorithm using CUDA

To gain higher performance, we have investigated a partitioned vectorized SW algorithm using CUDA. This algorithm first divides a query sequence into a series of non-overlapping, consecutive small partitions according to a specified partition length (*PL*), and then aligns the query sequence to a subject sequence partition by partition. For the partitioned vectorized algorithm, *PL* must be a multiple of *VL*. The alignment between one partition of the query sequence and the subject sequence is performed using the basic vectorized algorithm. Because *PL* is usually set relatively small, shared memory or registers can be used to store the alignment scores.
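The partitioning step itself can be sketched briefly. This is an illustrative sketch under stated assumptions: the tail is padded with a dummy residue `'#'` so the last partition also has length *PL* (how the paper pads the final partition is not specified here).

```python
# Illustrative sketch: split a query into consecutive partitions of length
# PL (which must be a multiple of VL); each partition is then treated as a
# new query sequence with its own striped query profile.
def partition_query(query, pl, vl):
    assert pl % vl == 0, "PL must be a multiple of VL"
    pad = (-len(query)) % pl                 # pad the tail to a full partition
    padded = query + "#" * pad               # '#' is an assumed dummy residue
    return [padded[i:i + pl] for i in range(0, len(padded), pl)]

# partition_query("ACGTACGTAC", pl=8, vl=4) -> ["ACGTACGT", "AC######"]
```

Each partition is short enough that its column of *H* and *E* values fits in shared memory or registers, which is the point of partitioning.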

This algorithm considers each partition as a new query sequence and constructs a striped query profile for each partition. However, the partitioned vectorized algorithm makes the alignment of one partition depend on the alignment scores of the previous partition in the alignment matrix (see Figure 2). More specifically, the computation of the *H* and *F* values of the first row of one partition depends on the *H* and *F* values of the last row of the previous partition (note that the positions of the first and last rows are unchanged for a striped query profile regardless of the specific values of *PL* and *VL*). Hence, after completing the computation of one column of a partition, the *H* and *F* values of the last cell in this column have to be saved for future use by the next partition. For performance reasons, instead of directly storing the *F* value of the last cell of the current partition, we store the *F* value of the first cell of the next partition, calculated from the *H* and *F* values of the last cell in the current partition.

However, there is a subtlety in the striped algorithm: for a specific column, after exiting the lazy-F loop, the *H* and *E* values of all cells are guaranteed to be correct, but there is no guarantee that *vecF* stores the correct *F* value of the last cell. This is because the lazy-F loop is likely to quit, with high probability, without re-calculating all the cell values. Since the *H* values of all cells in the last row of the previous partition are always correct, the correctness of our partitioned vectorized algorithm is established if we can prove that the correctness of the *F* values of all cells in the last row of the previous partition does not affect the correct computation of the *F* values of all cells in the first row of the current partition. In the following, we prove that the *F* values of all cells in the first row of the current partition are always correctly calculated.

**Theorem 1**. For the partitioned vectorized SW algorithm, the *F* values of all cells in the first row of the current partition are correctly computed regardless of the correctness of the *F* values of all cells in the last row of its previous partition.

**Proof**. Taking cells *B* and *C* in Figure 2 as an example, define *C*_{F} to denote the *F* value of *C*, *B*_{H} to denote the *H* value of *B*, and *B*_{F} to denote the *F* value of *B*, where *C*_{F} = *max*(*B*_{H} - *ρ* - *σ*, *B*_{F} - *σ*) according to equation (1). For the striped SW algorithm, the correctness of the *F* value of the last cell for a specific column *j* in the alignment matrix depends on two possible conditions.

*Case* 1: the lazy-F loop does not stop until the *F* values of all cells in column *j* have been corrected, because the *F* value of each cell contributes to its *H* value. In this case, due to the recalculation of all cells, *vecF* stores the correct *F* value of the last cell. Since both *B*_{H} and *B*_{F} are correct, *C*_{F} is definitely correctly calculated.

*Case* 2: the lazy-F loop stops after some iterations with no need to recalculate all cells. This means that the *F* values of the remaining cells will not contribute to their corresponding *H* values, but might not be equal to the correct *F* values directly calculated using equation (1). In this case, because *B*_{F} - *σ* ≤ *B*_{H} - *ρ* - *σ*, *C*_{F} is equal to *B*_{H} - *ρ* - *σ*, so that *C*_{F} does not depend on *B*_{F}.

From the above discussion, we conclude that *C*_{F} can always be correctly calculated regardless of whether *B*_{F} is correct. Therefore, the theorem is proven.
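The theorem can be checked numerically. This is an illustrative sketch of the recurrence *C*_{F} = *max*(*B*_{H} - *ρ* - *σ*, *B*_{F} - *σ*) with toy gap penalties (the variable names and values are assumptions): whenever the stored *B*_{F} satisfies the Case 2 inequality, its exact value does not change *C*_{F}.

```python
# Illustrative numeric check of Theorem 1. rho = gap open penalty,
# sigma = gap extension penalty, per equation (1) in the text.
def c_f(b_h, b_f, rho, sigma):
    return max(b_h - rho - sigma, b_f - sigma)

rho, sigma, b_h = 10, 2, 50
# Case 2: any B_F with B_F - sigma <= B_H - rho - sigma gives the same C_F,
# so a stale (possibly wrong) B_F left by an early lazy-F exit is harmless.
correct_b_f = 20                   # 20 - 2 = 18 <= 50 - 10 - 2 = 38
stale_b_f = 5                      # a smaller leftover value
# c_f(b_h, correct_b_f, ...) and c_f(b_h, stale_b_f, ...) both yield 38.
```

Only when *B*_{F} - *σ* exceeds *B*_{H} - *ρ* - *σ* would its value matter, and in that situation the lazy-F loop would not have exited, so *B*_{F} would be correct (Case 1).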

The partitioned vectorized algorithm stores the values of *vecH* and *vecE* of one column for a partition in registers to achieve peak performance. Using registers, the inner and lazy-F loops are manually unrolled according to the number of vector segments. Since the computation of the inner loop is fully unrolled, access to the striped query profile can be optimized using the packed data format. As in the optimized SIMT algorithm, each numerical string *P*_{r} of this query profile is packed and represented using the *char4* vector data type according to the access order of the threads comprising a virtualized vector. However, the subject sequences are not packed, because we found no performance improvement from using the packed data format for them. Figure 3 shows the pseudocode of the CUDA kernel of the partitioned vectorized algorithm.