Accelerating large-scale protein structure alignments with graphics processing units

Background Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, places heavy demands on computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign can incorporate many existing methods, such as TM-align and Fr-TM-align, into its parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues arising from protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massively parallel computing power of GPUs.


Background
Large-scale protein structure comparison is becoming an increasingly important approach to understanding biological systems [1,2]. Given a database of protein structures, the main goal is either to find proteins that are structurally similar to a given protein (i.e., one-against-all comparison) or to build various connectivity among proteins by performing exhaustive comparisons on the whole database (i.e., all-against-all comparison). The results of structural comparison are useful in discovering potential structural, evolutionary, and functional relationships among these proteins and have a significant impact on structure-based drug design [3], protein-protein docking [4], and other biological findings [5]. Recently, the dramatic increase in protein structural data [6] has led to an ever-increasing demand for structure alignment tools that can not only find accurate alignments at the residue level but also complete large-scale structure comparisons in a reasonable time.
Several approaches have been developed to address the limitations of traditional alignment methods and tackle the computational issues. The traditional alignment methods [1,2,7], such as DALI [8], CE [9], TM-align [10], Fr-TM-align [11], and MAMMOTH [12], compare residues or fragments to build initial alignments, which are then optimized by various procedures, such as Monte-Carlo, combinatorial search, and dynamic programming. These methods can provide accurate alignments at the residue level but are usually computationally expensive, which makes them infeasible for very large datasets. To accelerate this process, one approach is to map the protein structures into 1D sequences and then use various sequence alignment methods to align two structures [13,14]. Another approach [15] utilizes a "bag of words" method, which depends on the frequency of specific structural patterns, to provide speedy structure matching and filtering. These approaches significantly improve efficiency for large datasets; however, this is often achieved at the cost of topological detail, which can lead to lower accuracy than the traditional structural comparison methods or make them unsuitable for residue-level alignment. A further approach is to parallelize traditional algorithms using a cluster or grid environment consisting of thousands of computing nodes [16,17]. These approaches can deliver both efficiency and accuracy but require high-performance computing environments, which are energy-consuming and may not be accessible to biologists.
With the increase in performance and programmability of many-core Graphics Processing Units (GPUs), more and more bioinformatics applications have been deployed on GPUs and have shown promising speedups over their conventional CPU implementations. Liu et al. [18] implemented a GPU-based Smith-Waterman algorithm [19] for pair-wise DNA sequence alignment. Later, the efficiency of sequence alignments was continuously improved in [20-23]. Vouzis and Sahinidis developed GPU-BLAST (Basic Local Alignment Search Tool) [24] to accelerate NCBI-BLAST [25]. Hung et al. developed a method for calculating the RMSD (Root Mean Square Deviation) after superposition on an ATI GPU card [26]. Stivala et al. utilized simulated annealing (SA) to develop a protein substructure searching algorithm, SA Tableau Search, to find structural motifs at the level of secondary structure elements (SSEs) [27]. It is worth mentioning that, according to the literature, SA Tableau Search is the first attempt to apply GPUs to protein structure comparison at the SSE level. Other applications include protein-protein docking [28] and statistical phylogenetics [29].
In this paper, we present ppsAlign, a parallel protein structure Alignment framework which is designed and optimized to exhaustively exploit the parallelism of the GPU architecture for residue-level structure comparisons. Our experimental results (reported on an NVIDIA Tesla C2050 GPU card) show that ppsAlign significantly outperforms existing structural alignment tools in computational efficiency.
We believe that the GPU's massively parallel computing power can open the door to a cost-effective and high-performance computing environment that can benefit the structural biology community.

Overview
The framework of ppsAlign is shown in Figure 1. The inputs include a target protein and a protein database $\Lambda = \{P_1, P_2, \ldots, P_n\}$. The outputs are structure alignments between the target protein and each database protein.
The online alignment starts by generating initial sets of matched fragments and the corresponding alignments. The initial alignments are then extended and refined using Dynamic Programming to obtain the final results. Specifically, the ppsAlign algorithm consists of 5 steps: 1) Index-based matched fragment set (MFS) search finds the top $N_{seed}$ seed MFS' between the target protein and each database protein; 2) Fragment-level alignment assembles the MFS' and generates initial alignments; 3) Residue-level alignment refines the initial alignments into residue alignments; 4) Maximal alignment search finds a transformation that best superimposes the entire target protein over each database protein based on the obtained residue alignments; 5) Final assessment calculates the z-Score and evaluates the statistical significance of the alignments. Steps 1) and 5) are executed on the CPU, while steps 2) to 4), the most time-consuming parts of ppsAlign, are implemented as GPU kernels and executed iteratively on the GPU $N_{iter}$ times. The GPU kernels are developed using the CUDA (Compute Unified Device Architecture) programming model [30]. During the alignment, the protein structures and the intermediate results from each GPU kernel are stored in the GPU's on-board memory, namely read-only constant memory, read-only texture memory, and read-write global memory. Generally, the constant and texture memories have limited capacity but a higher access rate than the global memory. For an overview of the GPU architecture and the CUDA model, readers are referred to [30,31]. To facilitate the search for structurally similar fragments in the protein database, ppsAlign has an off-line component that pre-processes substructures from the entire protein database and builds an indexing tree to allow fast retrieval.

Index-based matched fragment set search
The purpose of this CPU-based step is to quickly find all possible matched fragment sets (MFS') between the target protein and each database protein for further refinement. It is based on an information retrieval (IR) approach which goes beyond the capability of the traditional "bag of words" concept by introducing spatial relationships among the fragments. Let $Q = q_1, q_2, \ldots, q_{L_Q}$ and $P = p_1, p_2, \ldots, p_{L_P}$ be a target protein with $L_Q$ residues and a database protein with $L_P$ residues, respectively. Here, $q$ and $p$ represent the 3D coordinates of the Cα atoms. A fragment $f$ is a set of $L_f$ (= 8) contiguous residues, ordered from the N terminus to the C terminus along the protein backbone. An MFS consists of two non-empty subsets, $F_Q$ and $F_P$, which contain ordered fragments from Q and P, respectively, that satisfy certain criteria of structural similarity. The fragments in an MFS are then used to generate a rough alignment between Q and P in the fragment-level alignment.
The MFS search utilizes the substructure mapping method of the Index-based Substructure Alignment algorithm [32], developed by the authors, to retrieve similar fragments from the database proteins. In this method, substructures of the database proteins, extracted by a large set of pairs of windows along the backbones, are indexed off-line by an indexing tree in which similar substructures are clustered into the same leaf node, denoted by $t^{\Lambda}_i$, and one substructure is selected as the representative of each leaf node. Such representative structures preserve certain topological information, both locally and globally, from two disjoint substructures with various ranges of distances. Similarly, substructures in the target protein Q are indexed by an indexing tree in which each leaf node is denoted by $t^{Q}_i$. The representative substructure of each $t^{Q}_i$ is used to search the indexing tree of the database, and a list of best matched $t^{\Lambda}$ is returned. For simplicity, we use $t$ to denote $t^{Q}_i$ and $t^{\Lambda}$. The database proteins that have substructures in $t^{\Lambda}$ can be found by an inverted index. Such a database protein, P, can be represented by an ordered list of substructures, denoted by $\Omega^{P}_t$, occurring in $t$. Likewise, the protein Q can be represented by an ordered list of substructures, denoted by $\Omega^{Q}_t$, occurring in $t$. As substructures identified by the same $t$ are similar, they can be used as "anchors" for rough alignments. For a detailed explanation of the substructure mapping method, readers are referred to [32].
In ppsAlign, substructures are further projected into fragments as follows: if any residue of a substructure from $\Omega^{P}_t$ (or $\Omega^{Q}_t$) is located in a fragment, the fragment is selected and added to $F_P$ (or $F_Q$). The fragment subsets $F_P$ and $F_Q$ are used to construct an MFS between the proteins Q and P. After searching all $t^{Q}$, we obtain all possible MFS' between Q and the database proteins, if any. In this step, if the algorithm cannot find any MFS for a database protein, all the fragments from Q and that database protein are selected to form an MFS. An example of MFS searching and construction is illustrated in Additional file 1: Figure S1.
Figure 1 Framework of ppsAlign. The framework consists of both GPU- and CPU-based processes. The input includes a target protein and database proteins. The output contains all the structural alignment results between the target protein and each database protein.
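The projection of substructure residues onto fragments can be illustrated with a minimal Python sketch. It assumes non-overlapping fragments of 8 residues and 0-based residue indices, which the text does not state explicitly; the function names `project_to_fragments` and `fallback_mfs` are hypothetical and are not part of ppsAlign.

```python
L_F = 8  # fragment length L_f given in the text


def project_to_fragments(substructure_residues, chain_length, fragment_length=L_F):
    """Return the sorted indices of fragments touched by any substructure residue.

    Assumption: fragments are non-overlapping, and trailing residues that do
    not fill a whole fragment are ignored.
    """
    n_fragments = chain_length // fragment_length
    selected = set()
    for residue in substructure_residues:
        fragment = residue // fragment_length
        if fragment < n_fragments:
            selected.add(fragment)
    return sorted(selected)


def fallback_mfs(len_q, len_p, fragment_length=L_F):
    """All fragments of both chains, used when no MFS is found for a protein."""
    return (list(range(len_q // fragment_length)),
            list(range(len_p // fragment_length)))


# Residues 5..9 straddle fragments 0 and 1; residue 40 lies in fragment 5.
print(project_to_fragments([5, 6, 7, 8, 9, 40], chain_length=100))  # [0, 1, 5]
```

Integer division keeps the mapping branch-free, which would also suit a later port of this step to the GPU.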
After the MFS search, a filtering process is called to remove redundant MFS'. Then, the non-redundant MFS' between Q and each database protein are ranked according to the scoring function $S_{MFS}$ and the top $N_{seed}$ sets are selected. The scoring function is defined as follows: where $N_Q$ and $N_P$ denote the cardinality of $F_Q$ and $F_P$ in an MFS, respectively.
$N^{f}_Q = L_Q / L_f$ and $N^{f}_P = L_P / L_f$ are the numbers of fragments in the target protein and a database protein, respectively. The third term of the above scoring function is used to favor MFS' which have comparable $N_Q$ and $N_P$. The values $w_1$, $w_2$, and $w_3$ are used to weight the contributions from the three terms.
The data needed by ppsAlign in order to compute the alignments on GPU are: structures of the protein Q and of the database proteins, and MFS'. To allow efficient processing, those data must be judiciously laid out on the GPU memories. Specifically, the database structures are transferred to the texture memory before execution. The MFS' are transferred from CPU memory to GPU global memory as inputs to the fragment-level alignment (see Figure 1). Finally, the structure of protein Q is stored in the constant memory, which has smaller capacity but lower access latency compared to the texture memory.

Fragment-level alignment
In this step, the fragments in each MFS are assembled to obtain initial alignments using Dynamic Programming (DP). For a given MFS, the DP algorithm first sorts the fragments from $F_Q$ and $F_P$ according to their locations in Q and P. Then, it computes the similarity score $S_f(i, j)$ of each fragment pair for $1 \le i \le N_Q$ and $1 \le j \le N_P$ using the following recurrence: where $G_f$ is the gap penalty and $S_f$ is based on the inverse cosine distance of the fragments' feature vectors. Given a fragment pair, A and B, and their corresponding feature vectors $D_A$ and $D_B$, $S_f$ is calculated as follows: where $\langle D_A, D_B \rangle$ is the inner product of $D_A$ and $D_B$, and $\|D_A\|$ and $\|D_B\|$ are the norms of $D_A$ and $D_B$, respectively. In the current implementation, the features use only the Euclidean distance of each residue pair for fast calculation. The main reason for using feature distance as an approximate measure of fragment similarity is the need for simple control paths due to the SIMT (Single Instruction, Multiple Thread) computing mode of the GPU [30]. Traditional methods usually calculate the RMSD and find an optimal transformation using the Kabsch algorithm [33], which contains complex control flows and is therefore not suitable for the SIMT mode. This step provides a rough alignment result which will be refined by the residue-level alignment.
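To make the fragment-level step concrete, the following Python sketch builds a distance-based feature vector, scores two fragments, and aligns fragments by DP. It rests on several assumptions that the text does not confirm: the feature vector is taken to be all intra-fragment pairwise distances, the fragment score is taken to be plain cosine similarity, and the recurrence is the standard global-alignment form with a constant (negative) gap penalty. All function names are hypothetical.

```python
import math


def feature_vector(coords):
    """Assumed feature: Euclidean distances between all residue pairs of a fragment."""
    n = len(coords)
    return [math.dist(coords[a], coords[b])
            for a in range(n) for b in range(a + 1, n)]


def cosine_similarity(d_a, d_b):
    """<D_A, D_B> / (||D_A|| * ||D_B||); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(d_a, d_b))
    norm = math.sqrt(sum(x * x for x in d_a)) * math.sqrt(sum(y * y for y in d_b))
    return dot / norm if norm > 0 else 0.0


def align_fragments(score, gap):
    """Global DP over a fragment score matrix.

    Assumed recurrence: best of a match (diagonal) or a gap in either chain.
    Returns the final score and the list of aligned (i, j) index pairs.
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Back tracing: follow whichever predecessor produced each cell.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return dp[n][m], path[::-1]


fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in fragment]
similarity = cosine_similarity(feature_vector(fragment), feature_vector(shifted))
```

Because internal distances are invariant under rotation and translation, the score needs no superposition, which is the property the paragraph above relies on to avoid the Kabsch step.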

GPU computation for fragment-level alignment
The pseudo-code in Figure 2 describes the fragment-level alignment. The algorithm splits the computation into three GPU kernels. The first kernel computes the fragment scores $S_f$ by assigning a database protein to each thread. This kernel performs all-against-all fragment comparisons and writes the similarity scores into the GPU global memory.
The second GPU kernel implements the DP algorithm, whereas the third one performs back tracing. The total number of threads $N_T$ that can run concurrently on the GPU is mainly limited by the global memory capacity of the GPU (in this phase each thread requires approximately 10 kB of memory). Suppose that the total number of MFS' between Q and all database proteins is $N_F$. If $N_F > N_T$, the MFS' are divided into $N_{batch} = \lceil N_F / N_T \rceil$ batches. ppsAlign schedules the batches to run sequentially on the GPU. In each batch, the DP is first executed as a GPU kernel in which each thread corresponds to an MFS. Then, the GPU kernel for back tracing is called to obtain the alignment path for each MFS. When a batch terminates, ppsAlign transfers the output (i.e., the alignment path for each MFS) from the GPU memory to the CPU memory. After aggregating the outputs from all batches, ppsAlign first performs filtering to remove redundant alignments, and then assembles all the fragments along the alignment paths to form residue alignments, which are further refined by the residue-level alignment.
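The batching arithmetic can be sketched in a few lines of Python. The 10 kB per-thread figure comes from the text; treating the whole global memory as available to the threads is a simplifying assumption, and the function names are hypothetical.

```python
import math


def max_concurrent_threads(global_memory_bytes, bytes_per_thread=10 * 1024):
    """Upper bound on N_T, assuming all global memory is available to threads."""
    return global_memory_bytes // bytes_per_thread


def make_batches(n_mfs, n_threads):
    """Split N_F MFS' into ceil(N_F / N_T) half-open index ranges."""
    n_batch = math.ceil(n_mfs / n_threads)
    return [(b * n_threads, min((b + 1) * n_threads, n_mfs))
            for b in range(n_batch)]


# Ten MFS' with room for four threads give three batches.
print(make_batches(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each range would be processed by one DP kernel launch followed by one back-tracing launch, with the alignment paths copied back to the host between batches.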
It is critically important to utilize the limited memory resources of the GPU effectively. Our GPU memory allocation scheme is exemplified in Figure 3. The MFS' are stored in a 2D block of size ($N_T \times N_S$), where $N_S$ is the maximal size of all MFS'. Each thread of the DP kernel fetches an MFS to initialize its setting. The score and direction matrices are stored in a separate 3D memory block of size ($N_Q \times N_P \times N_T$), where $N_Q$ and $N_P$ represent the maximal number of fragments from the target protein and all the database proteins, respectively. The alignment paths are then stored in a 2D block of size ($N_P \times N_T$). In ppsAlign, multiple GPU memory accesses are coalesced into a single transaction whenever possible. This fragment-level alignment process provides a selection of seed fragments which are likely to lead to an accurate alignment. Only approximately 1.6% of the total execution time is spent in this phase.
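One way to lay out the 3D score block so that accesses can coalesce is to let the thread index vary fastest. The text gives only the block dimensions, so the ordering below is an assumption, and `flat_index` is a hypothetical helper.

```python
def flat_index(i, j, t, n_p, n_t):
    """Offset of DP cell (i, j) for thread t in an N_Q x N_P x N_T block.

    Assumption: the thread index t is the fastest-varying dimension, so that
    threads t and t + 1 reading the same cell touch adjacent addresses.
    """
    return (i * n_p + j) * n_t + t


# Neighbouring threads working on cell (1, 2) are one element apart.
stride = flat_index(1, 2, 1, n_p=3, n_t=4) - flat_index(1, 2, 0, n_p=3, n_t=4)
```

With this ordering, threads that all process DP cell (i, j) in the same step read a contiguous run of memory, which is the access pattern that coalescing favors.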

Residue-level alignment
The results of the fragment-level alignment are then refined by a residue-level alignment process. Such a refined alignment result is an ordered set $R = \{(q_i, p_i) \mid q_i \in Q', p_i \in P'\}$, where $Q' \subseteq Q$ (target protein) and $P' \subseteq P$ (database protein).
In this step, a rigid-body transformation (rotation and translation) $T$ that minimizes the RMSD of $R$ is first calculated. Then, the transformation $T$ is used to superimpose all the residues from Q over P. Finally, the DP algorithm is used to find an alignment path between Q and P, similar to the fragment-level alignment. In the DP, the gap penalty $G_r$ is set to 0 and the residue similarity score $S_r$ uses the scoring function from TM-align [10]. However, our framework can be configured to use any suitable residue-level scoring function [1].
As mentioned previously, the complex control flows present in the traditional method for computing $T$ (e.g., the Kabsch algorithm [33]) make it unsuitable for the SIMT computing model of the GPU. To address this issue, we implement and optimize a fast algorithm based on the quaternion-based characteristic polynomial (QCP) [34], gRMSD-QCP, to determine the transformation $T$ on the GPU. In the gRMSD-QCP kernel, the coordinates of residues from the two protein structures are first written into the GPU global memory and the origin of coordinates is moved to the center of coordinates of each protein.
Then, the inner product of the two coordinate matrices is calculated, which is used by QCP for the RMSD calculation. The work flow of gRMSD-QCP is relatively simple and therefore amenable to an efficient GPU implementation.
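The two preprocessing steps described above, centering and the inner product of the coordinate matrices, can be sketched in plain Python. This covers only the inputs that QCP consumes; the eigenvalue search of QCP itself is omitted, and the function names are hypothetical.

```python
def center(coords):
    """Translate a structure so that its centroid lies at the origin."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    return [[c[k] - centroid[k] for k in range(3)] for c in coords]


def inner_products(a, b):
    """Return (M, G_a, G_b) for two centered, equally long coordinate lists.

    M[x][y] = sum_i a[i][x] * b[i][y] is the 3x3 inner-product matrix;
    G_a and G_b are the summed squared norms of each structure.
    """
    m = [[sum(p[x] * q[y] for p, q in zip(a, b)) for y in range(3)]
         for x in range(3)]
    g_a = sum(v * v for p in a for v in p)
    g_b = sum(v * v for q in b for v in q)
    return m, g_a, g_b


structure = center([(0, 0, 0), (2, 0, 0), (0, 2, 0), (0, 0, 2)])
m, g_a, g_b = inner_products(structure, structure)
```

Both functions consist of fixed-length loops without data-dependent branches, which matches the stated reason for preferring QCP on SIMT hardware.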

GPU computation for residue-level alignment
The GPU implementation of the residue-level alignment starts by loading the coordinates of residues from $R$ into the GPU global memory. Next, the gRMSD-QCP kernel is invoked to calculate the transformation $T$, which is also written into the GPU global memory. Finally, a DP kernel is called to find residue alignments, which are transferred to the CPU memory after the kernel terminates. As in the fragment-level alignment phase, the residue-level alignments are divided into batches according to the memory requirement of the threads. After all the batches are executed, ppsAlign aggregates the output residue alignments $R$, which are used in the next step to search for the maximal alignment.

Maximal alignment search
The maximal alignment search is used to find the largest subset $M \subseteq R$ such that the score of the residue alignment $R$, denoted by $S_a$, is maximized. Because finding the largest subset $M$ is extremely time-consuming, a heuristic and approximate algorithm, MaxSub [35], has been developed to solve this problem. In ppsAlign, a variant of MaxSub, gMaxSub, is designed to parallelize the search process on the GPU. In the current implementation of ppsAlign, $S_a$ is defined using the TM-score [10].
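For readers unfamiliar with the TM-score, the following Python sketch computes it from the distances of already superimposed residue pairs. The distance scale d0 follows the usual TM-score definition; normalizing by the target chain length and rejecting very short chains are assumptions made here, and the function names are hypothetical.

```python
def tm_d0(length):
    """Length-dependent distance scale d0 of the TM-score."""
    if length <= 18:
        # Assumption: the formula is not meaningful for very short chains.
        raise ValueError("d0 is only defined here for chains longer than 18 residues")
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score of aligned pairs, given their distances after superposition.

    Assumption: the score is normalized by the target chain length.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


perfect = tm_score([0.0] * 100, target_length=100)
```

A pair at distance 0 contributes 1 and a pair at distance d0 contributes 0.5, so the score rewards many close pairs rather than a small RMSD over a few pairs.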

GPU computation for maximal alignment search
The input of this step is the alignment $R$ from the residue-level alignment, which has $L_R$ aligned residue pairs. The original MaxSub algorithm on the CPU searches for the largest subset $M$ by shifting a window $W$ of size $L_W$ along $R$ (see Figure 4a). This results in ($L_R - L_W + 1$) shift operations, which are candidates for parallelization. gMaxSub therefore searches for the maximal alignment by concurrently dispatching the calculation for each $W$ to a different GPU thread (see Figure 4b). Figure 5 gives the pseudo-code of gMaxSub. First, for each residue alignment $R$ between Q and P, ($L_R - L_W + 1$) windows are generated. Second, the gRMSD-QCP kernel is invoked to calculate the transformation $T$ for the residue pairs within each $W$, and $T$ is then used to superimpose the residues from Q over P in $R$. Third, a residue pair $(q_i, p_i) \in R$ is added to $W$ if its distance is below a cutoff (4.0 Å) after the superimposition. The above two steps (i.e., gRMSD-QCP and window extension) are executed iteratively $N_{MS}$ times. Fourth, the last $W$ is assigned to $M$ and $S_a$ is calculated.
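The window enumeration and extension loop can be sketched sequentially in Python. The superposition step is passed in as a callable, `superimpose`, which is a hypothetical stand-in for gRMSD-QCP; on the GPU each window would be handled by its own thread rather than by a loop.

```python
import math

CUTOFF = 4.0  # distance cutoff in angstroms, from the text


def seed_windows(l_r, l_w):
    """Half-open index ranges of the (L_R - L_W + 1) windows over R."""
    return [(start, start + l_w) for start in range(l_r - l_w + 1)]


def extend_window(pairs, superimpose, window, n_ms, cutoff=CUTOFF):
    """Refit and extend one window for n_ms iterations.

    pairs: list of (q, p) coordinate tuples from the alignment R.
    superimpose: hypothetical callable that takes the selected pairs and
        returns a function mapping a coordinate of Q into the frame of P.
    """
    selected = set(range(*window))
    for _ in range(n_ms):
        transform = superimpose([pairs[i] for i in sorted(selected)])
        for i, (q, p) in enumerate(pairs):
            if math.dist(transform(q), p) < cutoff:
                selected.add(i)
    return sorted(selected)


def identity_fit(selected_pairs):
    """Trivial stand-in that leaves coordinates unchanged."""
    return lambda q: q


# Five pairs already coincide; the sixth is 10 angstroms apart.
pairs = [((float(i), 0.0, 0.0), (float(i), 0.0, 0.0)) for i in range(5)]
pairs.append(((5.0, 0.0, 0.0), (5.0, 10.0, 0.0)))
windows = seed_windows(len(pairs), 3)
extended = extend_window(pairs, identity_fit, windows[0], n_ms=2)
```

Since the windows do not depend on each other, the outer loop over `windows` is the part that maps naturally onto independent GPU threads.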
As in previous phases, the maximal alignment searches are divided into batches. After all the batches are executed, ppsAlign aggregates the output subsets and selects the one with the largest $S_a$ as the largest subset $M$. The transformation $T$ associated with the largest subset $M$ is used to superimpose all the residues from Q over P, and the residue pairs whose distance is below a cutoff (4.0 Å) are selected to form a new residue alignment $R$.
After gMaxSub terminates, if the current iteration number is less than $N_{iter}$, the residue alignment $R$ is first filtered to remove redundant alignments from the same database protein and then sent to the residue-level alignment for further refinement; otherwise, $R$ is used as input for the next step, the final assessment.

Final assessment of alignment quality
After the structure alignments are computed on the GPU, the residue alignments $R$ are transferred from the GPU memory to the CPU memory. We use PSI (percentage of structural similarity), defined as the percentage of residue pairs from $R$ with distance below 4.0 Å, to score the alignment quality. We also assess the statistical significance of the alignments through the z-Score of the PSI, which is given as follows:
$$z\text{-Score} = \frac{PSI - \mu_{PSI}}{\sigma_{PSI}}$$
where $\mu_{PSI}$ and $\sigma_{PSI}$ denote the mean and standard deviation of PSI for a given protein chain length, respectively. The parameters $\mu_{PSI}$ and $\sigma_{PSI}$ are obtained using a method similar to [12], leading to the following settings: $\mu_{PSI} = 375.64 \cdot k^{-0.5295}$ and $\sigma_{PSI} = 99.67 \cdot k^{-0.5885}$. Here, $k$ is the minimum chain length of the target and database proteins.
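The significance calculation can be written as a short Python sketch. It assumes that the two fitted constants are power laws in the chain length k and that the z-Score has the usual form (value minus mean, divided by standard deviation); the function names are hypothetical.

```python
def psi_mean(k):
    """Expected PSI for minimum chain length k (assumed power-law fit)."""
    return 375.64 * k ** (-0.5295)


def psi_std(k):
    """Standard deviation of PSI for minimum chain length k (assumed power-law fit)."""
    return 99.67 * k ** (-0.5885)


def psi_z_score(psi, target_length, database_length):
    """z-Score of a PSI value, with k the shorter of the two chain lengths."""
    k = min(target_length, database_length)
    return (psi - psi_mean(k)) / psi_std(k)


# A PSI of 60% between chains of 120 and 200 residues.
z = psi_z_score(60.0, 120, 200)
```

Under these fits both the expected PSI and its spread shrink as chains get longer, so the same PSI value is more significant for a longer pair.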

Results
In this section, we compare ppsAlign's performance to that of existing methods in terms of alignment quality and computational efficiency. We evaluate ppsAlign using an NVIDIA Tesla C2050 GPU card equipped with 448 cores at 1.15 GHz and 3 GB of global memory. The methods compared include TM-align [10], Fr-TM-align [11], and MAMMOTH [12], which share a similar computational framework with ppsAlign. As DALI [8] and CE [9] have been exhaustively evaluated elsewhere [10], we do not include these approaches in our experiments. We downloaded the software packages of these methods from their official websites and evaluated their performance on a Linux personal computer with an AMD Opteron dual-core 1000 series processor at 1.8 GHz and 8 GB RAM.
The main purpose of structure alignment is to maximize the number of aligned residues ($N_e$) while minimizing the RMSD of the aligned residues, denoted by cRMSD. To eliminate the dependence of cRMSD on $N_e$, in this paper we use a normalized measure of cRMSD, RMSD$_{100}$, to evaluate the alignment quality. RMSD$_{100}$ is calculated as follows [36]:
$$RMSD_{100} = \frac{cRMSD}{1 + \ln\sqrt{N_e / 100}}$$
which corresponds to the cRMSD value expected if the two protein structures were 100 residues long.
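As a worked illustration of the normalization, the Python sketch below assumes the commonly used form in which cRMSD is divided by 1 + ln sqrt(N_e / 100); the guard against very small alignments and the function name `rmsd100` are additions for this example.

```python
import math


def rmsd100(c_rmsd, n_aligned):
    """Size-normalized RMSD, assuming cRMSD / (1 + ln sqrt(N_e / 100))."""
    denominator = 1.0 + math.log(math.sqrt(n_aligned / 100.0))
    if denominator <= 0.0:
        # Assumption: the normalization is not meaningful for tiny alignments.
        raise ValueError("too few aligned residues for this normalization")
    return c_rmsd / denominator


# An alignment of exactly 100 residues is left unchanged.
print(rmsd100(2.5, 100))  # 2.5
```

Alignments longer than 100 residues are scaled down and shorter ones are scaled up, so methods that align different numbers of residues can be compared on one scale.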
To evaluate efficiency, we measure the execution time on a dataset of protein chains with lengths from 80 to 500 residues, extracted from the ASTRAL 1.75 database [37] with sequence identity < 40% (ASTRAL40). The maximal database protein chain length is determined by the global memory capacity of the GPU card. However, this limitation is not severe, as 98.5% of ASTRAL40 protein chains have fewer than 500 residues. We expect that advances in GPU technology will remove this memory limitation in the near future, so that the ppsAlign algorithm can handle protein chains longer than 500 residues. Currently we can handle structures larger than 500 residues in one of the following two ways: 1) by sending the alignment tasks to our CPU-based algorithm, or 2) if resources allow, by using another GPU card to align the remaining 1.5% of large structures. Although the algorithm can also handle small protein chains below 80 residues (~16% of ASTRAL40), we do not use them for our testing because they have relatively simple topologies [38].
To utilize the global memory of the GPU card efficiently, the database proteins are sorted according to chain length and then divided into two smaller datasets: 1) $D_1$, which includes 6,569 proteins in the range [80, 250) residues selected from ASTRAL40 according to the length distribution of proteins, and 2) $D_2$, which includes 1,912 proteins in the range [251, 500) residues. The target dataset includes 100 proteins which are randomly selected in the range [80, 250) from ASTRAL40. For each target protein, a one-against-all alignment is performed with all database proteins, so that in total 100 × (6,569 + 1,912) = 848,100 non-homologous protein pairs are compared during the experiment.

Scalability of ppsAlign
There are two critical parameters for ppsAlign, namely the maximal number of iterations ($N_{iter}$) and the maximal number of MFS' ($N_{seed}$). Intuitively, increasing $N_{iter}$ or $N_{seed}$ will often yield better alignment quality, but the execution time will be significantly lengthened. To verify this, we first investigate the performance of different settings in terms of RMSD$_{100}$, using a small target dataset of 17 proteins and the dataset $D_1$. The experimental results for RMSD$_{100}$ with $N_{iter}$ = {3, 5, 7} and $N_{seed}$ = {10, 30, 50, 70} are shown in Figure 6, which illustrates that RMSD$_{100}$ decreases as $N_{iter}$ and/or $N_{seed}$ increases. This figure can be used as a guideline for parameter selection in ppsAlign. For a fair comparison of the efficiency of ppsAlign with that of another method, we select a combination of $N_{iter}$ and $N_{seed}$ that achieves comparable alignment quality.

Speedup over TM-align and CPU-based ppsAlign
In this experiment, ppsAlign is executed with a parameter setting of $N_{iter}$ = 3 and $N_{seed}$ = 20, which results in an RMSD$_{100}$ comparable to that of TM-align and the CPU version of ppsAlign. Table 1 summarizes the alignment quality, average execution time, and corresponding speedup. ppsAlign achieves speedups of 23.8 and 35.9 compared to CPU-based ppsAlign and TM-align, respectively. The detailed comparison of the alignment quality of ppsAlign and TM-align can be found in Additional file 1: Table S1.

Speedup over Fr-TM-align
Since Fr-TM-align performs more iterations to improve its alignment quality over TM-align, we increase both the iteration and seed numbers of the ppsAlign algorithm to achieve an alignment quality comparable to that of Fr-TM-align. The experimental results for RMSD$_{100}$, average execution time, and corresponding speedup with $N_{iter}$ = 6 and $N_{seed}$ = 30 are shown in Table 2. ppsAlign achieves a speedup of 64.7 compared to Fr-TM-align with the same alignment quality. The detailed comparison of the alignment quality of ppsAlign and Fr-TM-align can be found in Additional file 1: Table S2.

Speedup over MAMMOTH
In the last experiment, we use the same dataset to compare the performance of ppsAlign and MAMMOTH. Unlike TM-align and Fr-TM-align, MAMMOTH was originally developed for large-scale comparisons with high efficiency at the cost of reduced alignment quality. Because of its high speed, MAMMOTH is used as a benchmark for maximal speed on the CPU platform in [39]. The experimental results for RMSD$_{100}$, average execution time, and corresponding speedup with $N_{iter}$ = 1 and $N_{seed}$ = 8 are shown in Table 3. ppsAlign achieves a speedup of 40.3 compared to MAMMOTH, together with higher alignment quality. The detailed comparison of the alignment quality of ppsAlign and MAMMOTH can be found in Additional file 1: Table S3.

Discussion
The framework of ppsAlign is a general-purpose GPU platform for protein structure alignment which can incorporate many existing methods, such as TM-align [10] and Fr-TM-align [11], into its parallelized algorithm design. An important novelty of our approach is its design for managing the resources of the GPU architecture. First, the application is decomposed into kernels characterized by different parallelization strategies. In the existing methods for GPU-based sequence alignment mentioned previously, a pair-wise comparison is either assigned to a thread (i.e., inter-task parallelization) or cooperatively performed by a block of threads (i.e., intra-task parallelization) [18,20]. However, as the workflow of structure alignment is more complicated than that of sequence alignment, neither the inter- nor the intra-task parallelization alone can efficiently exploit the GPU computing power. Therefore, ppsAlign utilizes a hybrid inter- and intra-task parallel model. In particular, each task (i.e., pair-wise structural comparison) is divided into several independent seed alignments. Each seed alignment is assigned to a different thread (inter-task parallelization), whereas each block executes one or more pair-wise comparisons (intra-task parallelization). Second, the memory layout and memory access patterns are carefully designed, the former allowing an effective use of the memory capacity at the different levels of the GPU memory hierarchy, and the latter minimizing the memory bandwidth requirement of the application. Third, several efficient algorithms that avoid complex control flow are proposed to take advantage of the SIMT nature of the GPU. For instance, a feature-based measure is used to compute the similarity of fragments in the fragment-level alignment, which avoids time-consuming RMSD calculation at the initial stage of structure alignment.
One of the major ways in which ppsAlign differs from other methods is that it implements protein structure alignment at the residue level on the GPU. Recently, GPU-enhanced algorithms have been gaining increasing attention in bioinformatics. One of the major steps was a GPU implementation of one-against-all sequence comparison using the Smith-Waterman algorithm [20,21]. With these methods, a sequence database search can be performed resulting in a list of similarity scores, while these methods do not provide the detailed alignment information of the best hits [23]. To provide detailed residue-residue correspondence, GPU-BLAST [24] was developed to accelerate the NCBI-BLAST search, achieving a speedup between 3 and 4 on an NVIDIA Tesla C2050 GPU card. In addition, another approach to protein sequence alignment that uses backtracking on the GPU to construct residue alignments has been proposed [23]. Compared to sequence alignment, the implementation of structure alignment on the GPU is a more challenging task, because some routines (e.g., RMSD calculation) can cause severe divergence among GPU threads and decrease GPU performance.
Table 1 footnote: RMSD$_{100}$ is the expected value of cRMSD if the two protein structures were 100 residues long. ppsAlign is running on an NVIDIA Tesla C2050 GPU card and the other methods (CPU-based ppsAlign and TM-align) are running on a computer with an AMD Opteron dual-core 1000 series processor at 1.8 GHz and 8 GB RAM. The parameter settings of ppsAlign are $N_{iter}$ = 3 and $N_{seed}$ = 20.
Table 3 footnote: RMSD$_{100}$ is the expected value of cRMSD if the two protein structures were 100 residues long. ppsAlign is running on an NVIDIA Tesla C2050 GPU card and MAMMOTH is running on a computer with an AMD Opteron dual-core 1000 series processor at 1.8 GHz and 8 GB RAM. The parameter settings of ppsAlign are $N_{iter}$ = 1 and $N_{seed}$ = 8.
One of the first structure comparison methods implemented on GPU, SA Tableau Search [27], aligns protein substructure at the secondary structure level, that is by aligning secondary structure elements, while not aligning structures at the residue level. To the best of our knowledge, ppsAlign is the first protein structure comparison platform for GPU that provides the residue level structural alignment.
The substantial contribution of ppsAlign is to provide a high-performance computing platform for the research community. An alternative way to accelerate protein structure alignment is to install more CPU cores in a single machine. However, using more CPU cores in a single machine requires upgrading the main board and memory accordingly, which could hurt the price/performance ratio. In contrast, installing a GPU card into a PCIe (Peripheral Component Interconnect Express) slot does not incur this extra cost, and more GPU cards can be attached to one PCIe slot through a switch. In this paper, an NVIDIA Tesla C2050 GPU card, which has also been used in GPU-BLAST [24], is utilized to evaluate performance. Though it is a high-end NVIDIA product, we expect its price to drop in the near future due to market demand in the gaming industry.

Abbreviations
MFS: matched fragment set; PCIe: Peripheral Component Interconnect Express; RMSD: Root mean square deviation; SA: simulated annealing; SIMT: Single Instruction, Multiple Thread; SSE: secondary structure element.