 Research note
 Open Access
An OpenMP-based tool for finding longest common subsequence in bioinformatics
BMC Research Notes volume 12, Article number: 220 (2019)
Abstract
Objective
Finding the longest common subsequence (LCS) among sequences is NP-hard. This is an important problem in bioinformatics for DNA sequence alignment and pattern discovery. In this research, we propose new CPU-based parallel implementations that can provide significant advantages in terms of execution time, monetary cost, and pervasiveness in finding the LCS of DNA sequences in environments where Graphics Processing Units are not available. For general-purpose use, we also make the OpenMP-based tool publicly available to end users.
Result
In this study, we develop three novel parallel versions of the LCS algorithm on: (i) a distributed memory machine using the message passing interface (MPI); (ii) a shared memory machine using OpenMP; and (iii) a hybrid platform that utilizes both distributed and shared memory using MPI-OpenMP. The experimental results with both simulated and real DNA sequence data show that the shared memory OpenMP implementation provides at least a two-fold absolute speedup over the best sequential version of the algorithm and a relative speedup of almost 7. We provide a detailed comparison of the execution times among the implementations on different platforms with different versions of the algorithm. We also show that removing branch conditions negatively affects the performance of the CPU-based parallel algorithm on the OpenMP platform.
Introduction
Finding the Longest Common Subsequence (LCS) is a classic problem in the field of computer algorithms and has diverse application domains. A subsequence of a string is another string that can be derived from the original string by deleting zero or more characters (contiguous or non-contiguous) from it. A longest common subsequence of two given strings is a longest string that is a subsequence of both strings. The sequential version of the LCS algorithm using "equal-unequal" comparisons takes \(\varOmega \left( {mn} \right)\) time, where m and n represent the lengths of the two sequences being compared [1, 2]. It is worth mentioning that the problem of finding the LCS of more than two strings is NP-hard [3, 4].
LCS has applications in multiple fields, including DNA sequence alignment in bioinformatics [5,6,7], speech and image recognition [8, 9], file comparison, and optimization of database queries [10]. In the field of bioinformatics, pattern discovery helps to find common patterns among DNA sequences of interest, which might suggest that the sequences are biologically related (e.g., have similar biological functions) [11]. In discovering patterns between sequences, LCS plays an important role in finding the longest common region between two sequences. Although considerable effort has been devoted to pattern discovery, algorithms face performance bottlenecks as sequence lengths increase [12]. Furthermore, with the advent of next-generation sequencing technologies, sequence data is increasing rapidly [13], which demands algorithms with the minimum possible execution time. Parallel algorithms can play a vital role in this regard.
Among the parallel solutions to the LCS problem, the anti-diagonal [14] and bit-parallel [15] algorithms are among the earliest noteworthy attempts. Recently, with the rise of Graphics Processing Unit (GPU)-based accelerators, several Compute Unified Device Architecture (CUDA)-based, GPU-targeted solutions to the LCS problem have been proposed. Yang et al. [16] were among the first to propose an improved row-wise independent parallel version of the LCS algorithm by changing the data dependency used by the dynamic programming approach and exploiting the unique memory-access properties of GPUs. More recently, Li et al. [17] proposed a parallel formulation of the anti-diagonal approach to the LCS algorithm using a GPU-based model. Although these GPU-based models offer faster execution times, GPU devices are still quite expensive, and hence only a few computers are equipped with GPUs. In such cases, CPU-based parallel LCS algorithms (e.g., using the message passing interface (MPI) and open multiprocessing (OpenMP)) are still in great demand for achieving performance improvements. However, to the best of our knowledge, there is no publicly available CPU-based tool for end users. We addressed this gap by developing a new OpenMP-based tool for end users by improving the row-wise independent version [16] of the LCS algorithm. Moreover, we also developed two other CPU-based parallel implementations (MPI and hybrid MPI-OpenMP) of the algorithm and provided a detailed benchmarking of all these implementations on simulated and real DNA sequence data, which was absent for this version of the LCS algorithm. The main contributions of this study are listed below.

1. A new, publicly available OpenMP-based tool for finding the length of the LCS of DNA sequences for end users.

2. A detailed benchmarking of the newly developed CPU-based parallel algorithms using different performance metrics on both simulated and real DNA sequence data, where we found that our OpenMP-based algorithm provides at least 2 times absolute speedup (compared to the best sequential version) and 7 times relative speedup (compared to using only 1 thread).

3. A comparison of the newly developed OpenMP-based LCS algorithm with and without branch conditions.
Main text
Preliminaries
Given two sequence strings \(A\left[ {1,2, \ldots , m} \right]\) and \(B\left[ {1,2, \ldots , n} \right]\), the LCS of the two strings can be found by calculating the longest common subsequence of all possible prefix strings of \(A\) and \(B\). The LCS of a prefix pair \(A\left[ {1,2, \ldots ,i} \right]\) and \(B\left[ {1,2, \ldots ,j} \right]\) can be calculated using the previously calculated prefix pairs with the following recurrence relation:
\[R\left[ {i,j} \right] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ R\left[ {i - 1,j - 1} \right] + 1 & \text{if } A\left[ i \right] = B\left[ j \right] \\ \max \left( {R\left[ {i - 1,j} \right],R\left[ {i,j - 1} \right]} \right) & \text{if } A\left[ i \right] \ne B\left[ j \right] \end{cases} \tag{1}\]
Here, \(R\) is a score table consisting of the lengths of the longest common subsequences of all the possible prefixes of the two strings. The length of the longest common subsequence of \(A\) and \(B\) can be found in the cell \(R\left[ {m,n} \right]\) of table \(R\). From Eq. 1, we can see that the value of a cell \(R\left[ {i,j} \right]\) in the score table \(R\) depends on \(R\left[ {i - 1,j - 1} \right]\), \(R\left[ {i,j - 1} \right]\) and \(R\left[ {i - 1,j} \right]\).
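As an illustration of the classic recurrence described above, the following is a minimal sequential sketch in C. It is not taken from the authors' tool; the function name lcs_length_classic and the row-major layout of the table are assumptions made for this example. Strings are 0-indexed in C, so A[i] in the text corresponds to a[i - 1] in the code.

```c
#include <stdlib.h>
#include <string.h>

/* Length of the LCS of a and b, computed with the full (m+1) x (n+1)
 * score table R. Returns -1 if memory allocation fails.
 * (Hypothetical helper for illustration, not the authors' code.) */
int lcs_length_classic(const char *a, const char *b)
{
    size_t m = strlen(a), n = strlen(b);
    /* R is stored row-major; calloc zero-fills row 0 and column 0. */
    int *R = calloc((m + 1) * (n + 1), sizeof *R);
    if (R == NULL)
        return -1;

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            int up   = R[(i - 1) * (n + 1) + j];      /* R[i-1, j]   */
            int left = R[i * (n + 1) + (j - 1)];      /* R[i, j-1]   */
            int diag = R[(i - 1) * (n + 1) + (j - 1)];/* R[i-1, j-1] */
            if (a[i - 1] == b[j - 1])
                R[i * (n + 1) + j] = diag + 1;
            else
                R[i * (n + 1) + j] = up > left ? up : left;
        }
    }
    int len = R[m * (n + 1) + n];                     /* R[m, n] */
    free(R);
    return len;
}
```

Because each cell reads its left neighbour R[i, j - 1], the cells of one row cannot be computed independently in this form, which is the dependency the row-wise independent algorithm removes.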
Row-wise independent algorithm (Version 1)
Yang et al. [16] devised a row-wise independent parallel algorithm by removing the dependency among the cells of the same row. The modified equation is as follows:
Here, \(k\) denotes the number of steps required to reach either a match, such that \(A\left[ i \right] = B\left[ {j - k} \right]\), or \(j - k = 0\). Yang et al. [16] divided their algorithm into two steps. First, they calculated the values of \(j - k\) for every \(i\) and stored these values in another table named \(P\). The equation to calculate the values of \(P\) is given below.
Here, \(C\) is the string composed of the unique characters of string \(A\) and string \(B\). After that, the values of the score table \(R\) were calculated using the following updated equation.
Here, \(c\) denotes the index of character \(A\left[ {i - 1} \right]\) in string \(C\).
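The two steps described above can be sketched sequentially in C as follows. This is a reading of the textual description, not the authors' source code: it assumes that P[c][j] holds the largest position p ≤ j (1-based) at which B[p] equals the alphabet character C[c], with 0 meaning that no such position exists. The names alphabet_index, build_p_table and lcs_length_rowwise are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Index of ch in the alphabet string C, or -1 if absent. */
static int alphabet_index(const char *C, char ch)
{
    const char *p = strchr(C, ch);
    return (p != NULL && ch != '\0') ? (int)(p - C) : -1;
}

/* Fill P (u rows, n+1 columns, row-major). Each row only reads itself. */
static void build_p_table(int *P, const char *b, size_t n,
                          const char *C, size_t u)
{
    for (size_t c = 0; c < u; c++) {
        int *Pc = P + c * (n + 1);
        Pc[0] = 0;
        for (size_t j = 1; j <= n; j++)
            Pc[j] = (b[j - 1] == C[c]) ? (int)j : Pc[j - 1];
    }
}

/* LCS length of a and b over alphabet C; -1 on error (allocation failure,
 * empty alphabet, or a character of a that is not in C). */
int lcs_length_rowwise(const char *a, const char *b, const char *C)
{
    size_t m = strlen(a), n = strlen(b), u = strlen(C);
    if (u == 0)
        return -1;
    int *P = malloc(u * (n + 1) * sizeof *P);
    int *R = calloc((m + 1) * (n + 1), sizeof *R);
    int len = -1;
    if (P == NULL || R == NULL)
        goto done;

    build_p_table(P, b, n, C, u);
    for (size_t i = 1; i <= m; i++) {
        int c = alphabet_index(C, a[i - 1]);
        if (c < 0)
            goto done;
        const int *prev = R + (i - 1) * (n + 1);   /* row i-1 */
        int *cur = R + i * (n + 1);                /* row i   */
        for (size_t j = 1; j <= n; j++) {
            int p = P[(size_t)c * (n + 1) + j];
            if (a[i - 1] == b[j - 1])
                cur[j] = prev[j - 1] + 1;
            else if (p == 0)                       /* no earlier match */
                cur[j] = prev[j];
            else                                   /* jump to last match */
                cur[j] = prev[j] > prev[p - 1] + 1 ? prev[j] : prev[p - 1] + 1;
        }
    }
    len = R[m * (n + 1) + n];
done:
    free(P);
    free(R);
    return len;
}
```

For DNA data the alphabet would be passed as "ACGT", for example lcs_length_rowwise(seq1, seq2, "ACGT"). Every cell of row i reads only row i - 1 and the P table, so the inner loop has no dependency between iterations.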
Row-wise independent algorithm (Version 2)
As branching can hamper the performance of parallel algorithms, Yang et al. [16] further modified the calculation of the \(P\) matrix using the following equation.
Then Eq. (4) can be rewritten as follows, with one branching condition removed.
From the two versions of the row-wise independent algorithm, we can see that the calculation of the values of table \(P\) depends only on the same row. In contrast, the calculation of the values of score table \(R\) depends only on the previous row.
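To make the idea of trading branches for arithmetic concrete, the sketch below shows one possible way to compute a row of the score table without explicit tests on the match condition. It is an illustration only and may differ from the equations of Yang et al. [16]; it assumes the same P convention as before (P holds the 1-based position of the most recent match, or 0 if there is none), and the function name update_row_branch_reduced is hypothetical.

```c
#include <stddef.h>

/* Compute one row of the score table without testing A[i] == B[j] or
 * P == 0 in an if statement.
 *   prev : row i-1 of R (n+1 entries, prev[0] must be 0)
 *   cur  : row i of R   (n+1 entries, written here)
 *   Pc   : row of P for the character A[i] (n+1 entries)
 * (Illustrative sketch, not the formulation used in the cited work.) */
void update_row_branch_reduced(const int *prev, int *cur,
                               const int *Pc, size_t n)
{
    cur[0] = 0;
    for (size_t j = 1; j <= n; j++) {
        int p    = Pc[j];
        int has  = (p != 0);                  /* 1 if a match exists in B[1..j] */
        int cand = has * (prev[p - has] + 1); /* prev[p-1] + 1, or 0 if none    */
        cur[j] = prev[j] > cand ? prev[j] : cand;
    }
}
```

When B[j] itself matches, P equals j and the candidate prev[j - 1] + 1 is never smaller than prev[j], so the match case needs no separate test. The price is an extra multiplication and index computation per cell, and whether the final maximum compiles to a branch or a conditional move depends on the compiler.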
Methodology
For the calculation of the P table, each row is independent and can be calculated in parallel. Therefore, in our MPI implementation, we scattered the P table to all the processes at the beginning. After the processes calculate their corresponding chunks, process number zero gathers the partial results from all the other processes. For the calculation of the score table R, the elements of each row can be scattered among the processes and gathered afterwards. These scatter and gather operations must be performed for every row. Hence, the communication and synchronization overheads are expected to be higher for the MPI implementation.
A shared memory implementation can largely mitigate the communication and synchronization overheads of distributed memory implementations, which inspired us to develop the shared memory (OpenMP) implementation. In the OpenMP implementation, we used the work-sharing construct #pragma omp parallel for (an OpenMP directive for sharing the iterations of a loop among the available threads) to compute the elements of a single row of the score table R in parallel. We tried different scheduling strategies (static, dynamic, and guided) for sharing work among the threads. The calculation of the P table was also shared among threads. In this case, the outer loop was parallelized using the #pragma omp parallel for construct, as the rows are independent of each other.
In the hybrid MPI-OpenMP approach, we selected the optimum number of processes and threads from the experiments with the MPI and OpenMP approaches. We then scattered every row among the processes and, inside each process, further shared the chunk of rows among threads using #pragma omp parallel for. To accommodate longer DNA sequences, we optimized the space complexity of all three implementations by keeping only the current and the previous row of the score table.
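The OpenMP work sharing and the two-row space optimization described above can be sketched together as follows. This is a hedged illustration rather than the code of the published tool: the function name lcs_length_openmp, the alphabet parameter C, and the P convention (1-based position of the most recent match, 0 if none) are assumptions of this example, and schedule(static) is simply one of the scheduling strategies mentioned in the text.

```c
#include <stdlib.h>
#include <string.h>

/* LCS length of a and b over alphabet C using OpenMP and two rows of R.
 * Returns -1 on allocation failure, empty alphabet, or a character of a
 * that is not in C. (Hypothetical sketch, not the authors' code.) */
int lcs_length_openmp(const char *a, const char *b, const char *C)
{
    size_t m = strlen(a), n = strlen(b), u = strlen(C);
    if (u == 0)
        return -1;
    int *P    = malloc(u * (n + 1) * sizeof *P);
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1 of R */
    int *cur  = calloc(n + 1, sizeof *cur);    /* row i of R   */
    int len = -1;
    if (P == NULL || prev == NULL || cur == NULL)
        goto done;

    /* Rows of P are independent of each other: share the outer loop. */
    #pragma omp parallel for schedule(static)
    for (long c = 0; c < (long)u; c++) {
        int *Pc = P + (size_t)c * (n + 1);
        Pc[0] = 0;
        for (size_t j = 1; j <= n; j++)
            Pc[j] = (b[j - 1] == C[c]) ? (int)j : Pc[j - 1];
    }

    for (size_t i = 1; i <= m; i++) {
        const char *hit = strchr(C, a[i - 1]);
        if (hit == NULL)
            goto done;
        const int *Pc = P + (size_t)(hit - C) * (n + 1);

        /* Cells of row i read only prev and P: share the inner loop. */
        #pragma omp parallel for schedule(static)
        for (long j = 1; j <= (long)n; j++) {
            int p = Pc[j];
            int cand = (p == 0) ? 0 : prev[p - 1] + 1;
            cur[j] = prev[j] > cand ? prev[j] : cand;
        }

        /* The row just computed becomes the previous row. */
        int *tmp = prev;
        prev = cur;
        cur = tmp;
    }
    len = prev[n];
done:
    free(P);
    free(prev);
    free(cur);
    return len;
}
```

With GCC the pragmas take effect when the file is compiled with -fopenmp; without that flag they are ignored and the same code runs sequentially. Only two rows of n + 1 integers are kept for the score table, so memory for R no longer grows with m, although the P table still needs one row per alphabet character.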
Results and discussion
Data sets and specifications of the computer
We used two different data sets for our experiments. The first is a simulated DNA sequence data set, collected from the University of California Riverside (UCR) random DNA sequence generator [18]. The lengths of the different pairs of sequences range from 128 base pairs to 32,768 base pairs. The second data set consists of 8 virus genome sequence pairs and two entire chromosome genome sequence pairs of two eukaryotes, collected from the website of the National Center for Biotechnology Information (NCBI) [19]. The selected sequence lengths vary from 359 base pairs to 32,276 base pairs for the viruses, and from 1,505,371 base pairs to 16,199,981 base pairs for the eukaryotes. Table 1 lists the selected virus and eukaryote pairs and their sequence lengths.
All the experiments were run on the University of Manitoba's on-campus cluster computing system (Mercury machine). The cluster consists of four fully connected computing nodes with 2-gigabit Ethernet lines between every pair of nodes. Each node consists of two 14-core Intel Xeon E5-2680 v4 2.40 GHz CPUs with 128 GB of RAM. With a total of 28 cores and hyper-threading, each node is capable of running twice as many hardware threads (56 threads) at a time.
Comparison among different approaches
For the MPI approach, we tuned the number of processes and found that using 4 processes gives better relative speedup. For the OpenMP approach, we tuned the number of threads and the scheduling strategy (static, dynamic, and guided). We found that using 16 threads with static scheduling of the work shared among the threads provided 7 times relative speedup (see Fig. 1a, b). Finally, for the hybrid MPI-OpenMP approach, we used 4 processes (or nodes) and 16 threads.
For comparison purposes, we experimented with varying sequence lengths. Figure 2a illustrates the execution times for the different implementations, where we can see that our OpenMP implementation outperforms all the other approaches and is almost 2 times faster than the best sequential version. However, the MPI approach provides poor results due to the increased communication and synchronization overhead caused by m scatter and gather operations (which are blocking in nature). The hybrid MPI-OpenMP approach performs the worst: it requires the same number of scatter and gather operations as the MPI approach and additionally incurs the synchronization overheads of OpenMP. This observation indicates that a distributed memory implementation is discouraged for the LCS algorithm. In order to validate our results, we also experimented with the real DNA sequence data (see Table 1). From Fig. 2b, we can see that even for the real data the OpenMP implementation achieves at least 2 times speedup over the best sequential version. For longer DNA sequences (SP 9, SP 10 in Fig. 2b), the OpenMP speedups are even higher, whereas the MPI and the hybrid implementations took more than a week to complete.
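For reference, the two speedup figures quoted above can be written as follows, using the definitions given in the list of contributions: absolute speedup is measured against the best sequential version, and relative speedup against the same parallel program running on one thread. The symbols \(T_{\text{seq}}\), \(T_1\) and \(T_p\) are notation introduced here for illustration only.

```latex
\[
S_{\text{abs}}(p) = \frac{T_{\text{seq}}}{T_p},
\qquad
S_{\text{rel}}(p) = \frac{T_1}{T_p}
\]
% T_seq : execution time of the best sequential version
% T_1   : execution time of the parallel program with 1 thread
% T_p   : execution time of the parallel program with p threads
```

Under these definitions, a relative speedup of about 7 on 16 threads would correspond to a parallel efficiency of \(S_{\text{rel}}(p)/p \approx 7/16 \approx 0.44\). Efficiency is a standard derived metric and is computed here only as a worked example.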
Comparison between the two versions of the algorithm in OpenMP approach
In the above experiments, we used version 2 (without branching) of the row-wise independent algorithm. In order to compare the execution times of the two versions (version 1 and version 2), we also implemented version 1. Figure 2c illustrates the execution times of the two versions with varying sequence sizes and 1 thread only, where we can see that version 1 performs better than version 2 of the algorithm. Although version 2 removes branching conditions, it adds more computations, which might be the reason for its longer execution times. Furthermore, CPU architectures are much better at branch prediction than GPUs. Therefore, the second version of the row-wise independent parallel algorithm performed well on GPUs [16] but not on CPUs.
Limitations
Our study investigated parallelization of the row-wise independent version of the LCS algorithm only, as it is easy to parallelize using the MPI and OpenMP frameworks. As we found that the version of the row-wise independent algorithm with branching performs better than the other version, we will investigate this version in more detail in the future. We will also investigate other versions of the algorithm with the goal of finding better parallelization strategies.
Availability and requirements
Project name: LCS row parallel (CPU)
Project home page:
Operating systems: Platform independent
Programming language: C
Other requirements: gcc 4.8.5 or later, OpenMPI version 1.10.7 or later, OpenMP version 3.1 or later
License: MIT License
Any restrictions to use by non-academics: None.
Abbreviations
CUDA: compute unified device architecture
GPU: graphics processing unit
LCS: longest common subsequence
MPI: message passing interface
OpenMP: open multiprocessing
UCR: University of California Riverside
NCBI: National Center for Biotechnology Information
References
1. Ullman JD, Aho AV, Hirschberg DS. Bounds on the complexity of the longest common subsequence problem. J ACM. 1976;23:1–12.
2. Wagner RA, Fischer MJ. The string-to-string correction problem. J ACM. 1974;21:168–73.
3. Maier D. The complexity of some problems on subsequences and supersequences. J ACM. 1978;25:322–36.
4. Garey MR, Johnson DS. Computers and intractability: a guide to the theory of NP-completeness (series of books in the mathematical sciences), ed. Comput Intractability. 1979. p. 340.
5. Ossman M, Hussein LF. Fast longest common subsequences for bioinformatics dynamic programming. Population (Paris). 2012;5:7.
6. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci. 1988;85:2444–8.
7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
8. Guo A, Siegelmann HT. Time-warped longest common subsequence algorithm for music retrieval. In: ISMIR. 2004.
9. Petrakis EGM. Image representation, indexing and retrieval based on spatial relationships and properties of objects. Rethymno: University of Crete; 1993.
10. Kruskal JB. An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Rev. 1983;25(2):201–37.
11. Ning K, Ng HK, Leong HW. Analysis of the relationships among longest common subsequences, shortest common supersequences and patterns and its application on pattern discovery in biological sequences. Int J Data Min Bioinform. 2011;5:611–25.
12. Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33:4899–913.
13. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13:e1002195.
14. Babu KN, Saxena S. Parallel algorithms for the longest common subsequence problem. In: HiPC. 1997. p. 120–5.
15. Crochemore M, Iliopoulos CS, Pinzon YJ, Reid JF. A fast and practical bit-vector algorithm for the longest common subsequence problem. Inf Process Lett. 2001;80:279–85.
16. Yang J, Xu Y, Shang Y. An efficient parallel algorithm for longest common subsequence problem on GPUs. In: Proceedings of the world congress on engineering. 2010. p. 499–504.
17. Li Z, Goyal A, Kimm H. Parallel longest common sequence algorithm on multicore systems using OpenACC, OpenMP and OpenMPI. In: 2017 IEEE 11th international symposium on embedded multicore/many-core systems-on-chip (MCSoC). 2017. p. 158–65.
18. Random DNA Sequence Generator. http://www.faculty.ucr.edu/~mmaduro/random.htm. Accessed 2 Apr 2018.
19. National Center for Biotechnology Information (NCBI). https://www.ncbi.nlm.nih.gov/. Accessed 20 Sept 2018.
Authors’ contributions
RS formulated the problem, developed the implementations and drafted the manuscript. PT and PH conceived the study design. PH directed the data collection and analysis procedure. PT, PH and PI interpreted the results and significantly revised the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank all the members of the Hu Lab for their valuable suggestions.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The source code, the data sets used, and the documentation are available at https://github.com/RayhanShikder/lcs_parallel.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and the University of Manitoba, which provided a research assistantship for Rayhan Shikder to perform the study.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Keywords
 Longest common subsequence (LCS)
 DNA sequence alignment
 Parallel algorithms for LCS
 LCS on MPI and OpenMP
 Tool for finding LCS