Efficient computation of spaced seeds
 Silvana Ilie^{1}Email author
DOI: 10.1186/175605005123
© Ilie et al; licensee BioMed Central Ltd. 2011
Received: 10 November 2011
Accepted: 28 February 2012
Published: 28 February 2012
Abstract
Background
The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, lineartime heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program.
Findings
SpEED uses a hill climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by over one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We compute as well multiple seeds of the same weight as MegaBLAST, that greatly improve its sensitivity.
Conclusion
Multiple spaced seeds are being successfully used in bioinformatics software programs. Enabling researchers to compute very fast high quality seeds will help expanding the range of their applications.
Keywords
Similarity search Local alignment Spaced seed Heuristic algorithm SensitivityBackground
The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. This problem can be solved exactly using the dynamic programming algorithm of SmithWaterman in quadratic time. Many instances, including all database searches, are too large for this approach to be feasible and heuristic algorithms are used instead [1, 2]. The most widely used program in bioinformatics, BLAST [2, 3], is one such tool. It uses the socalled "hit and extend" approach: a hit consists of 11 consecutive matches between two sequences and represents a potential local alignment. The hit is then extended both ways in search for similarity.
It is clear that not all local alignments have to include an identical stretch of length 11. It has been already noticed in [4] and then again in [5] that requiring that the matches are not consecutive increases the chances of finding alignments. The idea of optimizing the way the required matches are placed has been investigated in [6, 7], the latter having used it in a similarity search software, PatternHunter. Much work has been dedicated to spaced seeds. For a survey of earlier work, we refer the reader to [8].
The 11 consecutive matches of BLAST are called a contiguous seed, denoted 11111111111 (for 11 consecutive matches), whereas the one of PatternHunter is a spaced seed, 111*1**1*1**11*111; a 1 represents a match and * a don't care position. The number of 1's represents the weight of the seed. The probability of finding local alignments, under specific conditions, to be made precise later, is called sensitivity.
We notice an essential trade off. Decreasing the number of matches, that is, the weight of the seed, increases the sensitivity but also the number of random hits, decreasing specificity. On the other hand, it is intuitively clear that several different seeds will hit different alignments, thus having increased sensitivity. It has been noticed by [9] that doubling the number of seeds can account for the decrease in weight, thus simultaneously increasing sensitivity without reducing specificity. PatternHunterII [9] uses 16 different seeds of weight 11. For comparison, under similar conditions, the sensitivity of the BLAST, PatterHunter, and PatternHunterII seeds, all of weight 11, are 0.30, 0.47, and 0.92, respectively, a very large difference.
Multiple spaced seeds represent the current state of the art in similarity search and are used by many software programs, in a variety of applications, such as sequence alignment [6, 9, 10], read mapping [11, 12], or oligonucleotide design [13]. It is therefore of great importance to be able to compute seeds with high sensitivity. The only way to find optimal seeds seems to be by trying all possible ones. This brute force approach includes two exponential steps. First, there are exponentially many candidates. Second, computing sensitivity is exponential as well. Therefore, only single seeds can be computed this way. For multiple seeds, since the relevant problems are hard [14, 15], heuristic algorithms must be used. Among many such algorithms, such as Mandala [16] and Iedera [17], only one works in polynomial time: SpEED [18]. SpEED is based on the notion of overlap complexity[19], that is very well correlated with sensitivity but polynomialtime computable. A hill climbing algorithm is used that iteratively swaps a 1 with a * in a random seed in order to improve the overlap complexity.
Our contribution in this paper is to improve the best existing software, SpEED, by increasing its speed and, consequently, the sensitivity of the computed seeds. The first algorithm we give is a bitparallel algorithm for computing overlap complexity. This is of independent interest and alone can speed up the hill climbing of SpEED significantly. However, we give a better algorithm for this heuristic that improves its speed by one order of magnitude. Several tests are provided to prove these claims. Then, the new implementation is employed to compute improved seeds for PatternHunterII as well as BFAST, as they use some of the most demanding seeds. Finally, we show a very significant improvement of the MegaBLAST seeds. At weight 28 they are significantly larger than everything else, yet we manage to compute up to 16 seeds of this weight, with very large improvement in sensitivity over MegaBLAST.
Spaced seeds
A spaced seed is any string containing 1's and *'s. Since having a * at one end of a seed is not useful, we assume that all seed start and end with a 1. For a seed s, the weight w of s is the number of 1's and the length ℓ is the number of all letters. The i th letter of s is denoted s[i]. A multiple spaced seed is a set of seeds S = {s_{1}, s_{2}, ..., s_{ k }}. In the Bernoulli model [9] an alignment is represented as a (random) sequence R of 1's and 0's (matches and mismatches) where the probability p of a match is called similarity. The length N of this region R plays an essential role in the sensitivity. We say that a seed s hits R if there is a position i in R such that, for any j, 0 ≤ j ≤ ℓ1, if s[j] = 1, then R[i + j] = 1. That means, aligning s with R starting at position j causes all 1's in s to correspond to 1's in R. This definition extends naturally to multiple seeds: S hits if one of its seeds does so.
The sensitivity of s (or S) is then defined as the probability that s (or S, respectively) hits R. It depends on the distribution of matches in the seed as well as on the length N of the region to be hit and the similarity level p. The sensitivity can be computed by the dynamic programming (exponential) algorithm given in [9].
Overlap complexity
Since the number of expected seeds is proportional to the weight of a seed, it is fair to compare seeds of the same weight. Thus, given two seeds of weight w, one contiguous and one spaced, the spaced one may have higher sensitivity because its hits do not overlap as much and are therefore better distributed, covering more similarities. The distribution of the 1's is nevertheless crucial for the quality of the seed. For instance, periodically spaced seeds are worse than contiguous. The problem of finding optimal seeds seems difficult and heuristic algorithms are employed. The only polynomialtime algorithm, implemented in SpEED [18], is based on the notion of overlap complexity[19], that captures the amount of overlaps between hits. Given two seeds s_{1} and s_{2}, for each of the possible s_{1} + s_{2}1 overlaps between them, denote by σ_{ i }the number of overlapping positions where both seeds have a 1.
A very strong experimental correlation between overlap complexity and sensitivity has been observed in [19]. Seeds with low overlap complexity have high sensitivity. The algorithm of [18] employs a hill climbing algorithm to constructs highly sensitive seeds. Iteratively, (1, *) pairs are swapped to reduce the overlap complexity of a random seed (see [18] for details). The SpEED software runs much faster and produces better seeds than all the other programs.
Faster overlap complexity
Comparison of overlap complexity computation algorithms
w  ℓ  OC  Fast OC  VFast OC 

9  15  0.012  0.004  0.004 
10  17  0.064  0.028  0.024 
11  18  0.116  0.048  0.044 
12  19  0.204  0.084  0.072 
13  20  0.340  0.136  0.116 
14  21  0.564  0.208  0.184 
15  23  2.792  0.948  0.828 
16  24  4.564  1.484  1.300 
17  25  7.276  2.648  1.992 
18  26  11.368  3.968  2.984 
Faster hill climbing
We notice that the value of o_{ ij }[q] is obtained by counting the number of 1's in a NWSE diagonal of OM_{ ij }; precisely, the positions considered are OM_{ ij }[r][t] with rt = q  ℓ_{ j }+ 1.
To simplify the code, we employ the convention that any time we use σ_{ rq }, OM_{ rq }, or OCM[r][q], we assume r < q, otherwise, we would use q, r instead.
A difference needs to be made in both Update Sigma and Update OM between the case r < q and r = q, since in the former the swap affects only one seed whereas in the latter it affects both. Notice also that the value of max(OM_{ rq }[t][i],0) is 1 when (OM_{ rq }[t][i],0) = 1 and 0 otherwise. Therefore, only when the previous value was 1 we subtract 1 from the appropriate component of σ. The value min(OM_{ rq }[t][i],0) is 1 only when OM_{ rq }[t][i] = 1 and, since a swap would cause this 1 to become a 1, the corresponding component of σ is incremented. Two other values, min(OM_{ rq }[t][i],0) and max(OM_{ rq }[t][i]), behave similarly. For the code of OCSigma it is sufficient to observe that (1 ≪ σ[t]) = 2^{σ[t]}.
Results
Comparison of hill climbing algorithms
w  N  p  k  [ℓ_{1}..ℓ_{ k }]  HC  Fast HC 

11  64  .70  16  [14..27]  7.79  1.24 
22  50  .85  10  [25..37]  10.79  1.35 
28  100  .90  8  [36..56]  39.83  3.64 
28  150  .90  8  [39..63]  69.14  5.86 
28  200  .90  8  [41..70]  108.74  8.49 
28  100  .90  16  [33..59]  471.51  41.42 
28  150  .90  16  [36..66]  788.79  62.50 
28  200  .90  16  [39..72]  1075.10  79.51 
Sensitivity comparison of computed spaced seeds for PatternHunter
w  N  p  BLAST  PH  PHII  Mandala  Iedera  SpEED  Fast HC 

(contig.)  (spaced)  (16 seeds)  
11  64  0.70  30.0196  46.7122  92.4114  92.3811  92.0708  93.2526  93.3406 
11  64  0.75  49.4494  69.5844  98.4289  98.4320  98.3391  98.6882  98.7156 
11  64  0.80  71.3993  88.2070  99.8449  99.8448  99.8366  99.8820  99.8859 
Sensitivity comparison of computed spaced seeds for BFAST
w  N  p  1 seed (contig.)  1 seed (spaced)  BFAST (16 seeds)  Mandala  Iedera  SpEED  Fast HC 

22  50  0.85  14.4649  26.8064  58.6907    60.1535  60.8127  60.9329 
22  50  0.90  36.6940  57.9846  87.3359    87.9894  88.5969  88.7120 
22  50  0.95  74.1153  90.8265  99.2249    99.2196  99.3659  99.3959 
Sensitivity comparison of computed spaced seeds of MegaBLAST weight
w  N  p  MegaBLAST  1 seed  Fast HC  

(contig.)  (spaced)  2 seeds  4 seeds  8 seeds  16 seeds  
28  100  0.90  39.1436  69.3241  79.6629  87.5674  92.7762  95.9170 
28  150  0.90  55.4870  87.6426  93.4308  97.0118  98.7430  99.5137 
28  200  0.90  67.4412  94.9876  97.8936  99.2937  99.7877  99.9409 
Discussion
We have provided a much faster implementation of the hill climbing heuristic of SpEED, the leading software for computing multiple spaced seeds in the Bernoulli model. Using the new implementation, some of the most challenging seeds have been improved and new, even more difficult ones, were provided. Still, many problems remain open in this important area. A modified heuristic is needed to be able to compare seeds of different lengths, as well as to address models different from Bernoulli.
Availability and requirements
Project name: SpEEDfast
Project home page: math.ryerson.ca/~silvana/SpEEDfast.cpp
Operating system(s): Platform independent
Programming language: C/C++
Other requirements: none
License: GNU GPL
Any restrictions to use by nonacademics: none
Availability of supporting data
The data sets supporting the results of this article are included within the article (and its Additional file 1).
Abbreviations
 BLAST:

Basic local alignment search tool
 OC:

Overlap complexity.
Declarations
Acknowledgements
Research partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).
Authors’ Affiliations
References
 Lipman D, Pearson W: Rapid and sensitive protein similarity searches. Science. 1985, 227 (4693): 14351441. 10.1126/science.2983426.PubMedGoogle Scholar
 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403410.PubMedGoogle Scholar
 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 33893402. 10.1093/nar/25.17.3389.PubMedPubMed CentralGoogle Scholar
 Califano A, Rigoutsos I: FLASH: a fast lookup algorithm for string homology. Computer Vision and Pattern Recognition 1993 Proceedings CVPR'93 1993 IEEE Computer Society Conference on. 1993, 353359.Google Scholar
 Buhler J: Efficient largescale sequence comparison by localitysensitive hashing. Bioinformatics. 2001, 17 (5): 419428. 10.1093/bioinformatics/17.5.419.PubMedGoogle Scholar
 Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002, 18 (3): 440445. 10.1093/bioinformatics/18.3.440.PubMedGoogle Scholar
 Burkhardt S, Kärkkäinen J: Better Filtering with Gapped qGrams. Fundam Inform. 2003, 56 (12): 5170.Google Scholar
 Brown DG: A survey of seeding for sequence alignments. In Bioinformatics Algorithms: Techniques and Applications. Edited by: Mandoiu I, Zelikovsky A. 2007, Hoboken: J. Wiley and Sons Inc, 117142.Google Scholar
 Li M, Ma B, Kisman D, Tromp J: PatternHunterII: Highly Sensitive and Fast Homology Search. J Bioinformatics and Computational Biology. 2004, 2 (3): 417440. 10.1142/S0219720004000661.Google Scholar
 Noé L, Kucherov G: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res. 2005, 33 (suppl 2): W540W543.PubMedPubMed CentralGoogle Scholar
 Homer N, Merriman B, Nelson SF: BFAST: An Alignment Tool for Large Scale Genome Resequencing. PLoS One. 2009, 4 (11): e776710.1371/journal.pone.0007767.PubMedPubMed CentralGoogle Scholar
 Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: Accurate Mapping of Short Colorspace Reads. PLoS Comput Biol. 2009, 5 (5): e100038610.1371/journal.pcbi.1000386.PubMedPubMed CentralGoogle Scholar
 Feng S, Tillier ER: A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics. 2007, 23 (10): 11951202. 10.1093/bioinformatics/btm114.PubMedGoogle Scholar
 Ma B, Li M: On the complexity of the spaced seeds. J Comput Syst Sci. 2007, 73 (7): 10241034. 10.1016/j.jcss.2007.03.008.Google Scholar
 Ma B, Yao H: Seed Optimization Is No Easier than Optimal Golomb Ruler Design. APBC. 2008, 133144.Google Scholar
 Buhler J, Keich U, Sun Y: Designing seeds for similarity search in genomic DNA In Proceedings of RECOMB'03. 2003, New York: ACM, 6775.Google Scholar
 Kucherov G, Noé L, Roytberg MA: A Unifying Framework for Seed Sensitivity and its Application to Subset Seeds. J Bioinformatics and Computational Biology. 2006, 4 (2): 553570. 10.1142/S0219720006001977.Google Scholar
 Ilie L, Ilie S, Mansouri Bigvand A: SpEED: fast computation of sensitive spaced seeds. Bioinformatics. 2011, 27 (17): 24332434. 10.1093/bioinformatics/btr368.PubMedGoogle Scholar
 Ilie L, Ilie S: Multiple spaced seeds for homology search. Bioinformatics. 2007, 23 (22): 29692977. 10.1093/bioinformatics/btm422.PubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.