An efficient algorithm for systematic analysis of nucleotide strings suitable for siRNA design
© Baranova et al; licensee BioMed Central Ltd. 2011
Received: 5 November 2010
Accepted: 27 May 2011
Published: 27 May 2011
The "off-target" silencing effect hinders the development of siRNA-based therapeutic and research applications. Existing solutions for finding possible locations of siRNA seats within a large database of genes are either too slow, miss a portion of the targets, or are simply not designed to handle a very large number of queries. We propose a new approach that reduces the computational time as compared to existing techniques.
The proposed method employs tree-based storage in the form of a modified truncated suffix tree to sort all possible short substrings within a given set of strings (i.e. a transcriptome). Using the new algorithm, we pre-computed a list of the best siRNA locations within each human gene ("siRNA seats"). siRNAs designed to reside within siRNA seats are less likely to hybridize off-target. These siRNA seats could be used as an input for the traditional "set-of-rules" type of siRNA designing software. The list of siRNA seats is available through a public database located at http://web.cos.gmu.edu/~gmanyam/siRNA_db/search.php
In an attempt to perform top-down prediction of human siRNAs with minimized off-target hybridization, we developed an efficient algorithm that employs suffix tree-based storage of substrings. Applications of this approach are not limited to optimal siRNA design, but can also be useful for other tasks involving selection of characteristic strings specific to individual genes. These strings could then be used as siRNA seats, as specific probes for gene expression studies by oligonucleotide-based microarrays, for the design of molecular beacon probes for Real-Time PCR and, generally, for any type of PCR primers.
siRNA-based silencing of gene expression involves homology-dependent suppression of the cognate mRNA at either the transcriptional or the post-transcriptional level. The most important part of this process is the interaction of the target mRNA with string-specific double-stranded RNA molecules (siRNAs) of about 21 nt with 3'-overhangs. Annealing of an siRNA to unrelated but partially homologous mRNAs interferes with the silencing process and diminishes its efficiency. Additionally, mRNAs with partial homology to siRNA molecules may also be degraded to some extent, evoking unwanted physiological effects. In clinical settings, e.g. when siRNA is applied as an antiviral treatment, this may lead to an imbalance of normal cellular functions that could, in turn, manifest as side effects of the therapy. This phenomenon, called 'off-target' silencing, is known as one of the most serious problems in RNA interference (RNAi) [3, 5]. Until a major improvement in siRNA design occurs, both the development of siRNA-based therapeutic applications and the interpretation of gene function and phenotypes resulting from RNAi experiments will be hindered.
When tested in vivo, about 80% of theoretically possible mammalian siRNAs were shown to be non-functional or suboptimal. To improve siRNA design, a set of rules for detecting 21-mer target sites was proposed, including a low G+C content, a lack of internal repeats and an A/U-rich 5' end. The importance of certain secondary structures at the siRNA target site and of the absence of short string matches to the 3' areas of other human genes was also emphasized. A number of reliable algorithms for the prediction of highly specific and efficient siRNAs have been published [Rev. in ]. Nevertheless, minimization of siRNA off-target effects still needs major improvement.
A typical approach to reducing off-target effects is a similarity search with the basic local alignment search tool (BLAST) against the organism-specific transcriptome dataset. BLAST promptly returns possible secondary targets, but a proportion of the significant alignments may be missed. On the other hand, the exhaustive Smith-Waterman local alignment algorithm returns accurate answers but is so time-consuming that it often requires hardware augmentation [14, 15]. Several authors proposed adjustments for mismatch tolerance [12, 16] that may lower the effectiveness of the siRNAs found and, again, are costly to calculate.
One possible way to speed up the calculations without losing specificity and sensitivity is to pre-compute transcriptome-specific sets of gene-specific strings with decreased redundancy ("siRNA seats"). For example, Naito et al. aligned all the human RefSeq and UniGene strings onto the human genomic strings, retrieved duplicate-free exons and strings spanning exon-exon junctions, and pre-computed gene-specific 19-nt strings with a smaller number of collaborative off-target hits, defined as complete or partial matches of multiple 19-nt substrings. Although it represents an important step forward, this approach yields siRNA candidates that may still cause an off-target effect, as stretches of as few as 11-to-15 consecutive nt are enough to produce unwanted silencing.
The next improvement was made by the Comprehensive Redundancy Minimizer (CRM) algorithm, which allows one to map all unique short strings ("kernels") 9-to-15 nt in size (length "N") within large sets of strings, e.g. an entire transcriptome. The CRM algorithm ensures that every predicted siRNA seat of length 21 is comprised of overlapping kernels of length N, where N is between 9 and 17. The CRM-based filtering was tested on two complete transcriptomes, human and murine, and proved efficient on a collection of published sets of siRNAs with known efficacies.
Here we suggest an alternative to the CRM algorithm that highlights gene-specific siRNA seats with minimized off-target annealing in a cost-efficient way. Our algorithm relies on a search-efficient truncated suffix tree data structure. The tree-based organization saves computation time both when storing substrings and when searching for them within a gene. The algorithm outputs results in an easily reusable tab-delimited form.
The idea of suffix trees dates back to the concept of a position tree. The construction was greatly simplified by McCreight, and also by Ukkonen. Ukkonen provided the first linear-time online construction of suffix trees, now known as Ukkonen's algorithm. This data structure is reminiscent of the binary trees widely used in computer science, and suffix trees have in fact been used in the information science literature. In recent years, the concept has found numerous applications in computational biology [24–27]. The data structure we used in this work is closely related to the truncated suffix trees utilized in [25–27]; however, it contains the information about the positions of the substrings in the database and lacks horizontal links, which makes it more suitable for the siRNA application at hand.
Using the new algorithm, we pre-computed a list of the best siRNA locations within each human gene ("siRNA seats"). The complete list of siRNA locations with minimized off-target hybridization is available at http://web.cos.gmu.edu/~gmanyam/siRNA_db/search.php. These siRNA seats could be used as an input for the traditional "set-of-rules" type of siRNA designing software.
Data structure and problem formulation
Let us now explain in detail the type of structure on which our algorithm for sorting and analyzing the substrings within the entire transcriptome is based. Each substring of a certain length n is stored in a modified n-truncated suffix tree, with each node having 4 pointers associated with the nucleotides A, C, G or T, so that each string is represented by a unique path from the root of the tree to a leaf. The string storing procedure is carried out in the following way.
Procedure 1 [String Storage Procedure] We create a branch from the root of the tree to a vertex in the first level of the tree corresponding to the first character in the string (A, C, G or T). We then create a branch from this vertex to a vertex in the second level corresponding to the second character, and so on until we have reached the nth level of the tree, where n is the total number of characters in the string. If a certain branch of the tree already exists, we simply follow that branch to the next level of the tree without creating a new branch.
In the tree, each string is accompanied by certain information, such as its frequency, gene of origin and the position within the gene. String-specific information is stored in each of the leaves, allowing one to avoid unnecessary string comparisons. Furthermore, the storage space is generated on demand, so that no memory is wasted. We call the resulting tree structure with all the information stored in the leaves a modified n-truncated suffix tree.
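Procedure 1 can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `store`, `children` and `occurrences` are hypothetical, and a dictionary with at most four keys stands in for the four per-node pointers described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node layout: at most four children, keyed by A, C, G or T.
    children: dict = field(default_factory=dict)
    # Filled only at leaves: one (gene_id, position) pair per occurrence.
    occurrences: list = field(default_factory=list)


def store(root: Node, s: str, gene_id: str, position: int) -> Node:
    """Store string s along a root-to-leaf path, creating branches on demand."""
    node = root
    for ch in s:
        # Follow an existing branch, or create it if it is missing.
        if ch not in node.children:
            node.children[ch] = Node()
        node = node.children[ch]
    # The leaf records where the string was found; its frequency is the
    # length of this list.
    node.occurrences.append((gene_id, position))
    return node
```

Storing the same string twice follows the existing path and only appends a new occurrence to the leaf, which mirrors the on-demand allocation described above.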
The approach described above is similar to the suffix tree construction used in the Entropic Profiler software. However, instead of connecting nodes at the same depth within the tree with "side links", we simply store the location of each substring within the database, which enables us to save on storage and to "sweep" through the siRNA seat computation as quickly as possible. To explain this distinction, let us formalize some important notions to be used in the description of the proposed algorithm.
Denote by G a set of genes, represented as strings. Consider two integers n and N, with n < N, where n plays the role of a threshold. Let U be the set of all substrings of length N in the set G (referred to as N-strings). Two strings x and y in U are called duplicate N-strings if and only if they belong to different genes and contain at least one pair of substrings of length n (referred to as n-strings) which are equal (i.e. there is a substring of x of length n which is equal to some substring of y of the same length). A string x is called a unique N-string if it is not a duplicate of any other string in U. Any unique N-string x in U forms a siRNA seat.
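The core of the duplicate test in this definition can be written directly as a brute-force check. The sketch below is only an illustration of the definition, not part of the proposed algorithm; the function name `share_n_string` is hypothetical, and the caller is assumed to have already established that x and y come from different genes.

```python
def share_n_string(x: str, y: str, n: int) -> bool:
    """Return True if some substring of x of length n equals a substring of y."""
    n_strings_of_x = {x[i:i + n] for i in range(len(x) - n + 1)}
    return any(y[j:j + n] in n_strings_of_x for j in range(len(y) - n + 1))
```

Applying this test to every pair of N-strings would be far too slow for a whole transcriptome, which is the motivation for the tree-based approach.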
The problem to be solved by means of these tools can be formally described as follows:
Given a gene belonging to G, find, from left to right, all maximal substrings of length N that are unique according to the threshold n. Generate a database consisting of these unique strings (siRNA seats) collected from all genes in G.
Notice that all seats are of the same length N unless they happen to be on the boundary of the original gene, in which case they can be as short as n characters. The minimal length of a siRNA seat is n.
Given a database of genes and a fixed positive integer n, the algorithm stores all n-substrings for the entire collection of genes. Additionally, for any n-string, we store the location of each occurrence of this n-string under the same index of the tree. Sorting the strings facilitates downstream analysis of the data. More precisely, the tree-sort algorithm takes as input a database of genes composed of the nucleotides A, C, G, and T. To store all possible n-strings, we need a full tree with n + 1 levels (counting the root), labelled 0 through n. Since there are four types of nucleotides, the kth level has at most 4^k vertices. Each vertex in the kth level has branches to four vertices in the (k + 1)th level.
In order to locate siRNA seats once the suffix tree has been built, we first write all unique n-substrings to the list UNIQUE_n, which also contains the associated gene symbol and the location of the substring within that gene. After the unique strings have been identified in this fashion, the siRNA "seats" can be generated on-the-fly without re-reading the transcriptome. Indeed, for each unique substring specified in the list UNIQUE_n, we need to look at the N - n characters succeeding it in the original gene. If not all of the N - n characters are available due to the proximity of the gene boundary, only the available characters are taken into consideration. All strings of length N in the resulting set are checked for uniqueness. Each unique string found this way forms a new siRNA seat. Algorithm 1 formalizes the steps described above.
Algorithm 1. Suffix tree-based calculation of siRNA seats of length N with threshold n
Input: G - the set of all genes in the database, a threshold value n and the siRNA length N.
Output: siRNA seats of length N with threshold value n.
Put i = 1. Initialize the vector counter = [0, ..., 0].
While (the set of remaining n-strings in G is non-empty)
    Read an n-string from G and denote it s_i. Store s_i in the suffix tree according to Procedure 1. Store the corresponding gene ID and the location of the string within that gene in the leaf. If the substring s_i already exists in the tree and belongs to a different gene, set counter_i = 1 to reflect the fact that it is a duplicate.
    i = i + 1
Identify leaves with counter_i = 0 (unique n-strings) and save their locations in UNIQUE_n.
Let n_unique be the total number of n-strings in UNIQUE_n.
For (j from 1 to n_unique)
    For the j-th n-string in the list UNIQUE_n, find a (larger) N-string containing it by scanning all available N - n characters to the right of it in the same gene (found by gene ID). Mark the resulting N-string as unique if all of its n-substrings exist in the list UNIQUE_n (Definition 1).
Return all unique N-strings - these represent all siRNA seats found in G.
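The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: a plain dictionary keyed by n-string stands in for the leaves of the truncated suffix tree, and the names `sirna_seats`, `genes_by_n_string` and `unique_n` are hypothetical. Following the description above, an n-string counts as a duplicate only if it occurs in more than one gene, and windows near the end of a gene are truncated to the available characters.

```python
from collections import defaultdict


def sirna_seats(genes: dict, n: int, N: int) -> list:
    """genes maps gene ID -> nucleotide string; returns (gene ID, start, seat)."""
    # Pass 1: record, for every n-string, the set of genes that contain it
    # (a dictionary is used here in place of the tree leaves).
    genes_by_n_string = defaultdict(set)
    for gene_id, seq in genes.items():
        for pos in range(len(seq) - n + 1):
            genes_by_n_string[seq[pos:pos + n]].add(gene_id)

    # UNIQUE_n: locations of n-strings that occur in exactly one gene.
    unique_n = {
        (gene_id, pos)
        for gene_id, seq in genes.items()
        for pos in range(len(seq) - n + 1)
        if len(genes_by_n_string[seq[pos:pos + n]]) == 1
    }

    # Pass 2: extend each unique n-string by up to N - n characters to the
    # right and keep the window if all of its n-substrings are unique.
    seats = []
    for gene_id, pos in sorted(unique_n):
        window = genes[gene_id][pos:pos + N]  # shorter than N at the gene end
        last_start = pos + len(window) - n
        if all((gene_id, p) in unique_n for p in range(pos, last_start + 1)):
            seats.append((gene_id, pos, window))
    return seats
```

For two toy genes `{"g1": "AAACCCGGG", "g2": "TTTCCCAAA"}` with n = 3 and N = 5, the shared 3-strings AAA and CCC rule out every window that contains them, so the only full-length seats are CCGGG in g1 and TTTCC in g2; the sketch also reports the truncated boundary windows CGGG and GGG at the end of g1.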
It is worth noting that the tree construction utilized by Algorithm 1 allows for quick modification of the results in case new genes are added to the database, with no need to re-create the suffix tree. The new gene information, in the form of a collection of n-strings, will be recorded in the tree according to Procedure 1, which will result in incremental changes to the list UNIQUE_n. Any possible duplicates arising from this change will be immediately detected when performing Steps 4-5 of the Algorithm and will allow for fast recalculation of the siRNA seats.
The new algorithm requires the following four main types of memory: character arrays to store the gene strings, gene structures containing pointers to each field in the gene strings, storage structures to track the location of each n-string, and the tree structures to sort the n-strings. Suppose our database has G genes with average length ΔL. Then the memory for the gene strings is approximately GΔL characters. Each gene structure contains a pointer to the gene cluster, a pointer to the gene ID, a pointer to the actual gene string, and an integer containing the length of the string. Using 32-bit pointers, each gene structure requires 16 bytes of memory, and we need G gene structures. Thus the total memory for the gene structures is 16G bytes. We need a storage structure for each n-string in the database. The number of n-strings is equal to G(ΔL + 1 - n). Since n ≪ ΔL, we approximate it by GΔL. The storage structure contains a pointer to the associated gene, an integer specifying the location of the n-string in the gene, and a pointer to another storage structure containing the previous occurrence of this exact n-string. So the total memory for the storage structures is approximately 12GΔL bytes.
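The memory estimates above can be turned into a small calculator. The sketch below is only a back-of-the-envelope illustration: the function name `memory_estimate` is hypothetical, one byte per stored nucleotide character is an assumption not stated in the text, and the tree memory is left out because it depends on how many branches are actually created.

```python
def memory_estimate(num_genes: int, avg_len: int, n: int) -> dict:
    """Approximate memory in bytes, assuming 32-bit pointers and integers."""
    n_strings = num_genes * (avg_len + 1 - n)  # exact count of n-strings
    return {
        # Assumption: one byte per nucleotide character.
        "gene_strings": num_genes * avg_len,
        # Three pointers plus one integer, 4 bytes each.
        "gene_structures": 16 * num_genes,
        # Two pointers plus one integer per n-string, 4 bytes each.
        "storage_structures": 12 * n_strings,
        # The approximation used in the text, valid when n << avg_len.
        "storage_structures_approx": 12 * num_genes * avg_len,
    }
```

For purely hypothetical values of 20,000 genes with an average length of 3,000 nt and n = 15, the storage structures dominate at roughly 0.72 GB, against about 60 MB for the gene strings and 0.32 MB for the gene structures.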
Table. The size and fill-in of the tree needed to store the initial dataset and the memory consumption (columns: Branches in Full Tree; % Branches Used; Full Tree Memory; Actual Tree Memory).
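The "Branches in Full Tree" quantity follows from the level sizes given earlier: level k of a full tree holds 4^k vertices, and every vertex other than the root is reached by exactly one branch. The sketch below computes this count and the corresponding fill-in percentage; the function names are hypothetical, and memory per branch is not computed because the text does not state the size of a tree node.

```python
def full_tree_branches(n: int) -> int:
    """Branches in a full 4-ary tree with levels 0..n (one per non-root vertex)."""
    return sum(4 ** k for k in range(1, n + 1))


def percent_branches_used(actual_branches: int, n: int) -> float:
    """Share of the full tree's branches that were actually created."""
    return 100.0 * actual_branches / full_tree_branches(n)
```

The sum has the closed form (4^(n+1) - 4) / 3, so the full tree grows by roughly a factor of four with each additional nucleotide in n.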
Implementation and testing
To demonstrate the utility of the novel algorithm, we applied it to parse a non-redundant set of human transcripts into substrings of 11 to 17 nucleotides and to extract the sets of siRNA seats comprised of substrings of a given length (Figure 3).
Human mRNA strings were extracted from the NCBI UniGene dataset (build #219). For each gene, the longest reference mRNA string with an NM identifier was extracted and further processed using the new algorithm.
Predicted siRNA seats were placed in a MySQL database. To provide access to the siRNA seats stored in this database, a web interface was built using PHP. The interface allows searching by HUGO-approved Gene Symbol, NCBI Entrez Gene ID, GenBank Accession or UniGene cluster ID for siRNA seats comprised of unique oligonucleotides of a selected length. As output, the siRNA seat database lists all seats with lengths equal to or larger than 19 nucleotides, with their relative positions within the respective mRNA string. For the convenience of the user, each siRNA seat search also returns the string of the mRNA template used for the tree parsing and some general information about the queried gene, including its exon/intron structure. We envision that users may consult the siRNA seat database before embarking on siRNA design, as gene-specific lists of siRNA seats with minimized off-target hybridization may be used as input for any conventional siRNA designing software instead of the entire string corresponding to the gene of interest. The searchable database of all possible human siRNA seats is available at http://web.cos.gmu.edu/~gmanyam/siRNA_db/search.php.
String-specific small interfering RNAs (siRNAs) could be used both as therapeutic molecules and as a new instrument for drug target discovery. Cellular and animal models have already demonstrated the potential of siRNA-based treatments for cancer, viral infections and inflammatory diseases. However, the development of siRNA-based therapeutics is hampered by 'off-target' silencing effects that have to be minimized in order to reduce the possibility of side effects.
One relatively straightforward approach to 'off-target' minimization is to design siRNA molecules that pair with unique locations within mRNA targets. The traditional "bottom-up" approach to siRNA design implies the exclusion of any possible short string matches using BLAST or the Smith-Waterman algorithm. On the other hand, one may employ a "top-down" approach by using a pre-computed set of the least redundant locations within the entire transcriptome (siRNA seats) as input for the traditional siRNA designing software. The "top-down" approach requires less siRNA designing skill from a novice researcher and limits the set of gene-specific candidate siRNAs to a smaller number of molecules in need of experimental verification. However, the transcriptome-wide extraction of the least redundant substrings is not a trivial task. The first algorithm of this kind, CRM, successfully completed the task but was far from efficient.
Here we propose a substantially more efficient algorithm that employs tree-based storage of the substrings, which is the first application of this mathematical concept in this context. The approach developed here is not limited to optimal siRNA design, but can also be useful for other tasks, such as selecting characteristic strings specific to individual genes in certain organisms. These strings could then be used as siRNA seats, as specific probes for gene expression studies by oligonucleotide-based microarrays, for the design of molecular beacon probes for Real-Time PCR and, generally, for any type of PCR primers.
Another important advantage of the new algorithm over CRM is that the storage structure created by the new algorithm automatically records the frequency for each substring. Therefore, this suffix tree based approach can be easily utilized to perform other types of transcriptome analysis, including a search for unique substrings and absent substrings, analysis of distributions of the substrings associated with various biological features, e.g. promoters, 3' untranslated regions and open reading frames. This further analysis is the subject of an ongoing study.
Among the limitations of the proposed algorithm are 1) the need for periodic re-analysis of the available siRNA seats within the transcriptome in order to incorporate newly discovered functional RNA transcripts; and 2) the inevitable omission of imperfect siRNA seats that might pair with the respective siRNA in violation of the "seed rule" and nonetheless act as efficient miRNAs instead. The latter possibility needs to be studied experimentally by systematic analysis of the rejected siRNA seats using miRNA-recognizing algorithms, and is included in the plan for future development.
Here we present a new efficient suffix tree-based algorithm that delivers a comprehensive and systematic analysis of substrings within an arbitrary set of biological strings. The proposed algorithm may help to find biologically significant features within large gene databases. In this paper, we described an application of this algorithm to an exhaustive search for the "siRNA seats" in the entire human transcriptome. The resulting database of siRNA seats is available at http://web.cos.gmu.edu/~gmanyam/siRNA_db/search.php.
The authors are grateful to Dr. Tariq Alsheddi for making the CRM code available for the comparative study and to Dr. Maria Stepanova for insightful discussions. We would also like to gratefully acknowledge the input provided by anonymous referees, which greatly improved the quality of this publication.
This research was partially covered by Russian Ministry of Science grant 02.512.12.2060 "Development of efficient approaches for in vivo delivery of the genetic information into target cells for the therapy of socially important diseases".
1. Verdel A, Vavasseur A, Le Gorrec M, Touat-Todeschini L: Common themes in siRNA-mediated epigenetic silencing pathways. Int J Dev Biol. 2009, 53: 245-257. 10.1387/ijdb.082691av.
2. Scherr M, Morgan MA, Eder M: Gene silencing mediated by small interfering RNAs in mammalian cells. Curr Med Chem. 2003, 10: 245-256.
3. Svoboda P: Off-targeting and other non-specific effects of RNAi experiments in mammalian cells. Curr Opin Mol Ther. 2007, 9: 248-257.
4. Scacheri PC, Rozenblatt-Rosen O, Caplen NJ, Wolfsberg TG, Umayam L, Lee JC, Hughes CM, Shanmugam KS, Bhattacharjee A, Meyerson M, Collins FS: Short interfering RNAs can induce unexpected and divergent changes in the levels of untargeted proteins in mammalian cells. Proc Natl Acad Sci USA. 2004, 101: 1892-1897. 10.1073/pnas.0308698100.
5. Hajeri PB, Singh SK: siRNAs: their potential as therapeutic agents - Part I. Designing of siRNAs. Drug Discov Today. 2009, 14: 851-858. 10.1016/j.drudis.2009.06.001.
6. Ui-Tei K, Naito Y, Takahashi F, Haraguchi T, Ohki-Hamazaki H, Juni A, Ueda R, Saigo K: Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res. 2004, 32: 936-948. 10.1093/nar/gkh247.
7. Hsieh AC, Bo R, Manola J, Vazquez F, Bare O, Khvorova A, Scaringe S, Sellers WR: A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens. Nucleic Acids Res. 2004, 32: 893-901. 10.1093/nar/gkh238.
8. Hung CF, Lu KC, Cheng TL, Wu RH, Huang LY, Teng CF, Chang WT: A novel siRNA validation system for functional screening and identification of effective RNAi probes in mammalian cells. Biochem Biophys Res Commun. 2006, 346: 707-720. 10.1016/j.bbrc.2006.05.164.
9. Jackson AL, Linsley PS: Noise amidst the silence: off-target effects of siRNAs? Trends Genet. 2004, 20: 521-524. 10.1016/j.tig.2004.08.006.
10. Tilesi F, Fradiani P, Socci V, Willems D, Ascenzioni F: Design and validation of siRNAs and shRNAs. Curr Opin Mol Ther. 2009, 11: 156-164.
11. Cui W, Ning J, Naik UP, Duncan MK: OptiRNAi, an RNAi design tool. Comput Methods Programs Biomed. 2004, 75: 67-73. 10.1016/j.cmpb.2003.09.002.
12. Naito Y, Yamada T, Matsumiya T, Ui-Tei K, Morishita S: dsCheck: highly sensitive off-target search software for double-stranded RNA-mediated RNA interference. Nucleic Acids Res. 2005, 33: W589-W591. 10.1093/nar/gki419.
13. Dai X, Zhao PX: pssRNAMiner: a plant short small RNA regulatory cascade analysis server. Nucleic Acids Res. 2008, 6: W114-11.
14. Li IT, Shum W, Truong K: 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinformatics. 2007, 8: 185. 10.1186/1471-2105-8-185.
15. Manavski SA, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics. 2008, 9: S10.
16. Chalk AM, Sonnhammer EL: siRNA specificity searching incorporating mismatch tolerance data. Bioinformatics. 2008, 24: 1316-1317. 10.1093/bioinformatics/btn121.
17. Jackson AL, Bartz SR, Schelter J, Burchard J, Mao M, Li B, Cavet G, Linsley PS: Expression profiling reveals off-target gene regulation by RNAi. Nat Biotechnol. 2003, 21: 635-637. 10.1038/nbt831.
18. Alsheddi T, Vasin L, Meduri R, Randhawa M, Glazko G, Baranova A: siRNAs with high specificity to the target: a systematic design by CRM algorithm. Mol Biol (Mosk). 2008, 42: 163-171. 10.1134/S0026893308010251.
19. Saetrom P, Snove O: A comparison of siRNA efficacy predictors. Biochem Biophys Res Commun. 2004, 321: 247-253. 10.1016/j.bbrc.2004.06.116.
20. Weiner P: Linear pattern matching algorithm. 14th Annual IEEE Symposium on Switching and Automata Theory. 1973, 1-11.
21. McCreight EM: A space-economical suffix tree construction algorithm. Journal of the ACM. 1976, 23: 262-272. 10.1145/321941.321946.
22. Ukkonen E: On-line construction of suffix trees. Algorithmica. 1995, 14 (3): 249-260. 10.1007/BF01206331.
23. Na J, Apostolico A, Iliopoulos CS, Park K: Truncated suffix trees and their application to data compression. Theoretical Computer Science. 2003, 304: 87-101. 10.1016/S0304-3975(03)00053-7.
24. Giegerich R, Kurtz S: From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Algorithmica. 1997, 19 (3): 331-353. 10.1007/PL00009177.
25. Schulz MH, Bauer S, Robinson PN: The generalised k-Truncated Suffix Tree for time-and space-efficient searches in multiple DNA or protein sequences. International journal of bioinformatics research and applications. 2008, 4: 81-95. 10.1504/IJBRA.2008.017165.
26. Fernandes F, Freitas AT, Almeida JS, Vinga S: Entropic Profiler - detection of conservation in genomes using information theory. BMC Research Notes. 2009, 2: 72. 10.1186/1756-0500-2-72.
27. Apostolico A, Denas O: Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms for Molecular Biology. 2008, 3: 13. 10.1186/1748-7188-3-13.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/2.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.