Sequence Searcher: A Java tool to perform regular expression and fuzzy searches of multiple DNA and protein sequences
© Upton et al; licensee BioMed Central Ltd. 2009
Received: 03 November 2008
Accepted: 30 January 2009
Published: 30 January 2009
Many sequence-searching tools have factors that limit their use. For example, they may be platform specific, enforce restrictive size limits on the sequences to be searched, or allow searches of only DNA or only protein.
We present an easy-to-use, fast, platform-independent tool to search for amino acid or nucleotide patterns within one or many protein or nucleic acid sequences. The user can choose to search for regular expressions or perform a fuzzy search in which a specified number of errors is accepted during matching of a sequence. Positions of mismatches in fuzzy searches are displayed graphically to the user.
SeqS provides an improved feature set and functions as a stand-alone tool or could be integrated into other bioinformatics platforms.
Searching for specific patterns in protein and DNA sequences is a common analysis performed by molecular biologists. Detection of restriction enzyme cleavage sites in DNA sequences was an early use of this pattern matching process. Later, as the protein databases grew, the PROSITE motif database was developed. These protein motifs are written as regular expressions that capture the variability within a consensus sequence from a short, highly conserved region in a multiple alignment. As the volume and diversity of genomic information grows, it is necessary to modify PROSITE patterns to allow them to match more diverse homologs. Searching through genomic sequences for conserved nucleotide patterns such as transcription factor binding sites is another use for this type of analysis. Large-scale sequencing has led to some automated bioinformatics analyses, but pattern searching is such a common "hands on" interactive procedure that we have developed an easy-to-use tool, Sequence Searcher (SeqS), that supports searching for user-specified patterns in multiple protein and nucleotide sequences. SeqS has been integrated into several of the Viral Bioinformatics Resource Center tools and can therefore read sequences directly out of the VOCs database, but it also functions as a stand-alone program with the ability to manage sequences much larger than viral genomes.
We implemented a brute-force fuzzy search algorithm and made use of the Jakarta ORO libraries for Perl-like regular expressions. To speed up the searches and reduce memory requirements, we developed an algorithm to reverse both fuzzy patterns and regular expressions. In a search on DNA, SeqS first searches the top strand (the sequence itself), and then the bottom strand. However, rather than creating a duplicate of the nucleotide sequence (the bottom strand), which is expensive in terms of memory, SeqS reverses and complements the query and searches the top strand again.
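The strand-handling idea described above can be illustrated with a minimal Java sketch. It is not the SeqS source code: the class and method names are hypothetical, "errors" are assumed to mean substitutions only (no insertions or deletions), and only plain A/C/G/T patterns are handled. Reversing a regular expression, which SeqS also does, is not attempted here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; names and behaviour are assumptions, not the SeqS code. */
final class FuzzyStrandSearch {

    /** Reverse-complements a plain A/C/G/T pattern (ambiguity codes are omitted for brevity). */
    static String reverseComplement(String pattern) {
        StringBuilder sb = new StringBuilder(pattern.length());
        for (int i = pattern.length() - 1; i >= 0; i--) {
            char c = pattern.charAt(i);
            switch (c) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default: throw new IllegalArgumentException("Unsupported base: " + c);
            }
        }
        return sb.toString();
    }

    /**
     * Brute-force fuzzy search: tests every window of the sequence and counts
     * substitutions, abandoning a window as soon as the limit is exceeded.
     * Returns the 0-based start positions of all accepted windows.
     */
    static List<Integer> fuzzyFind(String sequence, String pattern, int maxMismatches) {
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start + pattern.length() <= sequence.length(); start++) {
            int mismatches = 0;
            for (int j = 0; j < pattern.length() && mismatches <= maxMismatches; j++) {
                if (sequence.charAt(start + j) != pattern.charAt(j)) {
                    mismatches++;
                }
            }
            if (mismatches <= maxMismatches) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String top = "AAGAATTCGGTTTCCA";   // only the top strand is held in memory
        String query = "GAATTG";
        // Top-strand search uses the query as given.
        System.out.println("top strand hits: " + fuzzyFind(top, query, 1));
        // Bottom-strand search reuses the same sequence with a reverse-complemented query.
        System.out.println("bottom strand hits: " + fuzzyFind(top, reverseComplement(query), 1));
    }
}
```

The design point is that the query is short while the sequence may be very large, so transforming the query costs almost nothing, whereas building a second copy of the sequence would double memory use.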
Upon completion of a search, the results panel is presented, showing the data in tabular format (Figure 1). The table can be sorted by any data column using the column header (sequence, match, match start, match stop, confidence, strand), and the result set can be filtered by sequence and by strand using drop-down menus. The search parameters are also reported. To facilitate interpretation of fuzzy search results, an additional graphical representation of the match is included in the form of a multi-coloured line divided into segments, one for each character of the pattern match. A segment is coloured green if a character match is exact, orange if it matches an ambiguity character and red in the case of a mismatch. The user can choose to save the results (all or selected rows) to a tab-delimited text file that is easily imported into a spreadsheet for further analysis.
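The green, orange and red colouring rule can be sketched in a few lines of Java. This is an assumption-laden illustration, not the SeqS implementation: the class, enum and method names are hypothetical, ambiguity characters are assumed to occur in the pattern rather than in the sequence, and the nucleotide codes follow the standard IUPAC nomenclature.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the per-position colouring rule; not the SeqS code. */
final class MatchColouring {

    enum Segment { GREEN, ORANGE, RED }

    /** IUPAC nucleotide codes mapped to the bases each one stands for. */
    private static final Map<Character, String> IUPAC = new HashMap<>();
    static {
        IUPAC.put('A', "A");   IUPAC.put('C', "C");   IUPAC.put('G', "G");   IUPAC.put('T', "T");
        IUPAC.put('R', "AG");  IUPAC.put('Y', "CT");  IUPAC.put('S', "CG");  IUPAC.put('W', "AT");
        IUPAC.put('K', "GT");  IUPAC.put('M', "AC");
        IUPAC.put('B', "CGT"); IUPAC.put('D', "AGT"); IUPAC.put('H', "ACT"); IUPAC.put('V', "ACG");
        IUPAC.put('N', "ACGT");
    }

    /** Classifies one aligned position of the pattern against the matched sequence. */
    static Segment classify(char patternChar, char sequenceChar) {
        if (patternChar == sequenceChar) {
            return Segment.GREEN;                      // identical characters
        }
        String allowed = IUPAC.get(patternChar);
        if (allowed != null && allowed.indexOf(sequenceChar) >= 0) {
            return Segment.ORANGE;                     // covered by an ambiguity code
        }
        return Segment.RED;                            // mismatch
    }

    /** Returns one segment per pattern position for a match of equal length. */
    static Segment[] colour(String pattern, String matched) {
        if (pattern.length() != matched.length()) {
            throw new IllegalArgumentException("Pattern and match must have the same length");
        }
        Segment[] segments = new Segment[pattern.length()];
        for (int i = 0; i < pattern.length(); i++) {
            segments[i] = classify(pattern.charAt(i), matched.charAt(i));
        }
        return segments;
    }

    public static void main(String[] args) {
        // Pattern GRATTC against matched text GAATTG: R covers A, and the last position differs.
        for (Segment s : colour("GRATTC", "GAATTG")) {
            System.out.println(s);
        }
    }
}
```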
Table: Examples of SeqS searches on DNA sequences totalling 170 MB. All searches took less than 4 seconds.
Column headings: Regular Expression Search | No. of mismatches | No. of hits | No. of hits
SeqS is a versatile tool that can be used as a stand-alone program or easily incorporated into more complex bioinformatics workbenches. It provides the ability to search multiple sequences in a single run with regular expressions or fuzzy patterns. Results are displayed in sortable tables and graphics are used to show fuzzy matches. To enable viewing of results with genome annotations, the core of SeqS has been incorporated into the Viral Genome Organizer and Base-By-Base tools, which can read GenBank files.
Availability and requirements
Project name: Sequence Searcher (SeqS)
Project homepage: http://www.virology.ca/tools/SequenceSearcher
Operating system: Platform independent
Programming language: Java
Other requirements: Java 1.4 or higher; Java Web Start
License: SeqS is distributed under the Open Software License.
Any restrictions to use by non-academics: None
This work was supported by NIAID grant HHSN266200400036C.
- Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P: PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 2002, 3 (3): 265-274. 10.1093/bib/3.3.265.
- Ehlers A, Osborne J, Slack S, Roper RL, Upton C: Poxvirus Orthologous Clusters (POCs). Bioinformatics. 2002, 18 (11): 1544-1545. 10.1093/bioinformatics/18.11.1544.
- Apache Software Foundation, Jakarta ORO. [http://jakarta.apache.org/oro/index.html]
- Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences. [http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html]
- Upton C, Hogg D, Perrin D, Boone M, Harris NL: Viral genome organizer: a system for analyzing complete viral genomes. Virus Res. 2000, 70 (1–2): 55-64. 10.1016/S0168-1702(00)00210-0.
- Brodie R, Smith AJ, Roper RL, Tcherepanov V, Upton C: Base-By-Base: single nucleotide-level analysis of whole viral genome alignments. BMC Bioinformatics. 2004, 5: 96. 10.1186/1471-2105-5-96.