Estimating overannotation across prokaryotic genomes using BLAST+, UBLAST, LAST and BLAT
© Moreno-Hagelsieb and Hudy-Yuffa; licensee BioMed Central Ltd. 2014
Received: 26 November 2013
Accepted: 11 September 2014
Published: 16 September 2014
As the number of genomes in public databases increases, it becomes more important to be able to quickly choose the best annotated genomes for further analyses in comparative genomics and evolution. A proxy for annotation quality is the estimation of overannotation by comparing annotated coding genes against the SwissProt database. NCBI’s BLAST (BLAST+) is the common software of choice to compare these sequences. Newer programs that run in a fraction of the time taken by BLAST+ might miss matches that BLAST+ would find. However, the results might still be useful to calculate overannotation. We thus decided to compare the overannotation estimates yielded by three such programs, UBLAST, LAST and the BLAST-Like Alignment Tool (BLAT), and to test non-redundant versions of the SwissProt database to reduce the number of comparisons necessary.
We found that all three programs, UBLAST, LAST and BLAT, tend to produce overannotation estimates similar to those obtained with BLAST+. As would be expected, results varied the most from those obtained with BLAST+ in genomes with fewer proteins matching sequences in the SwissProt database. UBLAST was the fastest running algorithm, and showed the smallest variation from the results obtained using BLAST+. Reduced SwissProt databases did not seem to affect the results much, but the reduction in time was modest compared to that obtained from UBLAST, LAST, or BLAT.
Although the faster programs miss sequence matches otherwise found by NCBI’s BLAST, the overannotation estimates are very similar, and thus these programs can be used with confidence for this task.
The continuing growth of the number of genome sequences in public databases has become an ever-present meme in the introduction to most bioinformatics articles. It is therefore important to develop fast methods for analyzing such amounts of genomic data. Overannotation, an estimate of the proportion of false genes annotated into a genome, can work as a proxy for genome annotation quality (see examples of use in [1–6]). In this regard, Skovgaard et al. developed a method to estimate the number of genes that should be annotated in a genome. The method is based on comparing the proteins encoded by the annotated genes of a genome against the SwissProt database [8, 9] (see Data and methods for further details).
In a mini review about the 1000th genome deposited into public databases, Lagesen et al. made a point about the time required to analyze a high number of genomes. In the case of estimating overannotation by the SwissProt method, the bottleneck would be in comparing the annotated coding genes to the SwissProt database, which is commonly performed, like many other protein sequence comparisons, using some version of NCBI’s BLAST [11, 12]. However, there have been a few new programs that promise much faster sequence comparisons, such as the BLAST-Like Alignment Tool (BLAT), LAST, and the sequence analysis multitool USEARCH, which contains UBLAST as a substitute for BLAST. While these algorithms produce results faster, the speed comes at the cost of missing some proportion of similar sequences. However, the calculation of overannotation by the SwissProt method depends on the proportion of large and small annotated proteins that find a match in the SwissProt database, rather than on the total number of sequences finding matches or on the number of matches found by each sequence. Therefore, even if these new sequence comparison tools miss matches that would be found by BLAST, the relative proportions of small and large proteins finding matches in the SwissProt database might be similar and, thus, render these newer sequence comparison tools just as useful and more efficient for the task of estimating overannotation.
In this work, we tested the performance of BLAT, LAST and UBLAST for the specific task of estimating overannotation, as compared against NCBI’s BLASTP+. Since SwissProt contains redundant sequences, we also tested whether we could reduce the database, by eliminating nearly identical sequences, without losing information towards estimating overannotation.
Data and methods
The version of the SwissProt database [8, 9] available by early December 2013 contained a total of 540,261 protein sequences. The quality of these sequences is annotated in a five-level hierarchy. We wrote a program to remove any protein sequence with qualities 4 (Predicted) and 5 (Uncertain), leaving 522,651 sequences. The same program reduced the database by keeping only one example of any identical protein sequences, taking the above 522,651 to 438,166 non-identical sequences, thus reducing the number of sequences in the database by approx. 16%. The UCLUST function of USEARCH7.0.959 (32 bit-compiled, which is the free version for academic and non-profit institutions) was used to produce SwissProt databases with only one representative for very similar sequences by clustering at 95, 90, 85, 80, 75 and 70% identity thresholds.
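The filtering and de-duplication steps described above can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes the database is read as FASTA with UniProt-style headers carrying a `PE=` (protein existence) field, and the function names and toy records are hypothetical.

```python
import re

# Protein-existence levels to discard: 4 = Predicted, 5 = Uncertain.
EXCLUDED_LEVELS = {4, 5}
# Assumption: each header carries a UniProt-style "PE=<digit>" field.
PE_PATTERN = re.compile(r"\bPE=(\d)\b")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_database(records):
    """Drop PE 4/5 entries, then keep the first copy of each identical sequence."""
    seen = set()
    kept = []
    for header, sequence in records:
        match = PE_PATTERN.search(header)
        # Assumption: entries without a PE field are kept.
        if match and int(match.group(1)) in EXCLUDED_LEVELS:
            continue
        if sequence in seen:
            continue
        seen.add(sequence)
        kept.append((header, sequence))
    return kept


# Toy records (invented for illustration only).
toy_fasta = [
    ">sp|X00001|A_TOY Protein A OS=Toy organism PE=1 SV=1",
    "MKTAYIAK",
    ">sp|X00002|B_TOY Protein B OS=Toy organism PE=4 SV=1",
    "MKVLLAGG",
    ">sp|X00003|C_TOY Protein C OS=Toy organism PE=2 SV=1",
    "MKTAYIAK",
]
cleaned = clean_database(read_fasta(toy_fasta))
```

In the toy input, the second record is dropped for its existence level and the third because its sequence duplicates the first. Clustering at lower identity thresholds is a separate step that the text assigns to UCLUST and is not reproduced here.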
The version of NCBI’s BLAST (BLAST+) was 2.2.28+ (64 bit-compiled), LAST version was 392 (64 bit-compiled), and BLAT was version 32 (64 bit-compiled). The version of UBLAST was the so-named function implemented under USEARCH7.0.959 (kindly provided by the author as a 32-bit precompiled binary). Both BLAST+ and UBLAST were run with default parameters, except for an E-value threshold of 1e-6. BLAT was run with default parameters. The first experiments, those used to compare processing speeds, were run on a late 2012 iMac. This computer was not running any other process during these experiments. Calculations for all 2700 genomes using BLAST+ were run on computer clusters kindly provided by SHARCNET. All the genome-to-SwissProt comparisons using the faster programs were run on the late 2012 iMac.
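As an illustration of the settings described above, the sketch below assembles command lines for BLAST+ (`blastp`) and UBLAST (`usearch -ublast`) with the E-value threshold of 1e-6. The file names are hypothetical, and the tabular output options (`-outfmt 6`, `-blast6out`) are our assumption; the text does not state which output format was used.

```python
def blastp_command(query, database, output, evalue="1e-6"):
    """Build a blastp call: defaults except for the E-value threshold."""
    return [
        "blastp",
        "-query", query,
        "-db", database,
        "-evalue", evalue,
        "-outfmt", "6",      # tabular output (assumed, not stated in the text)
        "-out", output,
    ]


def ublast_command(query, database, output, evalue="1e-6"):
    """Build a usearch -ublast call with the same E-value threshold."""
    return [
        "usearch",
        "-ublast", query,
        "-db", database,
        "-evalue", evalue,
        "-blast6out", output,  # tabular output (assumed, not stated in the text)
    ]


# Hypothetical file names, for illustration only.
blast_cmd = blastp_command("genome.faa", "swissprot_clean", "genome.blastp.tsv")
ublast_cmd = ublast_command("genome.faa", "swissprot_clean.udb", "genome.ublast.tsv")
print(" ".join(blast_cmd))
print(" ".join(ublast_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the corresponding program and a formatted database are available.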
Overannotation was calculated using the SwissProt method described by Skovgaard et al. Briefly, the method estimates the number of genes that should be annotated in a genome by calculating the proportion of genes coding for proteins at least 200 amino-acid residues long (deemed as true genes) that match proteins in the SwissProt database (large SP-matching genes), and the proportion of small annotated genes, those that would code for proteins fewer than 200 amino-acid residues long, that also match SwissProt proteins (small SP-matching genes). The proportions are expected to be very similar if there is no overannotation. The lower the proportion of small SP-matching genes compared to that of large SP-matching genes, the higher the overannotation.
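The calculation can be sketched as below. The text gives the two proportions but not a closed formula, so the final estimate here is one reading of the method and should be treated as an assumption: true small genes are taken to match SwissProt at the same rate as large genes, and overannotation is reported as the fraction of annotated genes in excess of the estimated gene count. All names and the toy data are hypothetical.

```python
def overannotation(protein_lengths, matched_ids, min_large=200):
    """Estimate overannotation from protein lengths and SwissProt matches.

    protein_lengths: dict mapping protein id -> length in residues.
    matched_ids: set of protein ids with at least one SwissProt match.
    """
    large = [pid for pid, length in protein_lengths.items() if length >= min_large]
    small = [pid for pid, length in protein_lengths.items() if length < min_large]
    large_hits = sum(1 for pid in large if pid in matched_ids)
    small_hits = sum(1 for pid in small if pid in matched_ids)
    if large_hits == 0:
        raise ValueError("no large proteins match the database; cannot estimate")

    p_large = large_hits / len(large)
    p_small = small_hits / len(small) if small else 0.0

    # Assumption: true small genes match at the same rate as large genes,
    # so the expected number of true small genes is small_hits / p_large.
    expected_small = small_hits / p_large
    estimated_genes = len(large) + expected_small
    annotated_genes = len(large) + len(small)
    return {
        "p_large": p_large,
        "p_small": p_small,
        "estimated_genes": estimated_genes,
        "overannotation": (annotated_genes - estimated_genes) / annotated_genes,
    }


# Toy genome (invented): four large and four small annotated proteins.
toy_lengths = {"L1": 350, "L2": 420, "L3": 210, "L4": 900,
               "S1": 80, "S2": 120, "S3": 150, "S4": 60}
toy_matches = {"L1", "L2", "L3", "S1"}
result = overannotation(toy_lengths, toy_matches)
```

With equal proportions of small and large SP-matching genes, the estimated gene count equals the annotated count and the overannotation is zero, which agrees with the expectation stated above.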
As mentioned, filtering out protein sequences labelled either “Predicted” or “Uncertain” from the SwissProt database left 522,651 sequences, while eliminating identical sequences left 438,166 sequences. Further clustering sequences using USEARCH’s UCLUST function left 339,818 sequences at 95%, 299,959 at 90%, 268,285 at 85%, 239,682 at 80%, 214,044 at 75% and 190,445 at 70% identity thresholds.
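The relative size of each reduced database follows directly from these counts. The short sketch below computes the percentage reduction of each version relative to the filtered database of 522,651 sequences; the counts come from the text, while the labels and function name are ours.

```python
FILTERED = 522_651  # sequences left after removing "Predicted" and "Uncertain"

# Sequence counts reported in the text for each reduced database.
reduced_counts = {
    "non-identical": 438_166,
    "95%": 339_818,
    "90%": 299_959,
    "85%": 268_285,
    "80%": 239_682,
    "75%": 214_044,
    "70%": 190_445,
}


def percent_reduction(before, after):
    """Percentage of sequences removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    label: percent_reduction(FILTERED, count)
    for label, count in reduced_counts.items()
}
for label, value in reductions.items():
    print(f"{label}: {value:.1f}% fewer sequences than the filtered database")
```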
Genomes used in the ten-genomes experiment
Organism (NCBI’s UID)
Streptococcus pneumoniae 670-6B (52533)
Bacillus cereus ATCC 10987 (57673)
Bacillus subtilis 168 (57675)
Escherichia coli K-12 MG1655 (57779)
Burkholderia pseudomallei 1710b (58391)
Burkholderia cenocepacia MC0-3 (58769)
Anaeromyxobacter sp. K (58953)
Methylobacterium nodulans ORS2060 (59023)
Coprothermobacter proteolyticus DSM5265 (59253)
Lactobacillus salivarius CECT5713 (162005)
Mycobacterium abscessus GO06 (170732)
Given the results of the experiments above, we further tested the difference in overannotation estimates for all the remaining prokaryotic genomes available in our database with all four programs (BLAST+, UBLAST, LAST and BLAT). We did not calculate time differences because BLAST+ would take too long to run on our machines. BLAST+ experiments were run on computer clusters. Since the time saved using reduced SwissProt databases was minimal, we made all of these comparisons against the clean SwissProt database with 438,166 non-identical sequences.
The Shared Hierarchical Academic Research Computing Network (SHARCNET). Work supported with WLU funds and Discovery Grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). The costs of publication have been covered by Wilfrid Laurier University.
- Ussery DW, Hallin PF: Genome update: annotation quality in sequenced microbial genomes. Microbiology. 2004, 150 (Pt 7): 2015-2017.
- Moreno-Hagelsieb G: Operons across prokaryotes: genomic analyses and predictions 300+ genomes later. Curr Genomics. 2006, 7: 163-170. 10.2174/138920206777780247.
- Moreno-Hagelsieb G, Janga SC: Operons and the effect of genome redundancy in deciphering functional relationships using phylogenetic profiles. Proteins. 2008, 70 (2): 344-352.
- Ely B, Scott LE: Correction of the Caulobacter crescentus NA1000 genome annotation. PLoS ONE. 2014, 9 (3): e91668. 10.1371/journal.pone.0091668.
- Samayoa J, Yildiz FH, Karplus K: Identification of prokaryotic small proteins using a comparative genomic approach. Bioinformatics (Oxford, England). 2011, 27 (13): 1765-1771. 10.1093/bioinformatics/btr275.
- Klassen JL, Currie CR: ORFcor: identifying and accommodating ORF prediction inconsistencies for phylogenetic analysis. PLoS ONE. 2013, 8 (3): e58387. 10.1371/journal.pone.0058387.
- Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 2001, 17 (8): 425-428. 10.1016/S0168-9525(01)02372-1.
- Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 1991, 19 Suppl: 2247-2249.
- UniProt Consortium: Update on activities at the universal protein resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41 (D1): D43-D47.
- Lagesen K, Ussery DW, Wassenaar TM: Genome update: the 1000th genome–a cautionary tale. Microbiology. 2010, 156 (Pt 3): 603-608.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10: 421. 10.1186/1471-2105-10-421.
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC: Adaptive seeds tame genomic sequence comparison. Genome Res. 2011, 21 (3): 487-493. 10.1101/gr.113985.110.
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26 (19): 2460-2461. 10.1093/bioinformatics/btq461.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35 (Database issue): D61-D65.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.