BIGrat: a repeat resolver for pyrosequencing-based re-sequencing with Newbler
© Zhang et al.; licensee BioMed Central Ltd. 2012
Received: 6 February 2012
Accepted: 25 September 2012
Published: 15 October 2012
As more and more reference genome sequences are assembled, it becomes practical to assemble individual genomes from large amounts of raw read data based on a reference sequence. However, most available assembly tools are designed for de-novo genome assembly. One commercial toolbox (Newbler) has been developed for re-sequencing projects based on the Roche 454 sequencing platform, but it does not assemble genomes with large repeat regions well.
We developed a new sequence assembly tool (BIGrat, Beijing Institute of Genomics Re-Assembly Tool) for pyrosequencing-based re-sequencing projects, such as those using data generated from the Roche 454 and IonTorrent platforms. BIGrat improves the output of Newbler when evaluated on genome assemblies including chloroplast, mitochondrial, bacterial, and plant nuclear genomes.
We present a novel sequence assembly tool, BIGrat, for pyrosequencing-based re-sequencing projects, which can easily be integrated into Newbler pipelines for next-generation sequencing assembly and analysis.
With the efficient application of next-generation sequencing technologies to genome sequencing, reference genomes of representative and important species across a broad spectrum of organisms have been acquired, are being sequenced, or are being re-sequenced. It is therefore important that tools for assembling re-sequenced genomes from high-throughput data be readily available and specifically tuned to particular data types, such as those from ligase-based or polymerase-based protocols. Most currently available assembly tools, such as Velvet, have been designed for de-novo genome assembly. Recently, several new tools have been under development for re-sequencing projects. For example, LOCAS is designed for low-coverage assembly of eukaryotic genomes. A commercial toolbox (Newbler) developed for the Roche 454 sequencing platform is designed to assemble both de-novo and re-sequencing data. Here, we report a homology-guided method, implemented as a new re-sequencing assembly tool named BIGrat, and test results showing that it improves the output of the commercial tool Newbler. We believe that BIGrat will be widely used and integrated into the pipelines of next-generation sequencing projects.
The test datasets
Data for assembling rice chloroplast (cp), mitochondrial (mt), and nuclear genomes are all from a genome re-sequencing project for a rice cultivar PA64S (Oryza sativa L.). Data for bacterial genome assembly are from Acinetobacter baumannii MDR-ZJ06.
First, we use Newbler to map the raw data to the reference genome. The mapping result includes a file named “454AllContigs.fna”, which contains the assembled contigs. To keep the large, well-assembled contigs, which contain fewer repeat sequences than the rest, we filter out the contigs smaller than a gap size (such as 1 kb) but record the coordinates of those contigs as repeats in the reference genome. The mapping result also includes a file named “454PairAlign.txt”, which lists all the mapped reads and their positions in the reference genome. Second, we collect all the reads belonging to each repeat in the reference genome and re-assemble each repeat separately to obtain new contigs. Normally, the new contigs are better than the filtered ones and span a complete repeat region. Last, we combine the initial well-assembled contigs with the new contigs from the repeats. This is done using the raw reads aligned to each end of those contigs: we find the overlaps at the contig ends and construct consensus sequences as the final contigs.
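The three steps above can be illustrated with a minimal Python sketch. It is not the BIGrat implementation (which is written in Perl) and does not parse the Newbler output files; it assumes the contigs and mapped reads have already been loaded into memory together with their reference coordinates. The names `Contig`, `MappedRead`, `split_contigs`, `reads_per_repeat`, and `merge_by_end_overlap` are hypothetical, and the exact suffix/prefix match used for merging is a simplification of the read-supported consensus described in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Region = Tuple[int, int]  # (start, end) on the reference, inclusive


@dataclass
class Contig:
    name: str
    ref_start: int  # assumed to be known from the mapping result
    ref_end: int
    seq: str


@dataclass
class MappedRead:
    name: str
    ref_start: int
    ref_end: int


def split_contigs(contigs: List[Contig], gap_size: int = 1000):
    """Step 1: keep contigs of at least gap_size; record the rest as repeats."""
    kept: List[Contig] = []
    repeats: List[Region] = []
    for c in contigs:
        if len(c.seq) >= gap_size:
            kept.append(c)
        else:
            repeats.append((c.ref_start, c.ref_end))
    return kept, repeats


def reads_per_repeat(reads: List[MappedRead],
                     repeats: List[Region]) -> Dict[Region, List[str]]:
    """Step 2: collect the reads that overlap each repeat region."""
    bins: Dict[Region, List[str]] = {region: [] for region in repeats}
    for r in reads:
        for start, end in repeats:
            if r.ref_start <= end and r.ref_end >= start:
                bins[(start, end)].append(r.name)
    return bins


def merge_by_end_overlap(left: str, right: str,
                         min_overlap: int = 20) -> Optional[str]:
    """Step 3 (simplified): join two contigs on their longest exact
    suffix/prefix overlap; return None if no overlap is long enough."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None
```

In this sketch, each list of read names returned by `reads_per_repeat` would be handed to the assembler for a separate re-assembly of that repeat, and the resulting contigs would then be joined to their neighbours with `merge_by_end_overlap`.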
Results and discussion
Program comparison and assessment
Table: The performance of Newbler and Newbler-BIGrat in assembling different genomes. Columns: genome size (bp), contig length (bp), contig NG50, contig LG50, gap-filling number, gap-filling length. Rows: rice PA64S nuclear, rice PA64S mt, rice PA64S cp.
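The contig NG50 and LG50 metrics named in the table can be computed as in the sketch below. It assumes the commonly used definitions, which may differ in detail from the authors' own: NG50 is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the genome size, and LG50 is the number of contigs needed to reach that point. The function name `ng50_lg50` is hypothetical.

```python
from typing import List, Optional, Tuple


def ng50_lg50(contig_lengths: List[int],
              genome_size: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (NG50, LG50), or (None, None) if the contigs together
    cover less than half of the genome size."""
    half = genome_size / 2
    total = 0
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        total += length
        if total >= half:
            return length, count
    return None, None
```

Because both values are measured against the genome size rather than the total assembly length, they can be compared directly between the Newbler and Newbler-BIGrat assemblies of the same genome.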
BIGrat separates repeat regions in the reference sequence, iteratively fills the gaps caused by the repeats, and finally assembles the sequence to completion. The main parameter is the gap size, which sets the size threshold for the repeat regions to be reassembled. We tested this parameter from 30 bp to 10,000 bp on PA64S chromosome 1. The results showed that 500 bp is an optimal gap size for BIGrat assembly (Additional file 3: Figure S3). The gap size can also be chosen based on the sequencing read length: since pyrosequencing read lengths are ~500 bp for Roche 454 and ~200 bp for IonTorrent, most repeats smaller than 500 bp or 200 bp, respectively, can be assembled from the sequencing reads alone. As the gap size grows, BIGrat’s running time also increases. For example, the running times are 54 min, 102 min, and 126 min for gap sizes of 30 bp, 500 bp, and 10,000 bp, respectively.
We have presented an informatics tool, BIGrat (Additional file 4), that improves genome assemblies for pyrosequencing-based re-sequencing projects, and showed that it works as an add-on tool to Newbler. BIGrat can easily be integrated into Newbler pipelines for next-generation sequencing assembly and analysis. Because BIGrat is currently limited to pyrosequencing data and the Newbler software, our next step is to update it to improve assembly results from all sequencing platforms.
Availability and requirements
Project name: BIGrat
Project home page: http://sourceforge.net/projects/bigrat/
Operating system(s): Linux Platform
Programming language: Perl
Other requirements: Newbler (version > 2.3)
License: GNU General Public License
Any restrictions to use by non-academics: -
The study was supported by grants from the Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-EW-R-01-04), the Natural Science Foundation of China (90919024 and 30900831), and the National Basic Research Program (973 Program) from the Ministry of Science and Technology of the People’s Republic of China (2011CB944100).
- Chen Hua L, Chun Yuan L, Chuan Yi T: A re-sequencing tool for high mismatch-tolerant short read alignment based on Burrows-Wheeler Transform. Bioinformatics and Biomedicine Workshops (BIBMW). 2010, Hsinchu, Taiwan: IEEE International Conference, 549-554.
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.
- Klein JD, Ossowski S, Schneeberger K, Weigel D, Huson DH: LOCAS–a low coverage assembly tool for resequencing projects. PLoS One. 2011, 6 (8): e23455. 10.1371/journal.pone.0023455.
- Tian XJ, Zheng J, Hu SN, Yu J: The rice mitochondrial genomes and their variations. Plant Physiol. 2006, 140 (2): 401-410. 10.1104/pp.105.070060.
- Zhou H, Zhang T, Yu D, Pi B, Yang Q, Zhou J, Hu S, Yu Y: Genomic analysis of the multidrug-resistant Acinetobacter baumannii strain MDR-ZJ06 widely spread in China. Antimicrob Agents Chemother. 2011, 55 (10): 4506-4512. 10.1128/AAC.01134-10.
- Tsai IJ, Otto TD, Berriman M: Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 2010, 11 (4): R41. 10.1186/gb-2010-11-4-r41.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- Zhang T, Zhang X, Hu S, Yu J: An efficient procedure for plant organellar genome assembly, based on whole genome data from the 454 GS FLX sequencing platform. Plant Methods. 2011, 7 (1): 38. 10.1186/1746-4811-7-38.
- Tang J, Xia H, Cao M, Zhang X, Zeng W, Hu S, Tong W, Wang J, Wang J, Yu J, et al: A comparison of rice chloroplast genomes. Plant Physiol. 2004, 135 (1): 412-420. 10.1104/pp.103.031245.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.