- Technical Note
BIGrat: a repeat resolver for pyrosequencing-based re-sequencing with Newbler
BMC Research Notes volume 5, Article number: 567 (2012)
As more and more reference genome sequences are assembled, it becomes practical to assemble individual genomes from large amounts of raw read data based on a reference sequence. However, most available assembly tools are designed for de novo genome assembly. One commercial toolbox, Newbler, was developed for re-sequencing projects based on the Roche 454 sequencing platform, but it cannot assemble genomes with large repeat regions well.
We developed a new sequence assembly tool (BIGrat, Beijing Institute of Genomics Re-Assembly Tool) for pyrosequencing-based re-sequencing projects, such as those using data generated on the Roche 454 and IonTorrent platforms. BIGrat improves the output of Newbler when evaluated on genome assemblies including chloroplast, mitochondrial, bacterial, and plant nuclear genomes.
We present BIGrat, a novel sequence assembly tool for pyrosequencing-based re-sequencing projects, which can easily be integrated into Newbler pipelines for next-generation sequencing assembly and analysis.
With the efficient application of next-generation sequencing technologies to genome sequencing, reference genomes of representative and important species across a broad spectrum of organisms have been acquired, are being sequenced, or are being re-sequenced. It is therefore important that tools for assembling re-sequenced genomes from high-throughput data be readily available and specifically tuned to particular data types, such as those from ligase-based or polymerase-based protocols. Most currently available assembly tools, such as Velvet, have been designed for de novo genome assembly. Recently, several new tools have been under development for re-sequencing projects. For example, LOCAS is designed for low-coverage assembly of eukaryotic genomes. A commercial toolbox developed for the Roche 454 sequencing platform is designed to assemble both de novo and re-sequencing data. Here, we report a homology-guided method implemented as a new re-sequencing assembly tool named BIGrat, together with test results showing that it improves the output of the commercial tool Newbler. We believe that BIGrat will be widely used and integrated into the pipelines of next-generation sequencing projects.
The test datasets
Data for assembling the rice chloroplast (cp), mitochondrial (mt), and nuclear genomes all come from a genome re-sequencing project for the rice cultivar PA64S (Oryza sativa L.). Data for the bacterial genome assembly are from Acinetobacter baumannii MDR-ZJ06.
BIGrat is based on the mapping result of Newbler and its mapping model. Newbler is not able to assemble repeat sequences in the reference genome correctly and produces many small contigs separated by repeat regions (Additional file 1: Figure S1), but the reads in each repeat region can be assembled separately to completion. Therefore, BIGrat separates the repeat regions with a fixed gap size and assembles every repeat region iteratively with the mapped reads (Figure 1). Such an iterative assembly method has been used in IMAGE and LOCAS.
First, we use Newbler to map the raw data to the reference genome. The mapping result includes a file named “454AllContigs.fna”, which contains the assembled contigs. To keep the large, well-assembled contigs, which contain fewer repeat sequences than the rest, we filter out the contigs smaller than a gap size (such as 1 kb) but record their coordinates as repeats in the reference genome. The mapping result also contains a file named “454PairAlign.txt”, which lists all the mapped reads and their positions in the reference genome. Second, we extract all the reads that belong to each repeat in the reference genome and re-assemble each repeat separately to obtain new contigs. Normally, the new contigs are better than the filtered ones and cover a complete repeat region. Last, we combine the initial well-assembled contigs with the new contigs from the repeats. This is done using the raw data aligned to each end of those contigs: we find the overlaps at the contig ends and construct consensus sequences as the final contigs.
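The three steps above can be illustrated with a minimal Python sketch. It is not BIGrat's actual code (BIGrat is written in Perl) and it does not parse the Newbler output files. The `Contig` record, the `(name, start, end)` read tuples, and the function names are all hypothetical, and the exact suffix/prefix overlap used in the last step is a simplification of the read-supported consensus described in the text.

```python
from dataclasses import dataclass

@dataclass
class Contig:
    name: str
    start: int  # 1-based start on the reference (hypothetical layout)
    end: int    # inclusive end on the reference

def split_contigs(contigs, gap_size):
    """Keep contigs of at least gap_size; record shorter ones as repeat intervals."""
    kept, repeats = [], []
    for c in contigs:
        if c.end - c.start + 1 >= gap_size:
            kept.append(c)
        else:
            repeats.append((c.start, c.end))
    return kept, repeats

def reads_per_repeat(read_hits, repeats):
    """Group read names by every repeat interval their mapped span overlaps."""
    groups = {r: [] for r in repeats}
    for name, r_start, r_end in read_hits:
        for (s, e) in repeats:
            if r_start <= e and r_end >= s:
                groups[(s, e)].append(name)
    return groups

def merge_by_overlap(left, right, min_overlap=20):
    """Join two sequences on their longest exact suffix/prefix overlap, else None."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None

# Toy input: c2 is shorter than the 1 kb gap size, so it is treated as a repeat.
contigs = [Contig("c1", 1, 5000), Contig("c2", 5001, 5400), Contig("c3", 5401, 9000)]
kept, repeats = split_contigs(contigs, gap_size=1000)
hits = [("r1", 4900, 5100), ("r2", 5200, 5300), ("r3", 8000, 8400)]
groups = reads_per_repeat(hits, repeats)
print([c.name for c in kept], repeats, groups)
print(merge_by_overlap("ACGTACGTTTGA", "TTGACCA", min_overlap=4))
```

In this toy input, reads `r1` and `r2` are assigned to the single repeat interval and would be handed to a separate re-assembly run, while `r3` stays with the retained contig `c3`.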
Results and discussion
Program comparison and assessment
To evaluate the performance of BIGrat, we compared it against Newbler, run with its default parameter settings, on four different genomes. In addition, we compared the assembled results with consensus sequences from BWA-SW/SAMtools. The four genomes come from re-sequencing projects carried out at the Beijing Institute of Genomics (BIG), and the assembly results are summarized in Table 1. In the PA64S nuclear genome assembly, BIGrat has a better NG50 (19,383 vs. 28,677 bp). BIGrat closed 32.4% of the gaps left by Newbler, with a total length of 8,267,167 bp, and the improvement appears in the contig building (Additional file 2: Figure S2). BIGrat also improved the output of Newbler in the rice organellar genome assemblies. The chloroplast genome has typical large repeats, and there are also some large repeats in the mitochondrial genome. To assess accuracy and reliability, we compared the BIGrat assemblies of the rice chloroplast and mitochondrial genomes with the results described in our earlier publications, which were based on data generated by the Sanger method [4, 9]. The consistency and colinearity between the results of the two methods are evident (Figures 2 and 3). We also tested BIGrat on several bacterial genome projects. For instance, for Acinetobacter baumannii MDR-ZJ06, BIGrat filled 12% more gaps (32,715 bp) than Newbler. Because repeat content varies among eukaryotic genomes, the effectiveness of BIGrat’s sequence assembly differs, as shown for the four representative genomes.
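For readers unfamiliar with the NG50 statistic used above, here is a minimal Python sketch of the usual definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the reference genome size. This differs from N50, which uses half of the total assembly length. The function name and the toy numbers are illustrative and are not taken from Table 1.

```python
def ng50(contig_lengths, genome_size):
    """Contig length at which the cumulative sum reaches half the genome size."""
    target = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0  # the assembly covers less than half of the genome

# Toy example: 150 bp of contigs against a 200 bp reference.
print(ng50([50, 40, 30, 20, 10], genome_size=200))  # prints 30
```

Because NG50 is normalized by the reference size rather than the assembly size, it allows two assemblies of the same genome to be compared even when they contain different total amounts of sequence.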
BIGrat separates repeat regions in the reference sequence, iteratively fills the gaps caused by the repeats, and assembles the sequence to completion at the end. The main parameter is the gap size, which determines how much of the sequence is treated as repeat regions and reassembled. We tested this parameter from 30 bp to 10,000 bp on PA64S chromosome 1. The results showed that 500 bp is an optimal gap size for BIGrat assembly (Additional file 3: Figure S3). The gap size can also be chosen based on the sequencing read length. Since the read lengths of the pyrosequencing platforms are ~500 bp for Roche 454 and ~200 bp for IonTorrent, most repeats shorter than the read length (500 bp or 200 bp, respectively) can be assembled from the sequencing reads alone. As the gap size grows, BIGrat’s running time also increases. For example, the running times are 54 min, 102 min, and 126 min for gap sizes of 30 bp, 500 bp, and 10,000 bp, respectively.
We also evaluated BIGrat’s performance at different levels of data coverage by randomly sampling the data at coverage levels from 1x to 20x, using the rice chloroplast and mitochondrial genomes as examples (Figure 4). For Newbler, increasing the data coverage provided little improvement to the assembly once coverage reached 10x, whereas BIGrat assembled the genomes completely as coverage increased; the chloroplast and mitochondrial genomes were assembled to completion at 10x and 15x coverage, respectively. The results also provide an initial estimate of the data coverage needed in genome re-sequencing projects for the two organellar genomes.
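A coverage titration of this kind can be sketched in Python as follows. The text does not describe the sampling procedure, so this sketch assumes reads are drawn at random, without replacement, until their summed length reaches the target coverage multiplied by the genome size. The function name, the fixed seed, and the toy read set are illustrative only.

```python
import random

def subsample_to_coverage(read_lengths, genome_size, coverage, seed=0):
    """Pick random reads until their total length reaches coverage * genome_size.

    Returns the chosen read indices and the coverage actually achieved.
    If the data set is too small, all reads are returned.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    order = list(range(len(read_lengths)))
    rng.shuffle(order)
    target = coverage * genome_size
    chosen, total = [], 0
    for i in order:
        if total >= target:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen, total / genome_size

# Toy data: 1,000 reads of 400 bp over a 10 kb genome (40x in total).
lengths = [400] * 1000
for cov in (1, 5, 10, 20):
    picked, achieved = subsample_to_coverage(lengths, 10_000, cov)
    print(cov, len(picked), achieved)
```

Each subsample would then be passed through the same mapping and assembly steps, so that completeness can be plotted against coverage.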
We have described an informatics tool, BIGrat (Additional file 4), that improves genome assemblies for pyrosequencing-based re-sequencing projects, and showed that it works as an add-on tool to Newbler. BIGrat is easy to integrate into Newbler pipelines for next-generation sequencing assembly and analysis. Because BIGrat is currently limited to pyrosequencing data and the Newbler software, we will update it in the next step to improve assembly results from all sequencing platforms.
Availability and requirements
Project name: BIGrat
Project home page: http://sourceforge.net/projects/bigrat/
Operating system(s): Linux Platform
Programming language: Perl
Other requirements: Newbler (version > 2.3)
License: GNU General Public License
Any restrictions to use by non-academics: -
1. Chen Hua L, Chun Yuan L, Chuan Yi T: A re-sequencing tool for high mismatch-tolerant short read alignment based on Burrows-Wheeler Transform. Bioinformatics and Biomedicine Workshops (BIBMW). 2010, Hsinchu, Taiwan: IEEE International Conference, 549-554.
2. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.
3. Klein JD, Ossowski S, Schneeberger K, Weigel D, Huson DH: LOCAS–a low coverage assembly tool for resequencing projects. PLoS One. 2011, 6 (8): e23455. 10.1371/journal.pone.0023455.
4. Tian XJ, Zheng J, Hu SN, Yu J: The rice mitochondrial genomes and their variations. Plant Physiol. 2006, 140 (2): 401-410. 10.1104/pp.105.070060.
5. Zhou H, Zhang T, Yu D, Pi B, Yang Q, Zhou J, Hu S, Yu Y: Genomic analysis of the multidrug-resistant Acinetobacter baumannii strain MDR-ZJ06 widely spread in China. Antimicrob Agents Chemother. 2011, 55 (10): 4506-4512. 10.1128/AAC.01134-10.
6. Tsai IJ, Otto TD, Berriman M: Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 2010, 11 (4): R41. 10.1186/gb-2010-11-4-r41.
7. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
8. Zhang T, Zhang X, Hu S, Yu J: An efficient procedure for plant organellar genome assembly, based on whole genome data from the 454 GS FLX sequencing platform. Plant Methods. 2011, 7 (1): 38. 10.1186/1746-4811-7-38.
9. Tang J, Xia H, Cao M, Zhang X, Zeng W, Hu S, Tong W, Wang J, Wang J, Yu J, et al: A comparison of rice chloroplast genomes. Plant Physiol. 2004, 135 (1): 412-420. 10.1104/pp.103.031245.
The study is supported by grants from Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-EW-R-01-04), Natural Science Foundation of China (90919024), Natural Science Foundation of China (30900831), and the National Basic Research Program (973 Program) from the Ministry of Science and Technology of the People’s Republic of China (2011CB944100).
The authors declare that they have no competing interests.
TZ has implemented the software and written the manuscript. YL, YC, XL and JY have helped design the study and draft the manuscript. All authors read and approved the final manuscript.