FGAP: an automated gap closing tool
- Vitor C Piro†1,
- Helisson Faoro2,
- Vinicius A Weiss2,
- Maria BR Steffens2,
- Fabio O Pedrosa2,
- Emanuel M Souza2 and
- Roberto T Raittz†1
© Piro et al.; licensee BioMed Central Ltd. 2014
Received: 27 February 2014
Accepted: 9 June 2014
Published: 18 June 2014
The rapid fall in DNA sequencing costs has allowed genome data to accumulate quickly. However, obtaining complete genome sequences is still time-consuming and labor-intensive. In addition, data produced by different sequencing technologies or alternative assemblies remain underexplored as a means of improving incomplete genome assemblies.
We have developed FGAP, a tool for closing gaps of draft genome sequences that takes advantage of different datasets. FGAP uses BLAST to align multiple contigs against a draft genome assembly aiming to find sequences that overlap gaps. The algorithm selects the best sequence to fill and eliminate the gap.
FGAP reduced the number of gaps by 78% in an E. coli draft genome assembly using two different sequencing technologies, Illumina and 454. Using PacBio long reads, 98% of gaps were solved. In human chromosome 14 assemblies, FGAP reduced the number of gaps by 35%. All the inserted sequences were validated with a reference genome using QUAST. The source code and a web tool are available at http://www.bioinfo.ufpr.br/fgap/.
Low-cost and high-throughput sequencing technologies have exponentially increased the amount of sequence data available. The development of these technologies, combined with advances in computer algorithms, has provided a large number of sequenced genomes. However, more than a third of the genome sequences available in public databases remain as drafts, and many other projects are still incomplete because of the limitations of short-read second-generation sequencing and assembly processes. Sequencing errors, regions of high complexity and repeated sequences are the most common issues. Single-molecule third-generation sequencing technology solved some of these limitations with longer reads, but introduced others, such as a high error rate and higher cost. Thus, there is still a dependence on second-generation sequencing platforms. The vast majority of genomes available today were sequenced using short reads, and their assemblies can still be improved.
Development of the finishing process, which comprises error correction, scaffolding and gap closing, has not kept pace with sequencing technologies. One strategy to reduce the number of gaps is to obtain data from different sequencing technologies, aiming to reduce errors, compensate for bias and improve the quality and completeness of the genome sequence. Another approach is to obtain alternative assemblies using the same raw data, but with different assemblers and parameters. These strategies usually generate many datasets, which can be combined to improve the genome. Some methods, such as GapCloser (a module of SOAPdenovo2), GapFiller by Boetzer and Pirovano (not to be confused with the GapFiller of Nadalin et al.), IMAGE, FinIS and CloG, were designed to reduce the gaps in genome assemblies using different approaches.
We propose open-source software called FGAP that aims to improve genome sequences by merging alternative assemblies or incorporating alternative data, analyzing the gap region and indicating the best sequence to close the gap.
FGAP searches for sequences that overlap contig ends in proposed scaffolds. It needs at least two Fasta files to run: the draft genome assembly and one or more contig datasets (alternative assemblies, long reads, contigs). The algorithm aligns contig ends from the draft assembly against datasets, selects the alignments with given parameters, and chooses the best sequence to eliminate the gap.
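The first step described here, locating gaps and the contig ends that flank them, can be sketched in Python as follows. This is an illustration only, not FGAP's Matlab/Octave code: it assumes that gaps are represented as runs of N in the scaffold sequence, and the names `Gap` and `find_gaps`, as well as the `end_length` default of 300, are hypothetical choices made for the example.

```python
import re
from typing import List, NamedTuple


class Gap(NamedTuple):
    start: int      # index of the first N of the gap
    end: int        # index one past the last N of the gap
    left_end: str   # contig end immediately before the gap
    right_end: str  # contig end immediately after the gap


def find_gaps(scaffold: str, end_length: int = 300) -> List[Gap]:
    """Locate runs of N and return each gap with its flanking contig ends.

    Flanks are clipped so that they never run into a neighboring gap.
    """
    runs = [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", scaffold)]
    gaps = []
    for i, (start, end) in enumerate(runs):
        prev_end = runs[i - 1][1] if i > 0 else 0
        next_start = runs[i + 1][0] if i + 1 < len(runs) else len(scaffold)
        left = scaffold[max(prev_end, start - end_length):start]
        right = scaffold[end:min(next_start, end + end_length)]
        gaps.append(Gap(start, end, left, right))
    return gaps


if __name__ == "__main__":
    demo = "ACGTACGT" + "NNNN" + "TTGGCC" + "NN" + "AAAA"
    for gap in find_gaps(demo, end_length=4):
        print(gap)
```

The flanking sequences returned by such a function would be the queries aligned against the contig datasets; how the best alignment is then chosen depends on the parameters given to FGAP and is not reproduced here.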
[Table: E. coli assemblies, including the Illumina(pe) + 454(se) draft; table contents not recovered]
[Table: Human chromosome assemblies; table contents not recovered]
To validate closed gaps, we compared the sequences inserted into all closed gaps, together with their flanking regions, against the reference genomes of E. coli K-12 [GenBank:NC_000913.2] with 4,641,652 bp and human chromosome 14 [GenBank:NC_000014.9] with 107,043,718 bp. A gap is considered correctly closed when: 1) the flanking regions align with the reference over at least 40% of their length (based on the contig end length chosen for FGAP), 2) the identity of the flanking regions and the inserted sequence is higher than a threshold (the same one defined for FGAP), and 3) the identity is greater than it was before gap closing (flanking regions without the insertion). The NUCmer algorithm was used to perform this validation.
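The three validation criteria can be expressed as a single predicate. The Python sketch below is an illustration under stated assumptions: the function name `is_correctly_closed` and its parameters are hypothetical, identities are taken as percentages, and the aligned lengths and identities are assumed to have been extracted beforehand from the NUCmer alignments.

```python
from typing import Sequence


def is_correctly_closed(
    flank_aligned_lengths: Sequence[int],
    end_length: int,
    identity_after: float,
    identity_before: float,
    min_identity: float,
    min_aligned_fraction: float = 0.40,
) -> bool:
    """Apply the three criteria for a correctly closed gap.

    flank_aligned_lengths: bases of each flanking region aligned to the reference
    end_length:            contig end length chosen for the gap-closing run
    identity_after:        identity (%) of flanks plus inserted sequence
    identity_before:       identity (%) of the flanks without the insertion
    min_identity:          identity threshold (%) used for the gap-closing run
    """
    # 1) each flank aligns over at least 40% of the contig end length
    flanks_ok = all(
        n >= min_aligned_fraction * end_length for n in flank_aligned_lengths
    )
    # 2) identity of flanks and insert is higher than the threshold
    identity_ok = identity_after > min_identity
    # 3) identity improved relative to the unclosed gap
    improved = identity_after > identity_before
    return flanks_ok and identity_ok and improved


if __name__ == "__main__":
    print(is_correctly_closed((150, 200), 300, 98.0, 90.0, 70.0))
    print(is_correctly_closed((100, 200), 300, 98.0, 90.0, 70.0))
```

In the second call the left flank aligns over only 100 of 300 bases, below the 40% requirement, so the gap would not count as correctly closed.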
We compared the results of FGAP with three standalone tools for gap closing: GapCloser, GapFiller, and IMAGE. These programs rely on the identification of paired-end or mate-pair reads that map at contig ends and extend them by performing local assemblies to close gaps. All available libraries for each organism (1 for E. coli and 3 for human chromosome 14) were used as input to these tools. Two other approaches could not be tested: the FinIS software relies on the graph generated by the assembler and does not support SOAPdenovo2 assemblies, whereas the CloG approach has not been implemented. Details of each program are in Additional file 1.
FGAP was developed in Matlab/Octave and runs from source code in either environment. It also runs as compiled code (which depends on the MCR) or through the World Wide Web (available at http://www.bioinfo.ufpr.br/fgap/) without requiring any license. It uses BLAST+ 2.2.28 or higher. The algorithm runs in multiple rounds, which is necessary to prevent overlap between gaps that are close to each other: closing one gap would otherwise modify the query sequence of the neighboring gap. The output consists of one Fasta file and one log file per round, plus a final statistics file. The log file contains the alignment information for both sides of each gap. The Fasta file contains the new sequence, incorporating the gap-filling sequences reported in the log file. Changes are incremental across the output Fasta files.
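To make the BLAST+ dependency concrete, the Python sketch below builds `makeblastdb` and `blastn` command lines and parses the standard tabular output (`-outfmt 6`). It is a sketch under assumptions, not FGAP's own invocation: the file names, the helper names and the `min_identity` default of 70 are hypothetical, and the exact options FGAP passes to BLAST+ are not stated in this text.

```python
from typing import Dict, List

# Standard column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def makeblastdb_cmd(dataset_fasta: str, db_name: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from a contig dataset."""
    return ["makeblastdb", "-in", dataset_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query_fasta: str, db_name: str, out_tsv: str) -> List[str]:
    """Command that aligns contig ends (query) against the dataset database."""
    return ["blastn", "-query", query_fasta, "-db", db_name,
            "-outfmt", "6", "-out", out_tsv]


def parse_hits(tsv_text: str, min_identity: float = 70.0) -> List[Dict[str, object]]:
    """Parse -outfmt 6 text and keep hits at or above min_identity (%)."""
    hits = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        hit: Dict[str, object] = dict(zip(OUTFMT6_COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        if hit["pident"] >= min_identity:
            hits.append(hit)
    return hits


if __name__ == "__main__":
    print(" ".join(makeblastdb_cmd("dataset.fasta", "dataset_db")))
    print(" ".join(blastn_cmd("contig_ends.fasta", "dataset_db", "hits.tsv")))
```

With BLAST+ installed, each command list could be passed to `subprocess.run(cmd, check=True)`; the sketch only prints them so that it runs without BLAST+ present.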
The number of gaps in the ordered scaffolds of the E. coli str. K-12 substr. MG1655 draft genome sequence dropped from 123 to 26, reducing the unknown regions by 78%. Furthermore, 96% (94/97) of the newly inserted sequences were in agreement with the reference E. coli K-12 genome sequence. Using only PacBio reads as the dataset with the same parameters, 121 out of 123 gaps were closed and all of them were validated with the reference. Assemblies of human chromosome 14 derived from two different programs were used to evaluate the performance of FGAP on a more complex genome. FGAP reduced the number of gaps by 35% (1527 gaps closed out of 4307) in this scenario.
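The percentages in this paragraph follow directly from the gap counts, and a short Python check makes the arithmetic explicit. The helper `percent` is illustrative only; the reported figures correspond to these ratios truncated to whole percentages.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# E. coli draft: 123 gaps before, 26 after gap closing.
ecoli_closed = 123 - 26

if __name__ == "__main__":
    print(f"E. coli gaps closed: {ecoli_closed} "
          f"({percent(ecoli_closed, 123):.1f}% of 123)")
    print(f"Validated insertions: {percent(94, ecoli_closed):.1f}% (94/{ecoli_closed})")
    print(f"PacBio-only run: {percent(121, 123):.1f}% (121/123)")
    print(f"Human chromosome 14: {percent(1527, 4307):.1f}% (1527/4307)")
```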
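Software comparison in E. coli assembly (table flattened in extraction; column headers were not recovered apart from the label "FGAP + Long*", and values are listed in their original order)
- Nº of gaps: values not recovered
- Nº contigs (≥ 1000 bp): values not recovered
- Complete + partial genes: 4325 + 44; 4377 + 34; 4388 + 27; 4375 + 35; 4367 + 35; 4389 + 67
- Inserted bases (bp): values not recovered
- Running times (row label not recovered): 2 m 55 s; 1 m 19 s; 19 m 23 s; 2 h 46 m 29 s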
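Software comparison in human chromosome 14 assembly (table flattened in extraction; column headers were not recovered, and values are listed in their original order)
- Nº of gaps: values not recovered
- Nº contigs (≥ 1000 bp): values not recovered
- Complete + partial genes: 1064 + 497; 1141 + 423; 1121 + 448; 1093 + 468; 1078 + 488
- Inserted bases (bp): values not recovered
- Running times (row label not recovered): 3 h 11 m; 1 h 10 m; 8 h 09 m; 50 h 45 m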
FGAP and GapCloser performed similarly when the human chromosome 14 assemblies were used (Table 4). However, FGAP was better in terms of local misassemblies, N50 size and identified genes. In this evaluation, GapCloser achieved the lowest running time but had the highest number of local misassemblies. GapFiller and IMAGE closed the fewest gaps. Again, IMAGE performed poorly under our conditions, taking more than 50 hours to run.
In both cases, the number of bases inserted by each program varied, probably because of differences in the extent of the gaps closed by each program; it was also influenced by errors introduced by the different methods. In particular, IMAGE increased the genome size substantially more than the other tools and also had the highest error rate. All comparisons were made with the scaffolds broken down into contigs.
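Since the comparisons were made on scaffolds broken down into contigs and N50 is among the metrics discussed, a minimal Python sketch of both steps may help. It assumes that gaps are runs of N; `split_scaffold` and `n50` are hypothetical helper names, and this is not the implementation of the evaluation tool used in the paper.

```python
import re
from typing import Iterable, List


def split_scaffold(scaffold: str) -> List[str]:
    """Break a scaffold into contigs at every run of N."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs given


if __name__ == "__main__":
    contigs = split_scaffold("ACGTACNNNNACGTNNAC")
    print([len(c) for c in contigs], n50(len(c) for c in contigs))
```

Closing a gap merges two contigs into one longer contig, which is why successful gap closing tends to raise N50 while lowering the contig count.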
We developed new software for gap filling that can be helpful for genome sequence finishing. FGAP automatically integrates various datasets into a draft genome, an approach that differs from the extension of contig ends based on paired-read information. The flexibility of input data is beneficial, since FGAP can use different sequencing technologies or different assemblies and does not rely on paired-end or mate-pair data. Programs such as GapCloser, which was designed to work with Illumina data only, or FinIS, which requires a specific assembler, have more restricted use.
Compared to available tools, FGAP is the only one with self-explanatory, human-readable and complete output that shows every sequence inserted in each gap, together with its relative position and alignment. This output can be useful for further analysis. Furthermore, it was the fastest program tested on small genome sequences and can run on a notebook computer. FGAP is the only tool tested that supports long reads from third-generation sequencing. It is also available on the web, which makes the program even easier to access. Only FGAP, GapCloser and IMAGE are freely available.
We show that FGAP is an efficient tool to find regions to fill gaps of draft genome sequences. The tool demands low computational resources, its results can be easily analyzed from the generated output, and it can be used for small or large genome assemblies. FGAP can effectively reduce the effort needed to improve draft genome sequences in a few steps, minimizing the number of unknown regions left for human evaluation and reducing the need to obtain new data. In addition, FGAP has been successfully used to close gaps in draft sequences of several bacterial and fungal genome projects.
Availability and requirements
Project name: FGAP;
Project home page: http://sourceforge.net/p/fgap/;
Operating system(s): Platform independent;
Programming language: Matlab (R2012a) or Octave (3.6.2);
Other requirements: BLAST+ 2.2.28 or higher (blastn and makeblastdb) and MCR - Matlab Compiler Runtime v7.17 (only for compiled version);
License: The MIT License (MIT)
We thank R.A. Vialle, N.A.R. Coimbra, T.M. Batista and D. Guizelini for technical assistance, R.B. da Silva for review and Dr. M. G. Yates for kindly correcting the paper.
National Institute of Science and Technologies of Biological Nitrogen Fixation, Fundação Araucária, CAPES, CNPq.
- Pagani I, Liolios K, Jansson J, Chen I-MA, Smirnova T, Nosrat B, Markowitz VM, Kyrpides NC: The Genomes OnLine Database (GOLD) v.4. Nucleic Acids Res. 2012, 40 (Database issue): 571-579.
- Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED, Phillippy AM: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012, 30 (7): 693-700. 10.1038/nbt.2280.
- Bashir A, Klammer AA, Robins WP, Chin C-S, Webster D, Paxinos E, Hsu D, Ashby M, Wang S, Peluso P, Sebra R, Sorenson J, Bullard J, Yen J, Valdovino M, Mollova E, Luong K, Lin S, Lamay B, Joshi A, Rowe L, Frace M, Tarr CL, Turnsek M, Davis BM, Kasarskis A, Mekalanos JJ, Waldor MK, Schadt EE: A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol. 2012, 30 (7): 701-707. 10.1038/nbt.2288.
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012, 22 (3): 557-567. 10.1101/gr.131383.111.
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu S-M, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam T-W, Wang J: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012, 1 (1): 18. 10.1186/2047-217X-1-18.
- Boetzer M, Pirovano W: Toward almost closed genomes with GapFiller. Genome Biol. 2012, 13 (6): 56. 10.1186/gb-2012-13-6-r56.
- Nadalin F, Vezzi F, Policriti A: GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics. 2012, 13 Suppl 1 (Suppl 14): 8.
- Tsai IJ, Otto TD, Berriman M: Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol. 2010, 11 (4): 41. 10.1186/gb-2010-11-4-r41.
- Gao S, Bertrand D, Nagarajan N: FinIS: Improved in silico finishing using an exact quadratic programming formulation. Lecture Notes Comput Sci. 2012, 7534: 314-325. 10.1007/978-3-642-33122-0_25.
- Yang X, Medvin D, Narasimhan G, Yoder-Himes D, Lory S: CloG: A pipeline for closing gaps in a draft assembly using short reads. 2011 IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS). 2011, Washington, DC, USA: IEEE Computer Society, 202-207.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Nat Acad Sci USA. 2011, 108 (4): 1513-1518. 10.1073/pnas.1017351108.
- Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics (Oxford, England). 2008, 24 (24): 2818-2824. 10.1093/bioinformatics/btn548.
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5 (2): 12. 10.1186/gb-2004-5-2-r12.
- Piro VC: FGAP an automated gap closing tool. [http://www.bioinfo.ufpr.br/fgap]
- Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: Quality assessment tool for genome assemblies. Bioinformatics (Oxford, England). 2013, 29 (8): 1072-1075. 10.1093/bioinformatics/btt086.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.