fastQ_brew: module for analysis, preprocessing, and reformatting of FASTQ sequence data
© The Author(s) 2017
Received: 19 January 2017
Accepted: 8 July 2017
Published: 12 July 2017
Next generation sequencing datasets are stored as FASTQ formatted files. To avoid downstream artefacts, it is critical to implement a robust preprocessing protocol that determines the integrity and quality of the FASTQ sequence data.
Here I describe fastQ_brew, a package that provides a suite of methods to evaluate sequence data in FASTQ format and efficiently implements a variety of manipulations to filter sequence data by size, quality and/or sequence. fastQ_brew allows for mismatch searches to adapter sequences, left and right end trimming, removal of duplicate reads, and removal of reads containing non-designated bases. fastQ_brew also returns summary statistics on the unfiltered and filtered FASTQ data, and offers FASTQ to FASTA conversion as well as FASTQ reverse complement and DNA to RNA manipulations.
fastQ_brew is open source and freely available to all users at the following webpage: https://github.com/dohalloran/fastQ_brew.
Keywords: FASTQ, NGS, Sequencing
FASTQ format has become the principal protocol for the exchange of DNA sequencing files [1]. The format is composed of a nucleotide sequence together with an ASCII character encoded quality score for each nucleotide. Each entry is four lines long. The first line starts with a ‘@’ character followed by an identifier. The second line is the nucleotide sequence. The third line starts with a ‘+’ character and is optionally followed by the same sequence identifier that was used on the first line. The fourth line lists the quality scores for each nucleotide in the second line. To evaluate the quality of a FASTQ dataset and to avoid downstream artefacts, it is imperative for the user to employ robust quality control and preprocessing steps prior to downstream FASTQ applications. Furthermore, FASTQ is now widely used in additional downstream applications and pipelines, and so diverse preprocessing tools are necessary to handle various FASTQ file manipulations [2, 3]. Here, I describe fastQ_brew, a robust package that performs quality control, reformatting, filtering, and trimming of FASTQ formatted sequence datasets.
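The four-line record layout described above can be illustrated with a minimal reader. This is a sketch in Python rather than the Perl code of fastQ_brew; the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes strict four-line records with no line wrapping.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ entry (hypothetical helper type, not part of fastQ_brew)."""
    identifier: str  # text after the leading '@' on line 1
    sequence: str    # nucleotide sequence from line 2
    quality: str     # ASCII-encoded quality string from line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' on third line of record {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in read_fastq(example):
        print(record.identifier, record.sequence, record.quality)
```

The length check on lines 2 and 4 reflects the requirement that every nucleotide has exactly one quality character.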
fastQ_brew was developed using Perl and successfully tested on Microsoft Windows 7 Enterprise ver.6.1, Linux Ubuntu 64-bit ver.16.04 LTS, and Linux Mint 18.1 Serena. fastQ_brew does not rely on any dependencies that are not currently part of the Perl Core Modules (http://perldoc.perl.org/index-modules-A.html), which makes fastQ_brew very straightforward to implement. fastQ_brew is composed of two separate packages: fastQ_brew.pm and fastQ_brew_Utilities.pm. fastQ_brew_Utilities.pm provides fastQ_brew.pm with access to various subroutines that are called to handle FASTQ manipulations and quality control. The fastQ_brew object is instantiated by calling the constructor subroutine “new”, which creates a ‘blessed’ object that begins gathering methods and properties by calling the load_fastQ_brew method. Once the object has been populated, the user can call run_fastQ_brew to begin processing the FASTQ data. Sample data are provided at the GitHub repo and directions for usage are described in the README.md file.
In the case of arguments 15–17 above, a new file will be generated in each case, whereas for all other options the user-supplied arguments will be chained together to return a single filtered file.
To evaluate more specific methods within fastQ_brew, the relationship between nucleotide position within a given read and the corresponding Phred quality score was determined (Fig. 1b). This analysis tested the trimming and Phred calculation methods within fastQ_brew. The Phred quality score is used as a metric to determine the quality of a given nucleotide’s identification within a read [4]. Phred quality scores are logarithmically related to the base-calling error probabilities [5] (see equation above). The average Phred quality scores for a randomly chosen FASTQ data file were plotted after left-side trimming (-trim_l) method invocations within fastQ_brew from position 1–20 (Fig. 1b). There was a negative correlation between increasing nucleotide position and Phred quality score (R² = −0.99969); that is, bases closer to the beginning of each read exhibit higher Phred quality scores than nucleotides closer to the middle of the read. This observation is in keeping with previous observations on Phred quality across reads [6–8] (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The data set used in this test comprised 462,664 reads with an average read length of 99 bases. The smallest read length was 25 bases and the largest was 100 bases.
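As an illustration of the Phred calculation, the standard definition relates a score Q to the error probability P by Q = −10 log10(P), so P = 10^(−Q/10). The Python sketch below decodes a quality string and averages scores by read position. It is not the fastQ_brew implementation: the function names are hypothetical, and the ASCII offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption that must be changed for older Illumina variants.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ ASCII offset


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Convert a Phred score to a base-calling error probability."""
    return 10 ** (-q / 10)


def mean_quality_by_position(qualities: List[str],
                             offset: int = PHRED_OFFSET) -> List[float]:
    """Average Phred score at each read position across reads.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for quality in qualities:
        for index, score in enumerate(phred_scores(quality, offset)):
            if index == len(totals):
                totals.append(0)
                counts.append(0)
            totals[index] += score
            counts[index] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Under this definition a score of 20 corresponds to a 1 in 100 chance of an incorrect base call, and a score of 30 to 1 in 1000.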
To further examine the quality filtering method of fastQ_brew, FASTQ data were downloaded from the NCBI sequence read archive (SRA—https://www.ncbi.nlm.nih.gov/sra) using the sra-toolkit (https://github.com/ncbi/sra-tools). Distribution of read quality was plotted prior to filtering (blue bars) and after filtering (red bars) using fastQ_brew revealing a shift in Phred scores towards increased quality after filtering (Fig. 1c).
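A quality filter of this general kind can be sketched as follows. The criterion used here, a minimum mean Phred score per read, is an assumption chosen for illustration and may differ from the rule fastQ_brew applies; the function names, the threshold, and the offset of 33 are likewise hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# (identifier, sequence, quality) triple; a hypothetical record layout
Record = Tuple[str, str, str]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read (offset 33 is an assumption)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_by_mean_quality(records: Iterable[Record],
                           min_mean_q: float,
                           offset: int = 33) -> Iterator[Record]:
    """Yield only the reads whose mean Phred score is at least min_mean_q."""
    for record in records:
        if mean_phred(record[2], offset) >= min_mean_q:
            yield record
```

Discarding low-scoring reads in this way shifts the distribution of the remaining reads towards higher Phred scores, which is the kind of change a before-and-after histogram would show.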
Finally, to compare fastQ_brew to other FASTQ filtering tools, I examined the execution time for some of the most commonly used filtering tools in trimming FASTQ data, and compared their execution speeds to that of fastQ_brew. For all analyses, the same FASTQ file was used, and in each case methods were invoked to trim 8 bases from the left and right sides of every read in the file. The following software was used: fastQ_brew ver 1.0.2; Trimmomatic ver 0.36 [9]; NGSQCToolkit ver 2.3.3 [6]; Prinseq ver 0.20.4 [10]; seqtk (https://github.com/lh3/seqtk); Fastxtoolkit ver 0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/index.html); BBDuk ver 37.22 (http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/); ngsShoRT ver 2.2 [11]; and Cutadapt ver 1.9.1 (http://journal.embnet.org/index.php/embnetjournal/article/view/200). For some other software tools, this exact invocation was not possible due to limitations on the trimming method. The data from this analysis are presented in Fig. 1d. fastQ_brew compares well with other commonly employed filtering tools. The fastest tool was BBDuk, which finished trimming all reads in only 1.532 s, followed very closely by seqtk, which completed the task in 1.99 s. Comparing across these tools gives some insight into how the execution speed of fastQ_brew compares with that of commonly used trimming software. However, it is important to point out that each tool offers many specific adaptations and features that are not reflected in a basic trimming task, and while speed is important when dealing with very large data-sets, other features that include accessibility, documentation, ease of use, as well as applicability of options are equally important.
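The benchmark task itself, removing a fixed number of bases from both ends of every read, can be sketched in a few lines of Python. This is an illustrative sketch, not the code of any tool listed above; `trim_read` is a hypothetical name, and returning an empty read when the trim exceeds the read length is an assumed policy.

```python
from typing import Tuple


def trim_read(sequence: str, quality: str,
              left: int = 0, right: int = 0) -> Tuple[str, str]:
    """Remove `left` bases from the 5' end and `right` from the 3' end.

    The quality string is trimmed identically so that each remaining
    base keeps its own quality character.
    """
    if left < 0 or right < 0:
        raise ValueError("trim lengths must be non-negative")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have equal length")
    end = len(sequence) - right
    if end <= left:
        return "", ""  # assumed policy: the read is trimmed away entirely
    return sequence[left:end], quality[left:end]
```

For the comparison described above, each read would be passed through `trim_read(sequence, quality, left=8, right=8)`.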
In summary, I describe fastQ_brew, a very lightweight Perl package for robust analysis, preprocessing, and manipulation of FASTQ sequence data files. The main advantage of fastQ_brew is its ease of use, as the software does not rely on any modules that are not currently contained within the Perl Core. fastQ_brew is freely available on GitHub at: https://github.com/dohalloran/fastQ_brew.
I thank members of the O’Halloran lab for critical reading of the manuscript.
The author declares no competing interests.
Availability of data and materials
Project name: fastQ_brew.
Project home page: https://github.com/dohalloran/fastQ_brew.
Operating system(s): Platform independent.
Programming language: Perl.
Other requirements: none.
Any restrictions to use by non-academics: no restrictions or login requirements.
The George Washington University (GWU) Columbian College of Arts and Sciences, GWU Office of the Vice-President for Research, and the GWU Department of Biological Sciences.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38(6):1767–71.
2. Kim M, Zhang X, Ligo JG, Farnoud F, Veeravalli VV, Milenkovic O. MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinform. 2016;17:94. doi:10.1186/s12859-016-0932-x.
3. Ferraro Petrillo U, Roscigno G, Cattaneo G, Giancarlo R. FASTdoop: a versatile and efficient library for the input of FASTA and FASTQ files for MapReduce Hadoop bioinformatics applications. Bioinformatics. 2017;33:1575–7.
4. Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998;8(3):175–85.
5. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998;8(3):186–94.
6. Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS ONE. 2012;7(2):e30619.
7. Schmieder R, Lim YW, Rohwer F, Edwards R. TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinform. 2010;11:341. doi:10.1186/1471-2105-11-341.
8. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinform. 2010;11:485. doi:10.1186/1471-2105-11-485.
9. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
10. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;27(6):863–4.
11. Chen C, Khaleel SS, Huang H, Wu CH. Software for pre-processing illumina next-generation sequencing short read sequences. Source Code Biol Med. 2014;9:8. doi:10.1186/1751-0473-9-8.