AdapterRemoval: easy cleaning of next-generation sequencing reads

Background With the advent of next-generation sequencing there is an increased demand for tools to pre-process and handle the vast amounts of data generated. One recurring problem is adapter contamination in the reads, i.e. the partial or complete sequencing of adapter sequences. These adapter sequences have to be removed as they can hinder correct mapping of the reads and influence SNP calling and other downstream analyses. Findings We present a tool called AdapterRemoval which is able to pre-process both single and paired-end data. The program locates and removes adapter residues from the reads, it is able to combine paired reads if they overlap, and it can optionally trim low-quality nucleotides. Furthermore, it can look for adapter sequence in both the 5’ and 3’ ends of the reads. This is a flexible tool that can be tuned to accommodate different experimental settings and sequencing platforms producing FASTQ files. AdapterRemoval is shown to be good at trimming adapters from both single-end and paired-end data. Conclusions AdapterRemoval is a comprehensive tool for analyzing next-generation sequencing data. It exhibits good performance both in terms of sensitivity and specificity. AdapterRemoval has already been used in various large projects and it is possible to extend it further to accommodate application-specific biases in the data.

Keywords: Next-generation sequencing, Adapter trimming, Data pre-processing, Sequence alignment, Paired-end reads, Single-end reads

Background
With the growing use of next-generation sequencing techniques in research groups all around the world there is also a growing need for tools that can help in the downstream analyses of the vast amounts of sequencing data produced. Moreover, as sequencing costs continue to drop [1], more and more groups can afford next-generation sequencing which further increases the need for efficient and accurate pre-processing of sequencing data. Consequently many groups have to deal with the same type of problems which leads to the in-house development of tools that already exist.
One of the problems encountered in many experiments -especially as read lengths keep expanding -is the sequencing of adapter fragments. If the read length, L R , is longer than the insert size, L I , then the read produced by the sequencing machine will include L A = L R − L I nucleotides from the adapter sequence. Depending on the library building protocol used, adapter fragments will be present in the 3' end of the read and possibly also the 5' end. If these fragments -denoted adapter contamination in this paper -are not removed correctly they can lead to either missed alignments because the sequenced construct does not match the genome or, if the read is mapped to the genome, a misleading increase in the number of mismatches in the end of the mapping. These reads containing adapter contamination can then lead to wrong genotyping and SNP calls further downstream in the analyses. Mismatches in the 5' end due to adapter contamination has an even higher probability of wrongfully discarding a genuine match since most mapping tools depend on a high-similarity seed region in the 5' end of the reads (e.g. the default behaviour of Bowtie [2], BWA [3], SOAP [4], and SOAP2 [5] is to allow no more than 2 mismatches in the seed region). http://www.biomedcentral.com/1756-0500/5/337 The problem becomes more dramatic the shorter the molecule of interest is -for instance when sequencing microRNAs, or within the field of ancient DNAalthough the problem is not isolated to these fields of research. It is therefore of great importance to clean the reads by removing these subsequences of non-genomic origin before mapping to a reference genome or performing de novo assembly of the reads. Since this is a general problem many different programs exist that try to solve it, each exhibiting their own strengths and weaknesses as summarized in Table 1. These methods vary in which features they offer when trimming adapters (e.g. handling single-end or paired-end data, finding adapters in the 5' or 3' end of reads, searching for multiple different adapters), and in what additional analyses can be performed such as trimming low-quality nucleotides or sorting the reads based on multiplexing barcodes.
We present a standalone tool, AdapterRemoval, that efficiently solves most of these problems simultaneously without the need for calling multiple different programs. AdapterRemoval can find adapters in both the 5' and 3' end of the reads, it can handle both singleend and paired-end data, it can remove low-quality regions and trim Ns from the reads, and it can collapse overlapping paired-end reads. AdapterRemoval has been independently developed in our group where it has been used (although as an unnamed part of the pipeline) in a number of large sequencing projects mainly focused on ancient DNA [28][29][30]. The tool has therefore been used frequently over the years and is still an integral part of the work at the Centre of Excellence in GeoGenetics in Copenhagen. AdapterRemoval has been updated and extended based on feedback and requests from the users to solve various problems encountered. It is a versatile tool that is easy to use on any UNIX-based platform.

Methods
AdapterRemoval uses a variation of the Needleman-Wunsch algorithm [31] that has been altered to perform ungapped semiglobal alignment looking for matches between the 3' end of the read and the 5' end of the adapter sequence. The specificities of the algorithm depends on the data (single-or paired-end) and on other settings as described in the following. The overall functionality of the program is illustrated in Figure 1.
If the insert being sequenced is shorter than the read length, the read will include part of the adapter sequence in the 3' end. In case of single-end reads AdapterRemoval performs alignment between the reads and the expected adapter sequence. When processing single-end The table summarizes the features of different programs that are aimed at removing adapter sequences from next-generation sequencing reads. For each method, the table shows if it is able to i) identify adapters in the 5' end of reads, ii) identify adapters in the 3' end of reads, iii) process single-end reads, iv) process paired-end reads, v) collapse overlapping pairs, vi) search for multiple different adapters, vii) trim subsequences of Ns, viii) trim low-quality nucleotides, ix) sort multiplexed reads based on barcodes. The last feature is not as such related to adapter trimming but since it occurs often in these programs it is included. The table also lists references for each program. Notes: 1) If chosen, reads with one or more Ns are discarded (i.e. Ns are not trimmed). 2) If chosen, low-quality reads are discarded (i.e. low-quality nucleotides are not trimmed). 3) Discards remaining read if the other read in a pair is removed due to trimming. 4) The aim of this program is slightly different as it compares the reads to a library of sequences and checks for significant overlap. http://www.biomedcentral.com/1756-0500/5/337 reads the identification of the adapter fragment becomes increasingly difficult the shorter it is. When analyzing paired-end data much more information is available: If we have adapter contamination it will be symmetrical in the two reads (if no indels occur, see Figure 1E) and the program can identify precisely even a single nucleotide from the adapters. The two reads (with one being reverse-complemented) will be identical in the overlapping region (i.e. the genomic insert) and the 5' and 3' ends of the reads, respectively, will match the adapters. Even when allowing for mismatches this makes the procedure extremely sensitive to adapter contamination in these cases.
The allowed mismatch rate can be set by the user but by default the program demands perfect match for alignments up to 5 nucleotides, allows 1 mismatch for alignments up to 10 nucleotides, and allows a fraction (0.15 for paired-end and 0.33 for single-end) of the alignment length to be mismatches for longer alignments. The program employs a simple yet effective scoring scheme: 1 for matches, -1 for mismatches, 0 for alignments to Ns. The alignment chosen is the best one in terms of total score where the number of mismatches is in the allowed range.
As gaps are much less common than mismatches in Illumina data [32] we do not include gapped alignments, and since we only calculate alignments between the 3' end of the read and the 5' end of the adapter we only need the top half of the dynamic programming matrix above the main diagonal ( Figure 2, panel 3). Observations have shown that reads sometimes miss a few bases in the 5' end. This can lead to adapter contamination being missed as the alignment is confined to the top half of the matrix thereby not aligning the two reads properly (Figure 2, panel 1 and  2). To solve this the alignments can be extended slightly which effectively shifts one read towards the 3' end by S nucleotides relative to the other read (default is S = 2). This creates an overhang in the 5' end of the adapter where the first few nucleotides are ignored since they are not in the sequenced read. The cost of this is that S additional subdiagonals have to be calculated in the matrix to include these alignments (Figure 2, panel 3).
AdapterRemoval allows overlapping pairs of reads to be collapsed into a single read whether the two contain adapter contamination or not (see Figure 1). This idea has also been pursued independently in the program FLASH published recently [33]. If the insert length, L I , is longer than the read length, L R , but shorter than 2 · L R , then we have no adapter contamination but the two reads overlap in their 3' ends. In that case the two reads can be combined into one read and the qualities for the overlap can be reestimated based on the two quality strings. If the two reads in a pair contain adapter sequence the remaining overlapping fragments of genomic origin will be from the same original sequence and can likewise be collapsed into one read and the qualities re-estimated.
To collapse two reads into one, AdapterRemoval treats the quality scores for the overlapping region as a position specific scoring matrix (PSSM). For each position in the overlap, we have a nucleotide and a quality score from both reads. The quality score, Q, can be converted to an error probability, P e = 10 −Q/10 . This gives the probability for the nucleotide in the read, P 1 = 1 − P e , and a probability for each of the remaining three nucleotides, P 2 = P e /3. These probabilities are combined for the two reads to give a single re-estimated probability distribution for the overlapping region. Finally, the most likely nucleotide sequence is chosen based on the PSSM, and the probabilities are translated back into re-estimated Phred quality scores [34]. If the user decides to do so AdapterRemoval will detect these cases and output the new nucleotide sequence and re-estimated quality scores. The user can http://www.biomedcentral.com/1756-0500/5/337

Figure 2
The need for shifting the alignment due to missing nucleotides. If the read is missing a few nucleotides in the 5' end, the proper alignment will not be recoverable if the procedure stops at the first position. As shown in 1), this leads to multiple mismatches and possibly missed adapter contamination. If the alignment is shifted by S nucleotides as shown in 2), the correct alignment can be found. The dynamic programming matrix in 3) shows which entries in the matrix leads to the two solutions shown here. The light grey part is the upper half of the matrix that is calculated by default; the two dark grey entries illustrate the two alignments shown in 1) and 2). specify how long the overlap has to be for two reads to be combined (default is 11 nucleotides as in [35]).
It is well-known that the quality of a read is lower in the ends [32], with elevated error rates at both the 5' andin particular -3' end of the reads. AdapterRemoval has therefore been designed to deal with this in two different ways: It is possible to trim consecutive stretches of the ambiguity character N from both ends of the reads. Since the presence of Ns can make mapping hard due to an increased number of spurious hits it is normally important to remove these uninformative positions. Furthermore, AdapterRemoval can trim the reads based on the quality scores by removing consecutive stretches of nucleotides from both ends of the reads where the quality scores do not exceed a given threshold (default is to trim 2 or lower). These two trimming options can of course be used either alone or in combination. Related to this the program also has the option of discarding reads that contain too many Ns even after trimming. How many Ns to allow is defined by the user.
Using AdapterRemoval it is also possible to find and remove adapters from the 5' end of the reads. However, due to the difference in the experimental setup this is done in a different and more strict manner than in the 3' end. First, at most one mismatch is tolerated in the aligned part of the read and the adapter. Second, it is expected that the 5' adapter sequence is present in almost full length. Hence, the trimming only allows for the adapter to have slipped a few positions corresponding to the first few nucleotides of the adapter not being present in the read. This parameter can be defined by the user but the default is up to two nucleotides, and these slipped positions do not add to the number of mismatches.
When AdapterRemoval is used it reads either one or two FASTQ files and depending on the settings the output is written to a number of files. In the single-end case, one file contains the trimmed reads, and another contains discarded reads (due to e.g. length or quality control). In the paired-end case the trimmed pairs are written to two new files that keep the ordering of the pairs intact. If one mate in a pair is discarded the remaining read is written to a singleton file in order to keep as much useful data as possible. These reads can then be treated as single-end reads. All discarded reads are written to a separate file. Adapter-Removal can work with compressed files using pipes as described in the user manual, the user can specify the quality base used (either Phred + 33 (default) or Phred + 64), and the user can specify the minimum length of a read after trimming (default is 15 nucleotides).
A simulated test set was created based on a modern paired-end dataset from Yersinia pestis (SRA accession SRX028780). From this dataset, 1,000,000 read pairs were extracted with each read being 75 nucleotides long. For each pair, a simulated insert length between 0 and 200 nucleotides was sampled. If the insert length was 150 nucleotides or more, the reads do not overlap and no changes were made to the data. If the insert length was between 149 and 75 nucleotides, the two reads overlap but we have no adapter contamination. In this case, a subsequence based on the insert length was taken from read 1, reverse-complemented and inserted into read 2, and then the new part of read 2 was randomly mutated based on the quality scores to simulate read errors. If the insert length was shorter than 75 nucleotides, a subsequence of read 1 was copied to read 2 as above, and furthermore adapter sequence was added to both reads from two different adapters. Finally, the new sequences were mutated based on the quality scores. This yields 1,000,000 read pairs with known adapter contamination in 373,963 cases and no adapter in the remaining 626,037 pairs.

Test results
The performance of AdapterRemoval was tested on the simulated paired-end dataset described above and compared to another program that is able to handle both single-end and paired-end data, Trimmomatic version http://www.biomedcentral.com/1756-0500/5/337 0.20 [27]. Trimmomatic was run as described on the website but changing the minimum read length to 15 after trimming to make it comparable to AdapterRemoval. For single-end analysis both programs were run on just the first read from each pair. In this test the programs were only used for trimming adapters and no filtering based on Ns or low-quality nucleotides was used.
After trimming, the output from each program was analyzed and five categories of cases were recorded: 1. How often did the program trim a read that did not contain adapter? 2. How often did the program trim only the adapter sequence? 3. How often did the program trim more than the adapter sequence? 4. How often did the program trim less than the adapter sequence? 5. How often did the program not trim anything from a read with adapter contamination?
The false positive rate is the sum of cases 1 and 3, i.e. the cases where the program trimmed nucleotides that were not from the adapter (even if the adapter sequence was also removed). The true positive rate is case 2 where only the adapter is removed. The sum of cases 4 and 5 is the false negative rate since, in both cases, the program failed to remove the full adapter sequence. The number of reads containing no adapter and not being trimmed at all is the true negative rate. From these numbers, positive predictive value, sensitivity, specificity and Matthew's correlation coefficient were calculated for both programs: The results are summarized in Table 2 together with run times and maximum memory usage as reported by the UNIX time command. AdapterRemoval runs slower than Trimmomatic but uses less memory.
Trimmomatic performs equally good on both singleend and paired-end data performing the exact same trimming. It has perfect specificity and positive predictive value at a modest sensitivity, yielding a MCC of 0.48. However, when looking at the results it is clear that in the paired-end case Trimmomatic trims fewer of the reverse reads (86,043 reads are trimmed exactly) thus missing more adapters in those cases (287,920). It is not clear why it does not trim both members of a pair the same. For this test, the best numbers were used in the calculations, and the program was run with all combinations of adapter sequences (both original and reverse-complemented) to make sure that the correct sequences were tested. Statistics for AdapterRemoval and Trimmomatic tested on both single-end and paired-end data. The test set contains 1,000,000 read pairs of which 373,963 contain adapter fragments and the remaining 626,037 do not. The paired-end test was performed on the 1,000,000 pairs, and the single-end test was performed on the first read from each pair. The reported numbers are: Run time (seconds); max memory usage (kilobytes); number of reads containing no adapter that were trimmed; number of reads with adapter where just the adapter was removed; number of reads with adapter where more than the adapter was trimmed; number of trimmed reads with adapter where the full adapter was not removed; number of reads with adapter where no trimming was performed; positive predictive value; sensitivity; specificity; Matthew's correlation coefficient. For "Trimmed, no adapter", the percentage of the 626,037 reads with no adapter that were trimmed is shown. For the following four rows, the percentages are of the 373,963 read pairs with adapter, and these numbers add up to 373,963 and 100%, respectively. http://www.biomedcentral.com/1756-0500/5/337 In the single-end case, AdapterRemoval shows good performance with lower specificity and positive predictive value than Trimmomatic but also a much higher sensitivity and, hence, MCC of 0.71. AdapterRemoval trims many more adapters correctly but also wrongly trims more reads without adapters.
As expected, all the measures of accuracy go up when using AdapterRemoval on paired-end data compared to single-end data. The extra information available in having two reads that align in case of adapter contamination makes AdapterRemoval much better at removing only true adapter residues from the reads. This is especially clear from the false positive rate that drops by almost a factor 1000. The MCC is increased from 0.71 to 0.94.
As mentioned above, Trimmomatic was run using the the default parameters given on the website and only changing parameters to make it directly comparable to AdapterRemoval. It is likely that Trimmomatic would perform better if the parameters were tweaked which has not been done in this experiment. However, based on this test AdapterRemoval shows good performance on all measures. A test where both programs also trimmed Ns and low-quality nucleotides showed the same overall results as above. Future work on AdapterRemoval should focus on improving the run time and including an option for trimming multiple adapters simultaneously.