CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies
© Schlötterer et al; licensee BioMed Central Ltd. 2010
Received: 30 November 2009
Accepted: 11 January 2010
Published: 11 January 2010
Next generation sequencing (NGS) technologies have substantially increased the sequence output while the costs were dramatically reduced. In addition to the use in whole genome sequencing, the 454 GS-FLX platform is becoming a widely used tool for biodiversity surveys based on amplicon sequencing. In order to use NGS for biodiversity surveys, software tools are required, which perform quality control, trimming of the sequence reads, removal of PCR primers, and generation of input files for downstream analyses. A user-friendly software utility that carries out these steps is still lacking.
We developed CANGS (C leaning and A nalyzing N ext G eneration S equences) a flexible and user-friendly integrated software utility: CANGS is designed for amplicon based biodiversity surveys using the 454 sequencing platform. CANGS filters low quality sequences, removes PCR primers, filters singletons, identifies barcodes, and generates input files for downstream analyses. The downstream analyses rely either on third party software (e.g.: rarefaction analyses) or CANGS-specific scripts. The latter include modules linking 454 sequences with the name of the closest taxonomic reference retrieved from the NCBI database and the sequence divergence between them. Our software can be easily adapted to handle sequencing projects with different amplicon sizes, primer sequences, and quality thresholds, which makes this software especially useful for non-bioinformaticians.
CANGS performs PCR primer clipping, filtering of low quality sequences, links sequences to NCBI taxonomy and provides input files for common rarefaction analysis software programs. CANGS is written in Perl and runs on Mac OS X/Linux and is available at http://i122server.vu-wien.ac.at/pop/software.html
Next generation sequencing technologies have dramatically increased the sequence output at a substantially reduced cost. In addition to genome sequencing and transcriptome profiling, ultra-deep sequencing of short amplicons offers an enormous potential in clinical studies  and in studies of ecological diversity . PCR amplicons of more than 400 bp can be sequenced in a massively parallel manner which allows building a fine-grained catalog of species abundance patterns in a broad range of habitats. This increase in the amount of sequence data requires efficient software tools for processing the raw data generated by next generation sequencers.
We developed CANGS - a flexible and user-friendly utility to trim sequences, filter low quality sequences, and produce input files for further downstream analyses. CANGS can be used to assign the taxonomic grouping based on similarity with sequences from the NCBI database .
CANGS has been developed for Mac OS X but it also works on Linux and any other Unix system. CANGS can be obtained from http://i122server.vu-wien.ac.at/pop/software.html. [See additional file 1 for the source code of CANGS, additional file 2 for test dataset of CANGS and additional file 3 for the CANGS user manual]
Required programs are BLAST  for the similarity search and MAFFT  for pairwise distance calculation. MOTHUR  and Analytic Rarefaction  are needed for estimation of the number of species (OTUs), and update_blastdb.pl  is required for downloading the BLAST database on a local computer.
Schema for processing and analyzing 454 GS-FLX sequences
Figure 1 shows the way in which the CANGS utility processes 454-sequence data sets. The arrows illustrate the path of data flow. As a preparation step for CANGS, the options file CANGSOptions.txt needs to be customized. This file allows the user to specify all parameters needed for the processing of the 454 sequences. CANGS provides two layers of analysis: the Sequence Processing Layer is the first step, in which tsfs.pl trims the sequences (removal of PCR primers, adapter sequence and sample identifiers) and filters low quality sequences (sequences with Ns, singletons, and sequences with very low average quality score). The script tsfs.pl creates two high quality processed sequence data sets: 1) redundant sequences and 2) non-redundant sequences by using the user-defined parameters in the options file. The second step is the Sequence Analysis Layer in which three different programs are available to assign the newly sequenced reads to a taxonomic group (ta.pl), estimate the change in species composition among different samples (ba.pl), and to measure species richness (ra.pl).
CANGS input customization
CANGS configuration file -- CANGSOptions.txt. CANGS was designed to allow a high flexibility for the user. In the options file the user defines the parameters that will be used by all CANGS modules. This simplified customization increases the usability and integration of the utility because the multiple programs can reference a single options file. The parameters include BLAST cutoff values, quality scores, PCR primers, barcodes, size range of PCR products etc.
Sequence trimming and quality filtering
The tsfs.pl (Trim Sequences and Filter low quality Sequences) program automates the processing of raw 454 sequences.
Sample Identifier (bar code)
Forward PCR Primer
454 adapter B
The goal of tsfs.pl is to obtain the high quality reads from pooled 454 sequences by trimming the raw sequences and filtering low quality reads which is done in seven steps.
1. Removal of adapter B
based on the sequence of adapter B, as specified in the CANGSOptions.txt file, the 3'- end of each read is trimmed. It is possible to process only sequences with a perfect match to adapter B, but a pattern search that allows for imperfection in adapter B recovers more sequences.
2. Filtering sequences with ambiguities
tsfs.pl removes reads with one or more Ns (unknown bases).
3. Removal of singletons
to ameliorate the problem of sequencing errors tsfs.pl allows the user to remove very low frequency variants from the data set. Note that several data sets could be combined to minimize the removal of true low frequency sequence variants.
4. Grouping of sequences according to bar codes
tsfs.pl distinguishes different samples based on the bar codes specified in the CANGSOptions.txt file and separates them into different data sets. This step is skipped when only a single sample is processed
5. Filtering sequences according to length threshold
the tsfs.pl program removes sequence reads falling outside the size range specified in the options file.
6. Removal of PCR primers
forward and reverse PCR primers are specified in the CANGSOptions.txt file and removed from the sequence. Only sequences with perfect identity to the specified PCR primers are processed. The 454 sequencing process preferentially generates length variants in homopolymers. As homopolymers can be as short as two bases and the target sequence is frequently not known, we developed a special procedure to recognize such sequencing errors at the end of the PCR primer:
for all sequences with the same PCR primers the tsfs.pl program scans 8 bp of the target sequence immediately adjacent to the PCR primer and identifies the most frequent 8 bp motif. Next, this consensus sequence is compared for the +1, and -1 offset of each sequence. For sequences with no 454 homopolymer mutation both the +1 and -1 offset results in many mismatches, but a read with a 454 homopolymer mutation at the end of the PCR primer will be very similar to either the +1 or -1 offset. We empirically determined that filtering reads with <3 mismatches very effectively removes reads with a 454 homopolymer mutation at the transition between target sequence and PCR primer. Hence, tsfs.pl removes all reads with <3 mismatches in the +1 or -1 offset.
7. Quality filtering
CANGS averages the quality values for each base in a read. Quality values are taken from the .qual file after the values corresponding to adapter B, bar code and primer bases have been removed in step 1, step 4 and step 6, respectively. Sequence reads with a quality value lower than the threshold specified in the options file will be discarded. Note that the quality filtering may result in new singletons, which remain in the data set, as the quality filtering is the last step in the analysis.
After trimming the sequence reads, tsfs.pl creates a non-redundant sequence data set in order to reduce the computational burden for further analysis. In the non-redundant sequence data set each sequence variant is only represented once. Note that in this step indels are considered to be informative. Hence, two sequences differing only by an indel will be listed independently in the non-redundant data set. The frequency of each sequence in the non-redundant data set is included in the FASTA header. The output file contains non-redundant reads ranked based on copy number in descending order.
Number of reads eliminated at different steps of the tsfs.pl module
Order of steps
Total no. of sequences
No. of sequences considered
No. of sequences discarded
Removal of Adapter B
Filtering sequences with ambiguities
Removal of singletons
Grouping of sequences according to bar codes
Filtering sequences according to length threshold
Removal of PCR primers
The sequence analysis module of CANGS performs various downstream analyses: 1) linking 454 reads with the taxonomic description of the most similar sequence from the NCBI database by ta.pl (Taxonomy Analysis) program, 2) measuring the overlap of OTUs between samples by ba.pl (Blast Analysis) program and 3) estimating species richness by the ra.pl (Rarefaction Analysis) module.
Hence, gaps are not considered.
ra.pl. Several software packages exist for performing rarefaction analysis [7, 8, 10]. The script ra.pl (Rarefaction Analysis) program links the data processed by CANGS with two popular rarefaction analysis software packages with minimal user interference: MOTHUR  and Analytic Rarefaction . For MOTHUR ra.pl is calculating the pairwise genetic distance by using the "mafft-distance" program of MAFFT executables . The mafft-distance program takes non-redundant sequences generated by tsfs.pl as input and gives the corresponding genetic distance table as output. For the Analytic Rarefaction software, ra.pl first calculates the abundance of each sequence in the data set using BLAST, as described above. Compared to pattern matching this procedure allows to consider sequences with gaps jointly.
CANGS is a user-friendly tool for primer clipping and quality filtering of 454 sequences. CANGS is primarily designed to handle data from amplicon resequencing projects in the context of diversity studies. The tool can be downloaded at http://i122server.vu-wien.ac.at/pop/software.html.
Availability & requirements
Project name: CANGS--Cleaning and Analyzing 454 GS-FLX sequences.
Operating System: Mac OS X, Linux and any other UNIX like system
Programming language: Perl 5.8.8
Other requirements: BioPerl, BLAST, MAFFT, MOTHUR, Analytic Rarefaction.
License: GNU General Public License.
Any restrictions to use by non-academics: license needed.
We are thankful to members of the Institut für Populationsgenetik and Jens Boenigk for helpful discussion. This work was supported by FWF grants (No. P19467-B11) to CS.
- Thomas RK, Nickerson E, Simons JF, Jänne PA, Tengs T, Yuza Y, Garraway LA, LaFramboise T, Lee JC, Shah K: Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nature medicine. 2006, 12 (7): 852-855. 10.1038/nm1437.PubMedView Article
- Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML: Microbial population structures in the deep marine biosphere. Science. 2007, 318 (5847): 97-100. 10.1126/science.1146689.PubMedView Article
- NCBI (National Center for Biotechnology Information). [http://www.ncbi.nlm.nih.gov/]
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Research. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.PubMed CentralPubMedView Article
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology. 1990, 215 (3): 403-410.PubMedView Article
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research. 2005, 33 (2): 511-518. 10.1093/nar/gki198.PubMed CentralPubMedView Article
- Schloss PD, Handelsman J: Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Applied and Environmental Microbiology. 2005, 71 (3): 1501-1506. 10.1128/AEM.71.3.1501-1506.2005.PubMed CentralPubMedView Article
- Analytic Rarefaction 1.4. [http://www.huntmountainsoftware.com/]
- Update_blastdb.pl. [http://www.ncbi.nlm.nih.gov/BLAST/docs/update_blastdb.pl]
- Yu Y, Breitbart M, McNairnie P, Rohwer F: FastGroupII: a web-based bioinformatics platform for analyses of large 16S rDNA libraries. BMC Bioinformatics. 2006, 7 (7): 57-10.1186/1471-2105-7-57.PubMed CentralPubMedView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.