Open Access

CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies

  • Ram Vinay Pandey1,
  • Viola Nolte1 and
  • Christian Schlötterer1Email author
BMC Research Notes20103:3

DOI: 10.1186/1756-0500-3-3

Received: 30 November 2009

Accepted: 11 January 2010

Published: 11 January 2010



Next generation sequencing (NGS) technologies have substantially increased the sequence output while the costs were dramatically reduced. In addition to the use in whole genome sequencing, the 454 GS-FLX platform is becoming a widely used tool for biodiversity surveys based on amplicon sequencing. In order to use NGS for biodiversity surveys, software tools are required, which perform quality control, trimming of the sequence reads, removal of PCR primers, and generation of input files for downstream analyses. A user-friendly software utility that carries out these steps is still lacking.


We developed CANGS (C leaning and A nalyzing N ext G eneration S equences) a flexible and user-friendly integrated software utility: CANGS is designed for amplicon based biodiversity surveys using the 454 sequencing platform. CANGS filters low quality sequences, removes PCR primers, filters singletons, identifies barcodes, and generates input files for downstream analyses. The downstream analyses rely either on third party software (e.g.: rarefaction analyses) or CANGS-specific scripts. The latter include modules linking 454 sequences with the name of the closest taxonomic reference retrieved from the NCBI database and the sequence divergence between them. Our software can be easily adapted to handle sequencing projects with different amplicon sizes, primer sequences, and quality thresholds, which makes this software especially useful for non-bioinformaticians.


CANGS performs PCR primer clipping, filtering of low quality sequences, links sequences to NCBI taxonomy and provides input files for common rarefaction analysis software programs. CANGS is written in Perl and runs on Mac OS X/Linux and is available at


Next generation sequencing technologies have dramatically increased the sequence output at a substantially reduced cost. In addition to genome sequencing and transcriptome profiling, ultra-deep sequencing of short amplicons offers an enormous potential in clinical studies [1] and in studies of ecological diversity [2]. PCR amplicons of more than 400 bp can be sequenced in a massively parallel manner which allows building a fine-grained catalog of species abundance patterns in a broad range of habitats. This increase in the amount of sequence data requires efficient software tools for processing the raw data generated by next generation sequencers.

We developed CANGS - a flexible and user-friendly utility to trim sequences, filter low quality sequences, and produce input files for further downstream analyses. CANGS can be used to assign the taxonomic grouping based on similarity with sequences from the NCBI database [3].

CANGS has been developed for Mac OS X but it also works on Linux and any other Unix system. CANGS can be obtained from [See additional file 1 for the source code of CANGS, additional file 2 for test dataset of CANGS and additional file 3 for the CANGS user manual]


CANGS software utility is written in PERL 5.8 using BioPerl [4]. The main workflow is depicted in Figure 1. The entire analysis is guided by a configuration file - CANGSOptions.txt and four PERL programs (,, and CANGS can be run on Mac OS, Linux and other Unix like systems.
Figure 1

The architecture of CANGS utility. The four major components of the CANGS are (Trimming Sequences and Filtering Sequences), Analysis), (Blast Analysis) and (Rarefaction Analysis). All these four components are connected to a single configuration file "CANGSOptions.txt" to take inputs.

Required programs are BLAST [5] for the similarity search and MAFFT [6] for pairwise distance calculation. MOTHUR [7] and Analytic Rarefaction [8] are needed for estimation of the number of species (OTUs), and [9] is required for downloading the BLAST database on a local computer.


Schema for processing and analyzing 454 GS-FLX sequences

Figure 1 shows the way in which the CANGS utility processes 454-sequence data sets. The arrows illustrate the path of data flow. As a preparation step for CANGS, the options file CANGSOptions.txt needs to be customized. This file allows the user to specify all parameters needed for the processing of the 454 sequences. CANGS provides two layers of analysis: the Sequence Processing Layer is the first step, in which trims the sequences (removal of PCR primers, adapter sequence and sample identifiers) and filters low quality sequences (sequences with Ns, singletons, and sequences with very low average quality score). The script creates two high quality processed sequence data sets: 1) redundant sequences and 2) non-redundant sequences by using the user-defined parameters in the options file. The second step is the Sequence Analysis Layer in which three different programs are available to assign the newly sequenced reads to a taxonomic group (, estimate the change in species composition among different samples (, and to measure species richness (

CANGS components

CANGS input customization

CANGS configuration file -- CANGSOptions.txt. CANGS was designed to allow a high flexibility for the user. In the options file the user defines the parameters that will be used by all CANGS modules. This simplified customization increases the usability and integration of the utility because the multiple programs can reference a single options file. The parameters include BLAST cutoff values, quality scores, PCR primers, barcodes, size range of PCR products etc.

Sequence trimming and quality filtering

The (Trim Sequences and Filter low quality Sequences) program automates the processing of raw 454 sequences.

A typical 454 read consists of (5'- to 3'-end):
  1. 1.

    Sample Identifier (bar code)

  2. 2.

    Forward PCR Primer

  3. 3.

    Target Sequence

  4. 4.

    Reverse Primer

  5. 5.

    454 adapter B


The goal of is to obtain the high quality reads from pooled 454 sequences by trimming the raw sequences and filtering low quality reads which is done in seven steps.

1. Removal of adapter B

based on the sequence of adapter B, as specified in the CANGSOptions.txt file, the 3'- end of each read is trimmed. It is possible to process only sequences with a perfect match to adapter B, but a pattern search that allows for imperfection in adapter B recovers more sequences.

2. Filtering sequences with ambiguities removes reads with one or more Ns (unknown bases).

3. Removal of singletons

to ameliorate the problem of sequencing errors allows the user to remove very low frequency variants from the data set. Note that several data sets could be combined to minimize the removal of true low frequency sequence variants.

4. Grouping of sequences according to bar codes distinguishes different samples based on the bar codes specified in the CANGSOptions.txt file and separates them into different data sets. This step is skipped when only a single sample is processed

5. Filtering sequences according to length threshold

the program removes sequence reads falling outside the size range specified in the options file.

6. Removal of PCR primers

forward and reverse PCR primers are specified in the CANGSOptions.txt file and removed from the sequence. Only sequences with perfect identity to the specified PCR primers are processed. The 454 sequencing process preferentially generates length variants in homopolymers. As homopolymers can be as short as two bases and the target sequence is frequently not known, we developed a special procedure to recognize such sequencing errors at the end of the PCR primer:

for all sequences with the same PCR primers the program scans 8 bp of the target sequence immediately adjacent to the PCR primer and identifies the most frequent 8 bp motif. Next, this consensus sequence is compared for the +1, and -1 offset of each sequence. For sequences with no 454 homopolymer mutation both the +1 and -1 offset results in many mismatches, but a read with a 454 homopolymer mutation at the end of the PCR primer will be very similar to either the +1 or -1 offset. We empirically determined that filtering reads with <3 mismatches very effectively removes reads with a 454 homopolymer mutation at the transition between target sequence and PCR primer. Hence, removes all reads with <3 mismatches in the +1 or -1 offset.

7. Quality filtering

CANGS averages the quality values for each base in a read. Quality values are taken from the .qual file after the values corresponding to adapter B, bar code and primer bases have been removed in step 1, step 4 and step 6, respectively. Sequence reads with a quality value lower than the threshold specified in the options file will be discarded. Note that the quality filtering may result in new singletons, which remain in the data set, as the quality filtering is the last step in the analysis.

After trimming the sequence reads, creates a non-redundant sequence data set in order to reduce the computational burden for further analysis. In the non-redundant sequence data set each sequence variant is only represented once. Note that in this step indels are considered to be informative. Hence, two sequences differing only by an indel will be listed independently in the non-redundant data set. The frequency of each sequence in the non-redundant data set is included in the FASTA header. The output file contains non-redundant reads ranked based on copy number in descending order.

Figure 2B shows an example of a FASTA header for a non-redundant sequence.
Figure 2

An example output of the sequence trimming and quality filtering module in CANGS. An example of the final output of Sequence trimming and quality filtering module ( (A) example of redundant sequence data set and (B) example of the non-redundant sequence data set.

To demonstrate the utility of CANGS, we used 454 sequences, which have been deposited in the NCBI database [NCBI: SRA008706.2]. This data set consists of 447,909 reads from the 18S rRNA gene obtained from 10 temporal freshwater samples. Applied to our example data set, the program eliminated approximately 37% of all sequences (Table 1). Hence a total of 281,003 (~63%) sequences could be used for downstream analyses. On Macintosh OS X version 10.6.2 with a single processor, CANGS ( takes 6.5 hours for processing this data set. If the user skips the removal of singletons the program takes only 20 minutes for the same data set.
Table 1

Number of reads eliminated at different steps of the module

Order of steps


Total no. of sequences

No. of sequences considered

No. of sequences discarded


Removal of Adapter B





Filtering sequences with ambiguities





Removal of singletons





Grouping of sequences according to bar codes





Filtering sequences according to length threshold





Removal of PCR primers





Quality filtering





Total Sequences




Sequence Analysis

The sequence analysis module of CANGS performs various downstream analyses: 1) linking 454 reads with the taxonomic description of the most similar sequence from the NCBI database by (Taxonomy Analysis) program, 2) measuring the overlap of OTUs between samples by (Blast Analysis) program and 3) estimating species richness by the (Rarefaction Analysis) module.

Taxonomy Analysis, this program classifies the processed 454 reads by assessing their similarity to taxonomic entries in the NCBI database. This analysis requires the nucleotide preformatted BLAST database from to be installed, which is done using the perl program "" [9]. The script BLASTs the non-redundant sequences against this database. In a second step the best hit(s) from the BLAST search are used to retrieve the taxonomic path - either for all sequences or only for a taxonomic group of interest. In the case of multiple best hits with identical E-value, this program selects the hit with the most detailed taxonomic classification and links it with the non-redundant query sequence. If CANGS identifies a conflict we provide the option to assign taxonomic status by the majority rule. The partial output of this program is shown in Figure 3.
Figure 3

An example output of taxonomy analysis in CANGS. An example of final output of the Taxonomy Analysis ( module. It is a tabular output, the columns from left to right are 1) Query sample non-redundant sequences 2) BLAST percent similarity 3) BLAST e-value 4) Closest NCBI sequence Accession ID 5) NCBI sequence species name and 6) Taxonomic path of the closest NCBI sequence.

BLAST Analysis Studies of species diversity are frequently designed to compare species richness and species composition among different samples. The (Blast Analysis) program performs a BLAST analysis of non-redundant sequences in one sample against non-redundant sequences in any number of other samples. For user convenience the software automatically generates the BLAST database(s) required for the analyses. The output of the BLAST search(es), is parsed and a tabular output is created. As it may be of interest to group sequences with different similarities, the program could be customized to group sequences up to a specified similarity. A similarity cutoff of 100 should be used to group only identical sequences (ignoring gaps). In the tabular output, the number of sequences shared between the two data sets is reported for every species as shown in Figure 4. The similarities given in the output are calculated as follows:
Figure 4

An example output of BLAST analysis in CANGS. An example of final output of the BLAST Analysis ( module. It is a tabular output; the columns from left to right are 1) query sample non-redundant sequences 2) copy number of a query sequence in the same sample 3-7) copy number of query sequence in different target samples.

Hence, gaps are not considered.

Rarefaction Analysis Several software packages exist for performing rarefaction analysis [7, 8, 10]. The script (Rarefaction Analysis) program links the data processed by CANGS with two popular rarefaction analysis software packages with minimal user interference: MOTHUR [7] and Analytic Rarefaction [8]. For MOTHUR is calculating the pairwise genetic distance by using the "mafft-distance" program of MAFFT executables [6]. The mafft-distance program takes non-redundant sequences generated by as input and gives the corresponding genetic distance table as output. For the Analytic Rarefaction software, first calculates the abundance of each sequence in the data set using BLAST, as described above. Compared to pattern matching this procedure allows to consider sequences with gaps jointly.


CANGS is a user-friendly tool for primer clipping and quality filtering of 454 sequences. CANGS is primarily designed to handle data from amplicon resequencing projects in the context of diversity studies. The tool can be downloaded at

Availability & requirements

Project name: CANGS--Cleaning and Analyzing 454 GS-FLX sequences.


Operating System: Mac OS X, Linux and any other UNIX like system

Programming language: Perl 5.8.8

Other requirements: BioPerl, BLAST, MAFFT, MOTHUR, Analytic Rarefaction.

License: GNU General Public License.

Any restrictions to use by non-academics: license needed.



We are thankful to members of the Institut für Populationsgenetik and Jens Boenigk for helpful discussion. This work was supported by FWF grants (No. P19467-B11) to CS.

Authors’ Affiliations

Institut für Populationsgenetik, Veterinärmedizinische Universität Wien


  1. Thomas RK, Nickerson E, Simons JF, Jänne PA, Tengs T, Yuza Y, Garraway LA, LaFramboise T, Lee JC, Shah K: Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nature medicine. 2006, 12 (7): 852-855. 10.1038/nm1437.PubMedView ArticleGoogle Scholar
  2. Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML: Microbial population structures in the deep marine biosphere. Science. 2007, 318 (5847): 97-100. 10.1126/science.1146689.PubMedView ArticleGoogle Scholar
  3. NCBI (National Center for Biotechnology Information). []
  4. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Research. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.PubMed CentralPubMedView ArticleGoogle Scholar
  5. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology. 1990, 215 (3): 403-410.PubMedView ArticleGoogle Scholar
  6. Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research. 2005, 33 (2): 511-518. 10.1093/nar/gki198.PubMed CentralPubMedView ArticleGoogle Scholar
  7. Schloss PD, Handelsman J: Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Applied and Environmental Microbiology. 2005, 71 (3): 1501-1506. 10.1128/AEM.71.3.1501-1506.2005.PubMed CentralPubMedView ArticleGoogle Scholar
  8. Analytic Rarefaction 1.4. []
  9. []
  10. Yu Y, Breitbart M, McNairnie P, Rohwer F: FastGroupII: a web-based bioinformatics platform for analyses of large 16S rDNA libraries. BMC Bioinformatics. 2006, 7 (7): 57-10.1186/1471-2105-7-57.PubMed CentralPubMedView ArticleGoogle Scholar


© Schlötterer et al; licensee BioMed Central Ltd. 2010

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.