454 antibody sequencing - error characterization and correction
© Dimitrov et al; licensee BioMed Central Ltd. 2011
Received: 1 July 2011
Accepted: 12 October 2011
Published: 12 October 2011
454 sequencing is currently the method of choice for sequencing antibody repertoires and libraries containing large numbers (10^6 to 10^12) of different molecules with similar frameworks and variable regions, which poses significant challenges for identifying sequencing errors. Identification and correction of sequencing errors in such mixtures is especially important for the exploration of complex maturation pathways and the identification of putative germline predecessors of highly somatically mutated antibodies. To quantify and correct the errors incorporated in 454 antibody sequencing, we sequenced six antibodies at different known concentrations twice over and compared the reads with the corresponding known sequences as determined by standard Sanger sequencing.
We found that 454 antibody sequencing could produce approximately 20% incorrect reads, mostly due to insertions at short homopolymer regions of 2-3 nucleotides and, less often, to insertions, deletions and other variants at random sites. Correcting these errors might reduce the population of erroneous reads to 5-10%. However, errors accounting for 4-8% of the total reads could not be corrected unless sequencing is repeated several times, which may not be possible for large diverse libraries and repertoires, including complete sets of antibodies (antibodyomes).
The experimental test procedure carried out to assess 454 antibody sequencing errors reveals a high proportion (up to 20%) of incorrect reads; the errors can be reduced to 5-10% but no further, which calls for caution to avoid false discovery of antibody variants and diversity.
The high-throughput 454 sequencing method has been applied to antibodies, but the errors associated with sequencing antibody repertoires and libraries have not yet been quantified. Recently, antibodies from normal humans and patients have been sequenced and analyzed to identify the usage of different allelic genes, mutations and clonal expansions, furthering our understanding of the immune repertoire and its clinical applications [1–3]. The extremely sensitive relationship between antibody sequence and function requires more accurate 454 sequencing. Antibodies have similar frameworks interspersed with highly variable regions and are formed by the recombination of two or three different genes, VJ for light and VDJ for heavy chains. Consequently, the available error-correcting methods and algorithms developed for high-throughput sequence data [4–6] may not be relevant to 454 antibody sequencing. Although potential errors due to single nucleotide substitutions and small indels (insertions and deletions) associated with 454 pyrosequencing are known, exact error quantification and measurement of precision are not possible without repeat sequencing runs of antibody targets of known sequence. Therefore, we performed 454 sequence analyses of six different antibodies at varied concentrations twice over and compared the reads with the original sequences determined by standard Sanger sequencing. This allowed us to identify the types of errors, estimate error rates, and suggest corrections applicable to 454 antibody sequencing for better confidence in the assessment of data quality.
Results and Discussion
Table: 454 sequencing of six different antibodies at different concentrations (3-fold dilution) produced different numbers of sequences (numbers in parentheses denote the results from the second run). Columns: Number of molecules; Number of 454 sequences.
Table: The consequences of 454 antibody sequencing errors for the functionality of antibodies, as determined by IMGT/HighV-QUEST using sequence data obtained from the two repeat 454 runs, 1 and 2. Columns: Run 1 (%); Run 2 (%).
Insertions/deletions occurred in multiple reads, mostly at homopolymeric regions of 2-3 nucleotides in length as well as at random sites; examples are shown for run 1 of control antibody #1.
Number of Sequences
Description of variation
yes: 3G in place of 4
yes: 4A in place of 3
yes: 4T in place of 3
yes: 4G in place of 2; no: 1 random G
2 + 1 insertion
yes: 3G in place of 4
yes: 4G in place of 3
yes: 4G in place of 3
yes: 4C in place of 3
yes: 3G in place of 2; no: 1 random G
1 + 1 insertion
yes: 3G in place of 2
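The homopolymer entries in the table above (for example, "4G in place of 3") describe a change in the length of an existing run of identical bases. A minimal sketch of how such run-length differences could be listed is given below; it is an illustration only and not the procedure used in this study. The function names are hypothetical, and the sketch assumes that the read and the reference differ only in the lengths of runs, so any read with an insertion at a random site is reported as not comparable.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Collapse a nucleotide sequence into (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq.upper())]


def describe_homopolymer_indels(reference, read):
    """List run-length differences between a read and its reference.

    Assumption: both sequences collapse to the same series of bases, so
    every difference is a length change inside an existing run. Returns
    None when the run structure differs (e.g. a random-site insertion).
    """
    ref_runs = homopolymer_runs(reference)
    read_runs = homopolymer_runs(read)
    if [base for base, _ in ref_runs] != [base for base, _ in read_runs]:
        return None
    return [
        f"{read_len}{base} in place of {ref_len}"
        for (base, ref_len), (_, read_len) in zip(ref_runs, read_runs)
        if ref_len != read_len
    ]


if __name__ == "__main__":
    # One extra G in a run of three G's.
    print(describe_homopolymer_indels("ACGGGTTA", "ACGGGGTTA"))
```

Note that a single base is treated as a run of length 1, so a duplicated single base would also be reported by this sketch; a stricter definition of a homopolymer would require a minimum reference run length of 2.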
Importantly, we found that the errors caused by single nucleotide substitutions were difficult to identify, which may limit how far the errors can be corrected. To locate these errors, we performed multiple sequence alignment of the erroneous sequences with single substitution errors, which indicated that these errors were distributed stochastically along the frameworks and CDR regions, at up to 4% and 8% for runs 1 and 2, respectively. The distribution of these substitution errors is shown in additional file 3. Most of these single substitutions resulted in replacement mutations and therefore could not be easily detected unless they changed invariant residues such as the cysteine and tryptophan that define the boundaries of the CDR3, as observed in a few cases. Similarly, mixed variants with 2 or more changes involving different error types were identified at a rate of ~10%; with the exception of insertions or deletions, these would have been difficult to detect and rectify by computational post-processing.
Our findings indicate that 454 antibody sequencing produces about 60% accurate reads, which could routinely be improved to above 80% by correcting insertions/deletions occurring at homopolymer sites as well as random sites, provided they are represented by multiple reads and result in frameshifts, stop codons or modification of conserved residues. We noted that errors involving 2 or more nucleotide changes might be challenging to correct; however, post-processing methods such as IMGT/V-QUEST, together with antibody-specific error detection and correction algorithms yet to be developed, might improve the accuracy to 90-95%. The randomly occurring single nucleotide substitution errors, accounting for 4-8% of the total reads as observed in runs 1 and 2 of the larger data sets of control antibodies, caused replacement mutations that in many instances cannot be detected. These replacement errors may contribute to the anticipated residual errors after post-sequencing error correction. Therefore, this type of error could potentially lead to false discovery of novel variants or diversity unless verified with repeated runs or observed in multiple clonally related sequences. We suggest that only those single nucleotide substitutions can be detected and corrected that affect the invariant residues of the highly conserved frameworks, or residues in the complementarity determining regions that are conserved among the alleles and clonally related sequences of polyclonal repertoires. These results could be useful for the identification and correction of 454 antibody sequencing errors.
We used primers that were synthesized to include the Roche A and B adaptor sequences along with target amplification sequences: ControlF- 5'-CCATCTCATCCCTGCGTGTCTCCGACTCAGGCCACCAGCCATGGCC-3' (sense primer) and HR2:5'-/5BioTEG/- CCTATCCCCTGTGTGCCTTGGCAGTCTCAGGTCACAAGATTTGGGCTCAAC -3' (antisense primer), where 5BioTEG is a 5'-biotin-TEG moiety conjugated to the 5' end of the primer. The gene fragments were PCR amplified through 12 cycles, and all other details followed the Roche 454 sequencing technical bulletin. Six different DNA samples encoding antibody heavy chains of different lengths and known sequences (see additional file 4 for sequences in FASTA format) were prepared at 3-fold serial dilution and subjected to pyrosequencing using the Roche/454 Genome Sequencer FLX. The number of molecules of antibody #1 was empirically estimated from the DNA content and concentration and was set to approximately 100. The numbers of molecules of the remaining antibodies were calculated by taking into account the 3-fold serial dilution. The 454 sequence data were trimmed for quality, and only full-length sequences covering the entire antibody variable domain, the FV region consisting of all three complementarity determining regions (CDRs) along with the frameworks (FRs), were retained (see additional files 5 and 6 for sequences in FASTA format). Sequence identities were calculated from the pairwise alignment of each 454 antibody sequence against the pertinent known DNA sequence using local BLAST as implemented in BioEdit v7.0.9 with additional parameters, -G-1 -q-1 -r2. The BLAST output data, comprising the start and end points of query and subject and the number of mismatches including or excluding gaps, were used to determine the different types of errors introduced during 454 pyrosequencing of antibodies.
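To illustrate how mismatch and gap counts from a pairwise alignment can be turned into error types, the following sketch tallies substitutions, insertions and deletions from a pair of gapped, equal-length alignment strings. It is a simplified illustration under stated assumptions, not the BioEdit/BLAST procedure used here: the function names, the use of "-" as the gap character, and the labels "correct" and "mixed" are all assumptions made for the example.

```python
def count_alignment_errors(aligned_ref, aligned_read):
    """Count error types column by column in a gapped pairwise alignment.

    Assumption: both strings have equal length and use '-' for gaps.
    A gap in the reference means the read carries an insertion;
    a gap in the read means the read carries a deletion.
    """
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned sequences must have the same length")
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    for ref_base, read_base in zip(aligned_ref.upper(), aligned_read.upper()):
        if ref_base == "-" and read_base == "-":
            continue  # column carries no information
        if ref_base == "-":
            counts["insertions"] += 1
        elif read_base == "-":
            counts["deletions"] += 1
        elif ref_base != read_base:
            counts["substitutions"] += 1
    return counts


def classify_read(counts):
    """Label a read by the kinds of error it contains (hypothetical labels)."""
    kinds = [kind for kind, n in counts.items() if n > 0]
    if not kinds:
        return "correct"
    if len(kinds) == 1:
        return kinds[0]
    return "mixed"  # two or more different error types in one read


if __name__ == "__main__":
    counts = count_alignment_errors("ACG-TTA", "ACGGTTA")
    print(counts, classify_read(counts))
```

Counting per alignment column keeps the sketch short; grouping consecutive gap columns into single indel events would be a reasonable refinement.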
The IMGT/HighV-QUEST analysis tool was used extensively to analyze the 454 antibody sequence data for (i) locating the insertion and deletion errors along the antibody variable domain, (ii) identifying the consequences of such errors, including the numbers of productive and unproductive genes, stop codons and no rearrangements, and (iii) correcting the antibody sequence errors, which can be achieved by selecting the option "search for insertions and/or deletions" among the advanced parameters available in the tool. The output results were stored in CSV files containing functionality information, a list of insertion and deletion errors, and the DNA as well as translated amino acid sequences after applying the appropriate sequence corrections. The algorithm and its implementation in the IMGT/HighV-QUEST tool for determining antibody functionality and correcting antibody sequence errors were described previously. Briefly, the frequencies of productive and unproductive sequences are calculated based on the absence of stop codons with in-frame junctions, and on stop codons with or without out-of-frame junctions, respectively. To correct the antibody sequence errors, the IMGT tool uses two alignment steps (Smith and Waterman algorithm) comparing the user sequence with the closest V genes and alleles in the database. This reveals the frameshifts caused by sequencing errors that need to be fixed. First, if insertions are detected, their locations are identified and they are excluded from the user sequence, as they are not compatible with the IMGT numbering. If deletions are detected, gaps are introduced in the user sequence to restore the IMGT numbering. After the insertion and/or deletion detection steps, the identification of the V gene and allele is performed again and the corrected sequences are provided. We also computed the nucleotide compositions for all reads from the two sequencing runs of antibody #1.
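The productive/unproductive rule summarized above can be sketched as follows. This is a deliberately simplified reading of that rule and not the IMGT/HighV-QUEST implementation: the function names are hypothetical, the junction frame is supplied by the caller as a boolean rather than derived from the sequence, and the "no rearrangement" category is not covered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(coding_seq):
    """Return True if an in-frame stop codon occurs, reading from position 0."""
    seq = coding_seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def classify_functionality(coding_seq, junction_in_frame):
    """Apply the simplified rule: productive only if the junction is in frame
    and no in-frame stop codon is present; otherwise unproductive.

    Assumption: `junction_in_frame` has been determined elsewhere.
    """
    if junction_in_frame and not has_stop_codon(coding_seq):
        return "productive"
    return "unproductive"


if __name__ == "__main__":
    print(classify_functionality("ATGGCCTGG", True))   # no stop, in frame
    print(classify_functionality("ATGTAAGCC", True))   # in-frame TAA
    print(classify_functionality("ATGGCCTGG", False))  # out-of-frame junction
```

Only codons in the reading frame starting at the first nucleotide are checked, so a stop triplet that appears out of frame (as in "ATGATAGCC") does not make a read unproductive in this sketch.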
Statistical calculations were carried out using SAS JMP9® statistical software (SAS Institute, Cary, NC) as well as Excel macros using the results as obtained from the BLAST and IMGT/HighV-QUEST. Graphical plots were made using Microsoft Excel 2010.
We thank the Laboratory of Molecular Technology of SAIC-Frederick Inc. for providing Roche 454 sequencing service. This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research, and the Gates Foundation to DSD, and by Federal funds from the NIH, National Cancer Institute, under Contract No. NO1-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does the mention of trade names, commercial products, or organizations imply endorsement by the U. S. Government.
- Glanville J, Zhai W, Berka J, Telman D, Huerta G, Mehta GR, Ni I, Mei L, Sundar PD, Day GM, Cox D, Rajpal A, Pons J: Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proceedings of the National Academy of Sciences of the United States of America. 2009, 106 (48): 20216-20221. 10.1073/pnas.0909775106.
- Boyd SD, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf B, Jones CD, Simen BB, Hanczaruk B, Nguyen KD, Nadeau KC, Egholm M, Miklos DB, Zehnder JL, Fire AZ: Measurement and Clinical Monitoring of Human Lymphocyte Clonality by Massively Parallel V-D-J Pyrosequencing. Science Translational Medicine. 2009, 1 (12): 12ra23. 10.1126/scitranslmed.3000540.
- Boyd SD, Gaëta BA, Jackson KJ, Fire AZ, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf B, Jones CD, Simen BB, Hanczaruk B, Nguyen KD, Nadeau KC, Egholm M, Miklos DB, Zehnder JL, Collins AM: Individual Variation in the Germline Ig Gene Repertoire Inferred from Variable Region Gene Rearrangements. Journal of Immunology. 2010, 184 (12): 6986-6992. 10.4049/jimmunol.1000445.
- Ilie L, Fazayeli F, Ilie S: HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics. 2011, 27 (3): 295-302. 10.1093/bioinformatics/btq653.
- Lassmann T, Hayashizaki Y, Daub CO: SAMStat: monitoring biases in next generation sequencing data. Bioinformatics. 2011, 27 (1): 130-131. 10.1093/bioinformatics/btq614.
- Nguyen P, Ma J, Pei D, Obert C, Cheng C, Geiger TL: Identification of errors introduced during high throughput sequencing of the T cell receptor repertoire. BMC Genomics. 2011, 12: 106. 10.1186/1471-2164-12-106.
- Kircher M, Kelso J: High-throughput DNA sequencing - concepts and limitations. Bioessays. 2010, 32 (6): 524-536. 10.1002/bies.200900181.
- Dimitrov DS: Therapeutic antibodies, vaccines and antibodyomes. Mabs. 2010, 2 (3): 347-356. 10.4161/mabs.2.3.11779.
- Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser. 1999, 41: 95-98.
- Alamyar E, Giudicelli V, Duroux P, Lefranc MP: IMGT/HighV-QUEST: A High-Throughput System and Web Portal for the Analysis of Rearranged Nucleotide Sequences of Antigen Receptors - High-Throughput Version of IMGT/V-QUEST. JOBIM. 2010, Paper 60.
- Brochet X, Lefranc MP, Giudicelli V: IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res. 2008, 36: W503-508. 10.1093/nar/gkn316.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.