A shot in the genome: how accurately do shotgun 454 sequences represent a genome?
© Meglecz et al.; licensee BioMed Central Ltd. 2012
Received: 29 November 2011
Accepted: 28 May 2012
Published: 28 May 2012
Next generation sequencing (NGS) provides a valuable method to quickly obtain sequence information from non-model organisms at a genomic scale. In principle, if sequencing is not targeted for a genomic region or sequence type (e.g. coding region, microsatellites) NGS reads can be used as a genome snapshot and provide information on the different types of sequences in the genome. However, no study has ascertained if a typical 454 dataset of low coverage (1/4-1/8 of a PicoTiter plate leading to generally less than 0.1x of coverage) represents all parts of genomes equally.
Partial genome shotgun sequencing of total DNA (without enrichment) on a 454 NGS platform was used to obtain reads of Apis mellifera (454 reads hereafter). These 454 reads were compared to the assembled chromosomes of this species in three different aspects: (i) dimer and trimer compositions, (ii) the distribution of mapped 454 sequences along the chromosomes and (iii) the numbers of different classes of microsatellites. Highly significant chi-square tests for all three types of analyses indicated that the 454 data is not a perfect random sample of the genome. Only the number of 454 reads mapped to each of the 16 chromosomes and the number of microsatellites pooled by motif (repeat unit) length was not significantly different from the expected values. However, a very strong correlation (correlation coefficients greater than 0.97) was observed between most of the 454 variables (the number of different dimers and trimers, the number of 454 reads mapped to each chromosome fragments of one Mb, the number of 454 reads mapped to each chromosome, the number of microsatellites of each class) and their corresponding genomic variables.
The results of chi square tests suggest that 454 shotgun reads cannot be regarded as a perfect representation of the genome especially if the comparison is done on a finer scale (e.g. chromosome fragments instead of whole chromosomes). However, the high correlation between 454 and genome variables tested indicate that a high proportion of the variability of 454 variables is explained by their genomic counterparts. Therefore, we conclude that using 454 data to obtain information on the genome is biologically meaningful.
To test the representativeness (i.e. the probability of sequencing any given part of the genome is proportional to its size) of 454 reads obtained by partial genome shotgun sequencing of total DNA without enrichment, 454 reads of the honey bee (Apis mellifera) were compared to its genome assembly Amel 4.0 . Genome sequences refer to the assembled chromosomes of A. mellifera DH4 strain (a total of 217 Mb) downloaded from the genome section of Genbank. It is essential to note that we have used the genome assembly of A. mellifera, as a reference, but this is not a gold standard. Most genome assemblies are not complete and the uncompleted parts usually contain repetitive regions. Therefore it is difficult to know what the real composition of the genome is, especially for repetitive regions.
The total length of 454 reads were 10.4 Mb, which gives 0.048 fold coverage. This is undoubtedly very low for genome sequencing and assembly, but it is of typical order of magnitude for many non-model species, where the aim is not to sequence the whole genome but either to obtain a snapshot of the genome  or simply have sequence data to establish molecular markers [3, 4]. The aim of this paper was to test how well a limited amount of DNA sequence can be used for estimating genome composition, therefore we did not eliminate any part of the genome such as repetitive or low complexity regions from the analyses. Although, in our case, a genome guided assembly would have been feasible, this is not the situation for non-model organisms. Since we are focusing on the suitability of 454 genome shotgun sequencing to obtain information on genome composition in non-model organism we did not use genome guided assembly. Additionally, we have chosen not to use de novo assembly of the 454 reads since, due to the low coverage and single end library, the assembly of repetitive region is particularly difficult. This can introduce a bias either by leaving the repetitive regions non-assembled, thus leading to overrepresentation of these regions in 454 reads or on the other extreme by eliminating some of the repetitive DNA that are highly similar, but represent different loci. Gomez-Alvarez et al. found that a non-negligible portion of the 454-based sequencing reads are artificial replicates in bacterial metagenomes . This potential artefact of 454 data is likely to be due to amplified DNA attaching to empty beads during emulsion PCR creating multiple reads from a single template . This artefact is very limited in our dataset, where only 1.48% of the reads appear to be part of a cluster, and the largest cluster contains only three reads (evaluated by 454 Replicate Filter ; cutoff 0.95, length requirement 0 and initial base pair match 5). Since duplicates are rarely removed in a typical 454 sequence analysis and the sequences that form clusters arise randomly , we did not eliminate potentially duplicated reads.
Three different types of comparisons were used in the analyses: (i) counts of observed di- and trinucleotide sequences in both the 454 data and in the complete genomes (variables: number of dimers and number of trimers), (ii) the distribution of the 454 sequence reads along the chromosomes (variables: number of 454 reads mapped to each chromosome fragments of one Mb, number of 454 reads mapped to each chromosome) and (iii) counts of different microsatellite classes as defined by their motif, repeat unit length and repeat number, and repeat unit length only (variables: number of microsatellites of each class). Dimer, trimer and microsatellite frequencies were chosen for the analyses due to their straightforward and unambiguous identification in the 454 data without the need of preliminary genome information like mRNA, protein or transposable element sequences.
DNA from A. mellifera mellifera from Nimes (France) was sheared by sonication. 1 μg of purified DNA was used for 454 FLX Titanium library (Roche Applied Science) preparation, according to the manufacturer’s protocols, at Genoscreen (Lille, France). Emulsion PCR (emPCR) was carried out at a ratio of 1 copy per bead, with subsequent disruption with isopropanol. Beads containing amplified DNA fragments were enriched and recovered for sequencing, to provide 50,000 to 70,000 enriched beads for each library. The recovered ssDNA beads were packed onto region 1/8 of a 70 mm × 75 mm Titanium PicoTiter plate and sequenced with 200 cycles. Sample preparation and analytical processing, such as base calling and filtering, were performed at Genoscreen (Lille, France), according to the manufacturer’s protocol for the Titanium series and Genome Sequencer FLX System Software (v2.3).
Reads were deposited to Short Read Archives of NCBI (SRS150264.1, 10.4 Mb of total length, 37870 reads).
In order to establish the link between variables derived from 454 data and from their corresponding genomes, Pearson’s correlation tests were performed except when the variables deviated from normal distribution (as determined by a Shapiro-Wilk normality test). In these latter cases, Spearman’s rank correlation tests were used. When these tests resulted in significant correlation, we determined if the distribution of the variables were identical by performing chi-square tests. If the expected number of observations were greater than five in all classes, the P-value was obtained from χ 2 distribution. Otherwise, a permutation test (N = 2000) was performed. To visualise the contribution of each element to the deviation between expected and observed values in the chi-square tests, barplots were created with the chi-square values of each element multiplied by the sign of the difference: negative for underrepresentation- and positive for over-representation. All statistical analyses were performed with R .
Dimer and trimer composition
Results of correlation and Chi-square tests between 454 and genome variables
Number of dimers
Number of trimers
Number of dimers without AA/TT
Number of trimers without AAA/TTT
Number of 454 reads per fragment
Number of 454 reads per chromosome
Number of different microsatellite types
(pooled by repeat unit length and repeat number)
Number of microsatellites with di-trinucleotide motifs
(pooled by motif)
Number of different microsatellite types
(pooled by repeat unit length)
Aird et al. have found that a PCR amplification step in Illumina sequencing introduces a considerable downward bias of regions with extreme A/T compositions . Although the same phenomenon has not been studied in 454 sequencing, since it also includes a PCR amplification, it is possible that A/T or G/C rich regions are less well sequenced than the rest of the genome. This is of particular interest, since the genome of A. mellifera is particularly A/T rich : 65% in mapped contigs and even higher in unmapped ones. Since a particular effort has been made to include extreme AT rich regions in the genome sequencing  the underrepresentation of A/T homopolymer motifs in the 454 reads could also come from this potential bias. However, the count of other A/T-rich non-homopolymer dimer and trimer motifs in the 454 reads do not deviate strongly from the genome counts [see Figures S1 and S2 in Additional file 1, thus the 454 specific homopolymer length bias is a more likely explanation for the underrepresentation of the A/T homopolymers.
At a genomic scale the very large number of observations renders chi-square tests extremely powerful to detect differences between 454 and genome distributions. It is undeniable that the 454 data does not reflect perfectly the genome compositions regarding dimer and trimer motifs, as evidenced by the significant results of the chi-square tests, however the very high correlation coefficient between 454 and genome data indicates that 0.972 =0.94 of the variability of genome variables (here counts of dimer or trimer motifs) is explained by their 454 counterparts. Therefore, in spite of the statistically significant differences detected by the chi-square tests, we conclude that using 454 data to obtain information on the genome is biologically meaningful.
Distribution of the 454 sequence reads along the chromosomes
The 37,870 reads obtained from a 454 run with a total length of 10.4 Mb were BLASTed (e = 1e-20, no dust filter, default parameters otherwise) to the genome (183.3 Mb without gaps). No filtering was done for low complexity sequences in order to be able to retain repetitive and low complexity sequences. For each 454 read the best hit was selected and accepted only if (i) the hit covered at least 80% of the 454 read and (ii) the identity over the aligned region was higher than 90%. When there were more than one hit with the same e-value a single hit was selected randomly. The best hit position on the chromosomes was recorded and the number of hits was counted for each Mb fragment of chromosome. These chromosome fragments did not overlap. If a 454 read was mapped on the junction of two fragments it was attributed to the first. Since the genome assembly contain a non-negligible proportion of gaps (Ns), for each chromosome fragment a number of the informative bases (not Ns) were counted and used to calculate the corrected number of hits for each Mb fragment (Number of 454 mapped into the fragment/Number of informative bases in fragment*1000000).
After the above described filtering, 29,862 reads (78.9% of the total number of reads) were mapped to the chromosomes. A possible explanation for this relatively low mapping success is a high proportion of gaps in the current A. mellifera assembly (16.1%).
A possible explanation of the overrepresentation of a few fragments is that some repetitive regions are missing from the genome assembly. If only a few loci of a transposable element is included in the assembly, while many others are in the non-assembled part of the genome, 454 reads of these later loci can be erroneously mapped into the assembled loci, and cause overrepresentation of the region. Hiller et al. have shown that the overrepresentation of the only copy of an rDNA locus in Caenorhabditis elegans assembled genome is the result of the presence of several unmapped rDNA loci . The genome of A. mellifera contains an unusually low number of transposons and retrotransposons (constituting only 1% of the assembled genome), most of which are members of the Mariner family . Genomic screens for highly repetitive elements identified 3% of the genome, mostly mapping to telomeres and centromeres of the chromosomes by in situ mapping [11, 12]. This implies that the apparent heterogeneity among the chromosome regions can be in fact due to technical artefact of mapping, and representativeness of the different regions are better in reality than we could expect it from our tests.
Although the dimer and trimer composition of the 454 reads does not seem to support bias in base composition in the 454 reads, we have tested if the base composition of each fragments (GC%) correlated with the corrected number of mapped hits. Spearman’s rank correlation test indicated a highly significant correlation (P = 1.12e-10), but a low correlation coefficient (0.415) indicates that a high proportion of the variability of the 454 read distribution cannot be explained by the G/C content of the fragment. Nevertheless the general trend is that fewer reads are mapped to fragments with low G/C content, which is likely to be due to amplification bias during the emulsion PCR step of sequencing . A similar phenomenon was also observed in Solexa sequencing that also contains a PCR step before the actual sequencing .
At last, it is possible that the general background heterogeneity among chromosome fragment can be a result of the low coverage, and is thus due to chance events.
Microsatellites are commonly used as molecular markers, thus are of interest to the scientific community. Apis mellifera has a large number of microsatellites compared to other insects , and indeed linkage map based on microsatellites provided valuable assistance in the genome assembly .
Pearson’s correlation tests were performed on the number of microsatellites of different classes between the 454 data and its corresponding genome, and a highly significant correlation was detected in all three kinds of pooling (Table 1) [see Figures S3 and S4 in Additional file 1]. However, by pooling microsatellites by motif or by repeat unit length and repeat number, the chi-square tests still showed significant differences of the microsatellite class distribution between the 454 data and the corresponding genome (Table 1). Most of the deviation came from the under-representation of long (>8 repeats) microsatellites with a dinucleotide motif when pooling by repeat unit length and repeat number [see Figure S5 in Additional file 1], and by the overrepresentation of the AAC motif when pooling by motif [see Figure S6 in Additional file 1]. Thus, the possible downwards bias of A/T rich regions suggested by the analysis of the mapped 454 read distribution is not detected in the microsatellite composition.
By pooling microsatellite numbers by repeat unit length, the chi-square test became non-significant (χ 2 = 8.7, df = 4, P = 0.0685). This indicates that, although 454 data does not reflect perfectly the microsatellite composition of the genome if microsatellites are pooled at a fine scale, by pooling data into larger categories, 454 data gives a good approximation of the genome content.
Next generation sequencing has increased the quantity of sequences by several orders of magnitudes compared to pre-genomics area. This high number of observations increases the power of statistical tests which become very sensitive to detect small differences among most of the studied variables. What does this mean practically to a biologist? Where do the differences between genome assembly and 454 shotgun come from? There are five potential causes: (i) the low coverage of the 454 reads increases the probability that 454 reads differ from the genome assembly by chance alone; (ii) The assembled genome is a DH4 strain of primarily A.m. ligustica, while we have 454 reads of A.m.mellifera. These subspecies are two of the most distantly related among all subspecies of A. mellifera. The genetic variation between these subspecies may be the cause of some of the observed differences between the genome and the 454 data, but is unlikely to introduce a systematic bias; (iii) The emulsion PCR has been shown to amplify sequences with extreme C/G content less efficiently . This can explain that A/T rich chromosome fragments are generally less well covered by 454 reads in our analyses. This reduced PCR efficiency could also lead to the underrepresentation of A/T rich motifs (both in microsatellites and genome-wise counts), but since it is not detected in our analyses this source of error is unlikely to be important; (iv) Homopolymer length bias characteristic of 454 sequencing is likely to be the reason of A/T homodimer and homotrimer bias observed in our analyses; (v) The incomplete genome assembly leads to a possibility of erroneously mapping repetitive elements to the assembled region, which thus appear to be overrepresented. Therefore, in this respect 454 shotgun sequences might provide a better estimation of repetitive DNA content than the genome assembly.
Can we use 454 data to infer information on the genome composition? No doubt that the 0.05x coverage 454 data is not a random sample of the assembled part of the genome. However, the very strong correlation between 454 and genome variables and a lack of significant deviation for pooled data implies that 454 sequences provide a useful approximation of the genomic content. Furthermore, increasing the genome coverage is likely to reduce random sampling errors, thus could provide an ever better estimation.
We thank Gabriel Nève, Terry Bertozzi and Steve Donnellan for discussions.
- The Honeybee Genome Sequencing Consortium: Insights into social insects from the genome of the honeybee Apis mellifera. Nature. 2006, 443: 931-949. 10.1038/nature05260.PubMed CentralView ArticleGoogle Scholar
- Rasmussen DA, Noor MA: What can you do with 0.1× genome coverage? A case study based on a genome survey of the scuttle fly Megaselia scalaris (Phoridae). BMC Genomics. 2009, 10: 382-10.1186/1471-2164-10-382.PubMedPubMed CentralView ArticleGoogle Scholar
- Malausa T, Gilles A, Meglécz E, Blanquart H, Duthoy S, Costedoat C, Dubut V, Pech N, Castagnone-Sereno P, DéLye C, Feau N, Frey P, Gauthier P, Guillemaud T, Hazard L, Le Corre V, Lung-Escarmant B, Malé P-JG, Ferreira S, Martin J-F: High-throughput microsatellite isolation through 454 GS-FLX Titanium pyrosequencing of enriched DNA libraries. Mol Ecol Resour. 2011, 11: 638-644. 10.1111/j.1755-0998.2011.02992.x.PubMedView ArticleGoogle Scholar
- Gardner MG, Fitch AJ, Bertozzi T, Lowe AJ: Rise of the machines - recommendations for ecologists when using next generation sequencing for microsatellite development. Mol Ecol Resour. 2011, 11: 1093-1101. 10.1111/j.1755-0998.2011.03037.x.PubMedView ArticleGoogle Scholar
- Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3: 1314-1317. 10.1038/ismej.2009.72.PubMedView ArticleGoogle Scholar
- Briggs AW, Stenzel U, Johnson PLF, Green RE, Kelso J, Prufer K, Meyer M, Krause J, Ronan MT, Lachmann M, Paabo S: Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci. 2007, 104: 14616-14621. 10.1073/pnas.0704665104.PubMedPubMed CentralView ArticleGoogle Scholar
- Gilles A, Meglécz E, Pech N, Ferreira S, Malausa T, Martin J-F: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics. 2011, 12: 245-10.1186/1471-2164-12-245.PubMedPubMed CentralView ArticleGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.PubMedPubMed CentralGoogle Scholar
- Aird D, Ross MG, Chen W-S, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12: R18-10.1186/gb-2011-12-2-r18.PubMedPubMed CentralView ArticleGoogle Scholar
- Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Whole-genome sequencing and variant discovery in C. elegans. Nat Meth. 2008, 5: 183-188. 10.1038/nmeth.1179.View ArticleGoogle Scholar
- Beye M, Moritz RFA: Characterization of Honeybee (Apis mellifera L.) Chromosomes Using Repetitive DNA Probes and Fluorescence in situ Hybridization. J Hered. 1995, 86: 145-150.PubMedGoogle Scholar
- Sahara K, Marec F, Traut W: TTAGG telomeric repeats in chromosomes of some insects and other arthropods. Chromosome Res. 1999, 7: 449-460. 10.1023/A:1009297729547.PubMedView ArticleGoogle Scholar
- Pannebakker BA, Niehuis O, Hedley A, Gadau J, Shuker DM: The distribution of microsatellites in the Nasonia parasitoid wasp genome. Insect Mol Biol. 2010, 19: 91-98.PubMedView ArticleGoogle Scholar
- Solignac M, Zhang L, Mougel F, Li B, Vautrin D, Monnerot M, Cornuet J-M, Worley KC, Weinstock GM, Gibbs RA: The genome of Apis mellifera: dialog between linkage mapping and sequence assembly. Genome Biol. 2007, 8: 403-10.1186/gb-2007-8-3-403.PubMedPubMed CentralView ArticleGoogle Scholar
- Zayed A, Whitfield CW: A genome-wide signature of positive selection in ancient and recent invasive expansions of the honey bee Apis mellifera. Proc Natl Acad Sci USA. 2008, 105: 3421-3426. 10.1073/pnas.0800107105.PubMedPubMed CentralView ArticleGoogle Scholar
- Whitfield CW, Behura SK, Berlocher SH, Clark AG, Johnston JS, Sheppard WS, Smith DR, Suarez AV, Weaver D, Tsutsui ND: Thrice Out of Africa: Ancient and Recent Expansions of the Honey Bee, Apis mellifera. Science. 2006, 314: 642-645. 10.1126/science.1132772.PubMedView ArticleGoogle Scholar