False positives complicate ancient pathogen identifications using high-throughput shotgun sequencing
BMC Research Notes volume 7, Article number: 111 (2014)
Identification of historic pathogens is challenging since false positives and negatives are a serious risk. Environmental non-pathogenic contaminants are ubiquitous. Furthermore, public genetic databases contain limited information regarding these species. High-throughput sequencing may help reliably detect and identify historic pathogens.
We shotgun-sequenced 8 16th-century Mixtec individuals from the site of Teposcolula Yucundaa (Oaxaca, Mexico) who are reported to have died from the huey cocoliztli (‘Great Pestilence’ in Nahautl), an unknown disease that decimated native Mexican populations during the Spanish colonial period, in order to identify the pathogen. Comparison of these sequences with those deriving from the surrounding soil and from 4 precontact individuals from the site found a wide variety of contaminant organisms that confounded analyses. Without the comparative sequence data from the precontact individuals and soil, false positives for Yersinia pestis and rickettsiosis could have been reported.
False positives and negatives remain problematic in ancient DNA analyses despite the application of high-throughput sequencing. Our results suggest that several studies claiming the discovery of ancient pathogens may need further verification. Additionally, true single molecule sequencing’s short read lengths, inability to sequence through DNA lesions, and limited ancient-DNA-specific technical development hinder its application to palaeopathology.
Diseases have ravaged human populations throughout history. The pathogens responsible for many historic epidemics remain either unknown or speculative (e.g. the Plague of Athens [1, 2]). Identification of these historic pathogens is critical to understanding disease evolution, which in turn has direct impacts on the development of effective medical treatments for these conditions. The discovery that ancient pathogen-diagnostic biomolecules may survive in archaeological bones and the development of high-throughput DNA sequencing technologies have opened new possibilities for determining past disease [3–14].
Most biomolecular palaeopathological studies have been based on the polymerase chain reaction (PCR) (e.g. [3–7]). Supplementary methods include lipid and immunological assays [6–12]. Due to the limitations of molecular preservation and detection, negative assays do not constitute evidence of absence of any infection [15–17]. False positives are a serious concern since bacteria, viruses and fungi are ubiquitous and the specificity of molecular tests may not be able to distinguish between target disease agents and their closely related relatives [18–20]. Many PCR-based palaeopathological reports are contentious due to insufficient anti-contamination procedures or lack of experimental and analytical rigor [21–23].
Furthermore, PCR-based methods are limited in their ability to identify unknown pathogens since they require candidate disease agents of known sequence in advance. Although there are a few studies that attempt such de novo identification (e.g. [5, 24]), the vast majority of investigations have been restricted to targeted, likely diseases. While DNA capture methods, including in-solution [e.g.  and array-hybridization [e.g.  capture, have become the gold standard for palaeopathogenomics, no such technology is yet available for palaeopathogen diagnosis. In theory, it is possible to identify unknown diseases by high-throughput DNA sequencing of known afflicted individuals and comparing these pools with suitable controls (such as the surrounding soil and uninfected archaeologically related individuals). Organisms found only in the infected population would thus be candidates for the responsible pathogens. Nevertheless, since the majority of sequences in most ancient DNA samples are derived from the environment (particularly soil bacteria), false positives could remain problematic in high-throughput analyses [23, 25]. Moreover, the public genetic databases are both biased towards pathogenic cellular organisms and deficient in virus sequences, thereby increasing the false positive rate for cellular pathogens while simultaneously raising the false negative rate for viral infections in de novo identifications.
Here we show that false positives are a significant error source in palaeopathological analyses using high-throughput sequencing. Using two platforms (the Helicos BioSciences HeliScope and the Illumina HiSeq 2500), we attempted to identify de novo the pathogen responsible for the huey cocoliztli (‘Great Pestilence’ in Nahautl), a disease that ravaged the native Mexican population during the Spanish colonial period .
In order to identify the huey cocoliztli pathogen, we analyzed the site of Teposcolula Yucundaa (Oaxaca, Mexico), a large Mixtec city at the time of Spanish colonization . The site was abandoned in 1552 following a major outbreak of the huey cocoliztli in the 1540 s. Excavations at Teposcolula Yucundaa conducted between 2004 and 2006 revealed two Mixtec cemeteries: a colonial period graveyard (the Grand Plaza) and a smaller precolonial one (the Churchyard) . Warinner and colleagues  identified the Grand Plaza graveyard as a plague pit corresponding to the 1540 s pandemic of the huey cocoliztli. The precontact Churchyard population is assumed to have been uninfected with the huey cocoliztli since the native Mexican populations were unfamiliar with the disease at the time of its first appearance during the colonial period . DNA preservation at Teposcolula Yucundaa is exceptional, probably due to the high-altitude, cool environment and relatively recent date .
Twelve femoral cortical bone specimens (each representing a unique individual; Table 1) were collected in the field and ground to a fine power (described in ). DNA extractions were performed in a dedicated ancient DNA laboratory in the Department of Human Evolutionary Biology, Harvard University according to standard anti-contamination protocols . Approximately 1 g of bone powder per individual was decalcified in 0.5 M EDTA, pH 8.0. Raw DNA extract was passed through a vacuum filter to remove residual protein and powder and then concentrated to ~500 μl via ultrafiltration using Vivaspin® 20’s (Vivaproducts) with a 10 kDa molecular weight cut-off. Concentrated extracts were treated with proteinase K. Protein-digestion completion was verified using a Qubit® 2.0 fluorometer (Life Technologies). Digested extracts were then purified using QIAquick PCR Purification Kits (Qiagen). Bulk DNA extracts consisting of 4 individuals per bulk were constructed for the Grand Plaza (individuals TP02, TP10, TP15, TP26) and Churchyard (individuals TP32, TP42, TP45, TP48) populations. Bulking the samples increased the likelihood of detecting the pathogen since only a fraction of the infected individuals are expected to have endogenous disease DNA preserved [e.g. [13, 14].
Three samples of soil from the Grand Plaza and the Churchyard cemeteries were also collected. Soil was collected from the burial contexts (within a few centimeters of the skeletons) by trowelling samples directly into collection bags while wearing gloves to limit DNA contamination. DNA was extracted from 5–8 g of soil per sample using the PowerMax® Soil DNA Isolation Kit (Mo-Bio Laboratories). Purified extracts from each burial were then combined into a bulk sample.
Finally, subsamples of all three bulk extracts (Churchyard, Grand Plaza and soil) were sheared to 150 bp average length using a S220 Focused-ultrasonicator (Covaris, Inc.) for subsequent true single molecule sequencing (tSMS). Although shearing DNA extracts is typically denigrated in high-throughput ancient DNA analyses (e.g. ), the sequences generated by tSMS are shorter (typically <40 bp) than the endogenous DNA molecules known to be present in the Teposcolula Yucundaa individuals by PCR (~100 bp)  and Bioanalyzer assays (see below). Since tSMS only generates one sequence per molecule and has limited ability to sequence through DNA lesions, shearing might make more of the endogenous DNA sequenceable (but at the cost of a possible increase in microbial contamination) by producing multiple 3′-termini per original molecule. Additionally, six extracts representing single individuals (TP04, TP09, TP10, TP18, TP37, TP48) and a subsample of the soil bulk extract were sheared to 150–200 bp average length for subsequent Illumina library construction using the automated Apollo 324 (IntegenX) platform.
DNA concentrations for all individual extracts, bulks and sheared bulks were calculated using a high-specificity DNA kit on a Qubit® fluorometer. Length distributions of DNA molecules in these samples were calculated using a high-specificity DNA chip on an Agilent 2100 Bioanalyzer (Agilent Biotechnologies).
Helicos HeliScope sequencing
A total of 12 channels of tSMS was performed on a Helicos HeliScope sequencer at Helicos BioSciences (Cambridge, MA). Eight μl per channel of each bulk and sheared bulk were prepared for tSMS according to the standard protocol for ancient samples . Two channels of each bulk and one channel of each sheared bulk were sequenced. Additionally, samples of the Churchyard and Grand Plaza bulk extracts were treated with Antarctic Phosphatase (New England Biolabs), diluted to the equivalent concentration of their untreated counterparts and prepared for HeliScope sequencing as described in . Phosphatase-treatment may increase HeliScope sequence yield . Two channels of each phosphatase-treated bulk were sequenced. The bulks and phosphatase-treated bulks were sequenced on one chip, while the soil and sheared bulks were sequenced on another. For the unsheared Churchyard and Grand Plaza bulks (both phosphatase-treated and untreated samples), 23–31 million quality-controlled sequences were generated per channel (Table 2). There were technical issues with the sheared and soil samples (see below), so only between 4,000 and 840,000 reads per channel were obtained for these four channels (Table 2).
Illumina HiSeq 2500 sequencing
Initial tSMS results (see below) revealed very high microbial species diversity in the bone and soil samples. The very short sequence length hampered accurate identification of these species. Therefore, we constructed Illumina libraries from six bone extracts representing single individuals and the bulk soil sample using the PrepX Illumina Kit (IntegenX) on the Apollo 324 according to manufacturer’s instructions using NEXTflex™ DNA barcodes (BioO Scientific). The indexed libraries were then enriched by 13 cycles of PCR using the NEXTflex™ kit. Library qualities were confirmed via fluorometric quantitation (Qubit®), analysis on a high-specificity DNA chip (Agilent 2100) and quantitative PCR using the KAPA Library Quantification Kit – Illumina/Universal (KAPA Biosystems). The libraries were then pooled in equimolar ratios, and paired-end 150 bp reads were generated on the Illumina HiSeq 2500 platform. Initial quality control and demultiplexing was performed using CASAVA 1.8.2. Paired-end reads were merged using PANDAseq 2.4.0 . Adapter artifacts and PCR duplicates were removed using TagDust 1.12  and CD-HIT 4.6 . The library sequencing qualities were checked using FastQC 1.32 . The final data sets contained 3.7–9.6 million quality-controlled reads per library (Table 2).
Sequence file formats were manipulated using SAMtools 0.1.18 . Sequence data sets were aligned against reference genomes of interest (Additional file 1, see Results and discussion below) using BWA 0.6.2 [39, 40] according to the recommended settings in . Sequences were also aligned against the GenBank® non-redundant nucleotide database using megaBLAST (BLAST 2.2.25+)  and analyzed in MEGAN 4.70.4 . Statistical analyses were performed in R 2.15.3  or using Biopieces  and custom scripts.
Results and discussion
Human DNA and authenticity
Alignment of the samples against the human (Homo sapiens) reference genome (GRCh37.p11) revealed that 0.66%–6.40% of the HeliScope reads in each channel and 0.00%–0.42% of Illumina sequences in each library corresponded to human DNA (Table 3, Additional file 1). The HeliScope endogenous human DNA quantifications are probably an underestimate of the true human DNA content since short sequences with low information contents are unlikely to map against the reference genome with sufficient confidence to be assigned (See ‘Soil Complexity’ below). Furthermore, the mean lengths (~25 bp) of the sequences mapped to the human genome were shorter than those (~34 bp) of all the sequences (mapped and unmapped included) in each channel (one-tailed t-test, p < 0.00001, Tables 2 and 3). As in , analysis of the mapped reads’ substitution artifacts using mapDamage2.0  showed a uniform distribution of C → T and G → A transitions along the molecules. These substitutions typically accumulate at the 5′- and 3′-termini of ancient DNA molecules respectively . These results suggest that the HeliScope chemistry cannot sequence through uracil lesions effectively, thereby shortening the recovered sequences of the endogenous ancient molecules and producing a uniform transition profile.
Due to the HeliScope sequences’ short read lengths, false positives due to the presence of contamination from closely related species (e.g. rodents) were a concern. Therefore, we also aligned the HeliScope data sets against the rat (Rattus norvegicus) genome (Rnor_5.0) and the chimpanzee (Pan troglodytes) genome (Pan_troglodytes-2.1.4) (Additional file 1). This ad-hoc test revealed a significant gradient of increasing numbers of hits against the genomes more closely related to humans (Kruskal-Wallis test, p = 0.0004609) with a concomitant increase in mean hit length in the more closely related species (Kruskal-Wallis test, p = 0.007629), suggesting that the HeliScope human-aligned reads derived from humans (Additional file 2).
Human DNA (1.12%–2.35% of all reads) was also identified in the soil HeliScope data sets, raising the possibility of laboratory contamination affecting our data. Nevertheless, analysis of the same soil bulk sample via Illumina sequencing revealed only 37 sequences (0.00%) that matched humans in the soil at the longer lengths (150–298 bp) obtained in the Illumina data. Comparatively, the individual bones had 300–27,430 human DNA sequences (0.01–0.42%) at equivalent lengths (Table 3). This indicates that the human DNA in the soil has undergone more degradation than that in the bone, a result inconsistent with the laboratory contamination hypothesis . Moreover, although mitochondrial haplotype profiles were incomplete due to absence of a mitochondrial DNA enrichment step, the individual haplotypes (based on the Illumina sequences) generally agreed with those found in , although a high rate of nucleotide misincorporations was observed (Additional file 3). Analysis of these substitutions using mapDamage2.0 showed the characteristic ancient DNA pattern, with higher rates of C → T and G → A transitions at the 5′- and 3′-termini respectively . This suggests that the human DNA in the bone samples is endogenous. It is possible that the soil human DNA derives from archaeologists in the field, although this is relatively unlikely since the bones and soil were collected while wearing gloves. Since the site was abandoned shortly after the huey cocoliztli pandemic , the most likely source of human DNA in the soil is not from later occupants of the site, but rather the numerous decomposing dead (estimated to at least 800 individuals) interred nearly simultaneously in the Grand Plaza cemetery. Detection of leached DNA is especially likely since the soil samples were collected immediately adjacent to the interred skeletons. If endogenous human DNA leached into the soil, it is possible that DNA from the huey cocoliztli pathogen may also have. However, since pathogen DNA is typically much rarer than host DNA [e.g. , this level of DNA leaching would probably be undetectable using current methodologies.
Based on historical descriptions of the huey cocoliztli, likely candidate pathogens include pneumonic plague (Yersinia pestis), typhus and other forms of rickettsiosis (Rickettsia spp.), smallpox or alastrim (Variola spp.), and viral hemorrhagic fevers (VHF), such as Dengue and Yellow Fever [26, 28, 48, 49]. Measles (Morbillivirus Measles virus), dysentery, influenza (Influenzavirus spp.), pneumonia and pleurisy have also been suggested, although these are less likely diagnoses since the historical descriptions of the huey cocolitzli symptoms do not match these diseases’ extant presentations . Since dysentery, pneumonia and pleurisy can be caused by a wide range of organisms and are unlikely diagnoses for the huey cocoliztli, we did not analyze them further via the candidate genome approach described here. Additionally, VHFs, measles and influenza are caused by RNA viruses, which are unlikely to survive in the archaeological record due to RNA’s instability and the ubiquity of RNases . Nevertheless, some have reported the successful amplification and sequencing of RNA viruses from medical archival preserved tissue dating back several decades [51, 52], and Fordyce and colleagues  reported high-throughput sequencing of RNA transcripts in 700-year-old maize kernels. Moreover, the HeliScope will sequence RNA directly (even using DNA settings for the machine as conducted here albeit with reduced efficiency), so there remains a finite detection possibility for RNA viruses.
A substantial proportion of the unsheared HeliScope reads (0.01%, 2198–3880 reads per channel) matched the Yersinia pestis genome in both the Grand Plaza and Churchyard populations (Additional file 1). This also matched the proportion of putative Y. pestis reads found in the HeliScope-sequenced soil data set (0.01%, 63 reads) (Table 4). BLAST analysis of the mapped reads in MEGAN showed that these identifications were non-specific, with the sequences matching a wide variety of organisms, suggesting these mapped reads are false positives (Additional file 4). According to the BLAST identifications, only 0.29% and 2.0% of the reads mapped to the Yerinia pestis genome by BWA matched the Enterobacteriaceae and Gammaproteobacteria respectively, Yersinia’s family and class.
Alignment of the unsheared Churchyard and Grand Plaza bulk HeliScope data against typhus (Rickettsia prowazekii and Rickettsia typhi) genomes found that 0.00%–0.01% (1,210–2,089 reads) corresponded to these species (Table 5, Additional file 1). The identification of Rickettsia in bone matches its prevalence in the HeliScope-sequenced soil sample (0.00%, 30 reads). It is possible that the putative Rickettsia DNA in the soil derives from leaching from the bone. However, this explanation is improbable since the quantity of leached pathogen DNA is expected to be too small to detect given the low amounts of soil human DNA [e.g. . Moreover, there were no differences between the Grand Plaza and Churchyard graveyards in terms of these species’ prevalences or lengths of the aligned sequences (30 ± 6 bp). Only one read corresponding to Rickettsia was found in the Illumina data sets. BLAST analysis of the Rickettsia hits showed that these sequences derived from various soil organisms rather than bona fide pathogens (Additional file 5). According to BLAST analyses, no reads matched the Rickettsiales and only 1.3% belonged to the Alphaproteobacteria order. We thus find no evidence that rickettsiosis is responsible for the huey cocoliztli.
A negligible number (11–30 reads per channel in the unsheared HeliScope bone data sets, none in the Illumina data sets) of reads mapped onto the Variola genome, the one candidate DNA virus (Table 6, Additional file 1). Alignment against the RNA virus candidates, including all four primary families of VHF (Arenaviridae, Bunyaviridae, Filoviridae, and Flaviviridae), measles and influenza, also yielded negative results (<120 reads per channel per genome, Tables 7,8,9,10,11 and 12, Additional file 1).
In conclusion, alignment of the HeliScope and Illumina data sets compared against the candidate disease reference genomes produced inconclusive results. While it is possible that the huey cocoliztli is not preserved in femoral cortical bone, historical documents describe a systemic infection with symptoms including hemorrhage and ulceration in multiple organs . It is therefore likely that the pathogen would be preserved in all vascularized tissues. Additionally, due to the instability of the RNA molecule and the paucity of the virus sequence databases, negative viral results are currently difficult to evaluate. Our results demonstrate that false positives are a serious problem for analyses identifying molecules via alignment against reference genomes and for analyses that omit sequencing archaeological controls. Nevertheless, capture of complete species-specific diagnostic sequences and genomes (e.g. [13, 14]) may be a viable method for isolating and verifying ancient pathogen DNA in the absence of these controls.
Currently, few high-throughput palaeopathogen DNA analyses have been conducted. The majority of studies have been PCR-based, an approach whose limitations are well documented [e.g. [21–23]. Although it is difficult to compare different archaeological collections and data sets in terms of the likelihood of ancient pathogen recovery, our data are illustrative of the challenges encountered in high-throughput metagenomic pathogen studies. They emphasize the level of analytic rigor and proof required to authenticate ancient pathogen analyses. In light of our findings, some previous metagenomic ancient DNA research claiming the discovery of pathogen-derived DNA may need further verification. For instance, although Thèves and colleagues  and Khairat and colleagues  report identifying Bordetella sp., Streptococcus pneumoniae and Shigella dystenteriae and Plasmodium falciparum and Toxoplasma gondii respectively, neither group sequenced archaeological controls such as soil or mummy wrappings. Thèves and colleagues amplified and sequenced microbial barcode 16S genes, increasing their results’ reliability since these genes are well characterized across a wide variety of species. Khairat and colleagues, however, based their apicomplexan identifications on MEGAN analysis, with no attempt to evaluate the species-specificity of the sequenced molecules. These identifications are therefore suspect since the genetic databases are skewed towards pathogenic members of this lineage. Additionally, a recent study by Chan and colleagues  claiming the identification of multiple strains of pathogenic tuberculosis (Mycobacterium tuberculosis) through non-targeted metagenomic sequencing has demonstrated insufficient analytical rigor to support their conclusions. The authors aligned their sequences against a single strain of pathogenic tuberculosis, but did not account for misalignments or environmental contamination with ubiquitous soil mycobacteria. Chan and colleagues’ data merit reanalysis with appropriate environmental controls. We recommend that the authors of these three studies demonstrate the veracity of their findings using a targeted capture approach and further bioinformatic analysis.
Examination of the Teposcolula soil DNA in MEGAN revealed that the microenvironment is complex, making identification of species-of-interest difficult. While there were significant differences in the relative frequencies of microorganisms between the Grand Plaza and Churchyard samples (Additional file 6), these differences corresponded to variation in the distribution of environment-derived organisms across the site rather than pathogen-related frequency variation. For instance, we found significant differences in the prevalence of Viridiplantae and the Rhizobiaceae, which are almost certainly environmental contaminants.
Moreover, the vast majority of HeliScope sequences (~95%) are unknown due to both limitations in the databases and the non-specificity of the short sequences. Due to the limitations of the genetic databases, some authors have amplified 16S genes (the most studied microbial barcode genes) before high-throughput sequencing in order to derive more species-informative ancient DNA metagenomic data sets (e.g. [24, 55]). We have not conducted this PCR procedure here since it negates the advantages of single-molecule sequencing. Instead, since the majority of sequences in the soil Illumina data set are unknown, we attempted to use the Illumina data set as a pseudo-‘reference metagenome’ to identify environmental contaminants in our HeliScope data. However, only 1.13%–9.62% of the sequences in the Helicos data sets (including the HeliScope soil sequence pools) mapped onto the Illumina soil sequences (Table 13). Given our low estimate of endogenous human DNA, we would expect >99% of the HeliScope reads (or 100% in the case of the soil reads) to map onto the Illumina soil data set. This indicates that the HeliScope is generating sequences with too low an information content to align against reference genomes with sufficient precision to be considered likely matches.
DNA length, GC content and sequencing yields
Agilent 2100 Bioanalyzer profiles of the unsheared DNA extracts revealed the typical bimodal distribution of molecular lengths, with a large primary peak between 1000 and 10,000 bp and a small secondary peak around 70 bp. These peaks have previously been determined to correspond primarily to microbial contamination (large peak) and a mixture of fragmented contaminant and authentic ancient DNA (small peak) respectively [20, 32, 56]. Shearing the extracts produced unimodal distributions with modal lengths between 150 and 200 bp.
Shearing of the DNA extracts reduced the total sequence yield drastically (5400–6800-fold reduction in the bone and 6.8-fold reduction in the soil), but yielded a concomitant 8.0–10.6-fold enrichment in the proportion of sequences mapping to the human genome in the bone samples and a 2.3-fold enrichment in human sequences in the soil. The cause of this yield reduction remains unclear. Variation between runs cannot completely explain it since the sheared and unsheared soil samples were sequenced on the same chip and samples from unrelated projects on the same chip had typical HeliScope sequencing yields [D. Jones, Pers. Comm. 2013]. Hypothetically, oxidative products (such as 8-oxoguanine and abasic sites) near the 3′-terminus of the sheared DNA molecule could disrupt the downstream poly (A)-tailing reaction and reduce the overall sequencing yield. Costello and colleagues  noted that shearing DNA samples to 150 bp lengths using the Covaris instrument produced 8-oxoguanine lesions yielding C → A transversion artifacts in downstream Illumina library production. Nevertheless, 8-oxoguanine lesions were rare and were only observable in libraries constructed from small initial DNA inputs. 8-oxoguanine lesions thus seem an implausible explanation for our results, although the production of abasic sites at the 3′-terminus cannot be ruled out.
An alternative explanation is that the shearing caused a decrease in the percentage of denaturable DNA molecules. DNA molecules that cannot be rendered single-stranded are unsequenceable on the HeliScope platform, thus reducing the effective input DNA concentration. One mechanism could be a relative increase in GC-rich sequences in the sheared samples, thus increasing the energy required for denaturation of these molecules. Shearing longer GC-rich DNAs into smaller and more numerous molecules would increase the apparent GC content since this would raise the relative number of available 3′-termini from the GC-rich sequences while the shorter AT-rich sequences would not shear into as many sequenceable fragments. Although the HeliScope platform is noted for its improved performance in both GC- and AT-rich regions in comparison to other high-throughput technologies, it is not impervious to these biochemical effects. This would cause an asymmetric denaturation step with only shorter and/or AT-rich molecules becoming single-stranded in the pool. An alternative, similar mechanism is that the sheared DNA molecules (which mostly derive from the high-molecular-weight peak) are more likely to be cross-linked to proteins and themselves, thus preventing denaturation and sequencing. Since the shearing raised the average length of the molecules (in comparison to the smaller ‘ancient DNA’ peak), this indicates that a far greater percentage of the sequenced molecules in the sheared pools were derived from the high-molecular-weight ‘microbial contaminant’ peak than from the ‘ancient DNA’ peak. Many microbial metagenomes have been noted to be particularly GC-rich. Our samples have a high GC content at baseline (mean 62.4%, standard deviation 1.1% in the unsheared bone and mean 62.1% in the soil). Shearing reduced the apparent GC content to 46.4% (standard deviation 1.0%) in the bones and a 58.1% in the soil. Moreover, there is a decrease in molecular length in comparison to their unsheared counterparts (median lengths of 26–30 bp in the sheared samples versus 32–34 bp in the unsheared samples). The AT enrichment, reduction in molecular length and increased percentage of human molecules are consistent with the denaturation hypothesis since the sequenced sheared molecules would on average have lower GC contents and be shorter, thus increasing the likelihood of sequencing endogenous molecules.
Antarctic phosphatase treatment
Ginolhac and colleagues  reported a 7.5–9.7-fold enrichment in the total sequence yield in their tSMS data after treatment of samples with Antarctic phosphatase. Conversely, we observed no such enrichment in our data; phosphatase-treated samples had 80%–104% relative yield of their untreated counterparts. They also documented a decrease in the proportion of endogenous sequences in the phosphatase-treated samples (34%–50% relative yield compared to untreated samples). In general agreement with Ginolhac and colleagues’ results, we observed a decrease in endogenous human sequences after phosphatase treatment (70%–90% relative yield). The discrepancy in overall yields may be due to differences in the concentrations of sequenceable molecules in the DNA extracts, such as those possibly caused by varying burial environments (Pleistocene permafrost versus Mexican highlands), GC contents between the sample (~40% in their data versus ~60% in our individuals) and extraction procedures. If our samples had greater concentrations of sequenceable molecules (which is likely since our untreated sample yields exceeded those of Ginolhac and colleagues by ~100-fold), it is possible that we saturated the channel and thus limited the effects of phosphatase treatment. The discrepancy between Ginolhac and colleagues’ and our phosphatase treatment results could also be a statistical artifact. Due to the large number (typically 10–20 million) of sequences generated in each HeliScope channel and the significant variation between replicate runs (relative standard deviations ranged between 0.66% and 6.5%), the balance of effect sizes versus discriminatory power is difficult to optimize.
Environmental contamination is a critical issue in ancient DNA investigations of diseases. Both false positives and false negatives are a serious problem. Previous investigations of ancient pathogens using both PCR and high-throughput sequencing may need to be re-evaluated because pathogen identifications require extensive verification procedures due to the high risk of false positives. Currently, molecular enrichment to isolate molecules of interest from the complex background (via either hybridization to probes or amplification of 16S regions) is the most promising route for ancient pathogen studies.
Furthermore, although the benefits of single-molecule sequencing are promising, methodological challenges remain in its application to ancient DNA research. We found that tSMS is not immune to GC-content-related biases and that the benefits of phosphatase-treatment are not universal. Analysis of these data remains complex due to the short lengths of the sequenced molecules and limitations of public databases. Bioinformatic methods developed for modern DNA analysis and for longer sequences are not appropriate for HeliScope ancient DNA data.
Finally, further methodological development is required for the identification of ancient pathogens to become routine and reliable. Until these techniques become available, the burden-of-proof is on the researchers reporting the discovery of these disease agents to demonstrate their results’ authenticity.
Availability of supporting data
The data sets supporting the results of this article are available in the National Center for Biotechnology Information Sequence Read Archive, [SRP022977: http://www.ncbi.nlm.nih.gov/sra].
Ancient deoxyribonucleic acid
Polymerase chain reaction
True single-molecule sequencing
Viral hemorrhagic fever.
Papagrigorakis MJ, Yapijakis C, Synodinos PN, Baziotopoulou-Valavani E: DNA examination of ancient dental pulp incriminates typhoid fever as a probable cause of the Plague of Athens. Int J Infect Dis. 2006, 10: 206-214. 10.1016/j.ijid.2005.09.001.
Shapiro B, Rambaut A, Gilbert MTP: No proof that typhoid caused the Plague of Athens (a reply to Papagrigorakis et al.). Int J Infect Dis. 2006, 10: 334-335. 10.1016/j.ijid.2006.02.006.
Drancourt M, Aboudharam G, Signoli M, Dutour O, Raoult D: Detection of 400-year-old Yersinia pestis DNA in human dental pulp: an approach to the diagnosis of ancient septicemia. Proc Natl Acad Sci USA. 1998, 95: 12637-12640. 10.1073/pnas.95.21.12637.
Zink AR, Sola C, Reischl U, Grabner W, Rastogi N, Wolf H, Nerlich AG: Characterization of Mycobacterium tuberculosis complex DNAs from Egyptian mummies by spoligotyping. J Clin Microbiol. 2003, 41: 359-367. 10.1128/JCM.41.1.359-367.2003.
Nguyen-Hieu T, Aboudharam G, Signoli M, Rigeade C, Drancourt M, Raoult D: Evidence of a louse-born outbreak involving typhus in Douai, 1710–1712 during the War of Spanish Succession. PLoS One. 2010, 5: e15405-10.1371/journal.pone.0015405.
Gernaey AM, Minnikin DE, Copley MS, Dixon RA, Middleton JC, Roberts CA: Mycolic acids and ancient DNA confirm an osteological diagnosis of tuberculosis. Tuberculosis. 2001, 81: 259-265. 10.1054/tube.2001.0295.
Kolman CJ, Centurion-Lara A, Lukehart SA, Owsley DA, Tuross N: Identification of Treponema pallidum subspecies pallidum in a 200-year-old skeletal specimen. J Infect Dis. 2060–2063, 1999: 180-
Bianucci R, Rahalison L, Ferroglio E, Massa ER, Signoli M: Détection de l’antigène F1 de Yersinia pestis dans les restes humains anciens à l’aide d’un test de diagnostic rapide. C R Biol. 2007, 330: 747-754. 10.1016/j.crvi.2007.07.007.
Bianucci R, Rahalison L, Massa ER, Peluso A, Ferroglio E, Signoli M: Technical note: a rapid diagnostic test detects plague in ancient human remains: an example of the interaction between archaeological and biological approaches (Southeastern France, 16th‒18th centuries). Am J Phys Anthropol. 2008, 136: 361-367. 10.1002/ajpa.20818.
Bianucci R, Rahalison L, Peluso A, Massa ER, Ferroglio E, Signoli M, Langlois J-Y, Gallien V: Plague immunodetection in remains of religious exhumed from burial sites in central France. J Arch Sci. 2009, 36: 616-621. 10.1016/j.jas.2008.10.007.
Haensch S, Bianucci R, Signoli M, Rajerison M, Schultz M, Kacki S, Vermunt M, Weston DA, Hurst D, Achtman M, Carniel E, Bramanti B: Distinct clones of Yersinia pestis caused the Black Death. PLoS Pathog. 2010, 6: e1001134-10.1371/journal.ppat.1001134.
Kacki S, Rahalison L, Rajerison M, Ferroglio E, Bianucci R: Black Death in the rural cemetery of Saint-Laurent-de-la-Cabrerisse Aude-Languedoc, southern France, 14th century: immunological evidence. J Arch Sci. 2011, 38: 581-587. 10.1016/j.jas.2010.10.012.
Schuenemann VJ, Bos K, DeWitte S, Schmedes S, Jamieson J, Mittnik A, Forrest S, Coombes BK, Wood JW, Earn DJD, White W, Krause J, Poinar HN: Targeted enrichment of ancient pathogens yielding the pPCP1 plasmid of Yersinia pestis from victims of the Black Death. Proc Natl Acad Sci USA. 2011, 108: E746-E752. 10.1073/pnas.1105107108.
Bos KI, Schuenemann VJ, Golding GB, Burbano HA, Waglechner N, Coombes BK, McPhee JB, DeWitte SN, Meyer M, Schmedes S, Wood J, Earn DJD, Herring DA, Bauer P, Poinar HN, Krause J: A draft genome of Yersinia pestis from victims of the Black Death. Nature. 2011, 478: 506-510. 10.1038/nature10549.
Barnes I, Thomas MG: Evaluating bacterial pathogen DNA preservation in museum osteological collections. P Roy Soc B. 2006, 273: 645-653. 10.1098/rspb.2005.3339.
Bouwman AS, Brown TA: The limits of biomolecular palaeopathology: ancient DNA cannot be used to study venereal syphilis. J Arch Sci. 2005, 32: 703-713. 10.1016/j.jas.2004.11.014.
Gilbert MTP, Cuccui J, White W, Lynnerup N, Titball RW, Cooper A, Prentice MB: Absence of Yersinia pestis-specific DNA in human teeth from five European excavations of putative plague victims. Microbiology. 2004, 150: 341-354. 10.1099/mic.0.26594-0.
Gilbert MTP, Rudbeck L, Willerslev E, Hansen AJ, Smith C, Penkman KEH, Prangenberg K, Nielsen-Marsh CM, Jans ME, Arthur P, Lynnerup N, Turner-Walker G, Biddle M, Kjølbye-Biddle B, Collins MJ: Biochemical and physical correlates of DNA contamination in archaeological human bones and teeth excavated at Matera, Italy. J Arch Sci. 2005, 32: 785-793. 10.1016/j.jas.2004.12.008.
Gilbert MTP, Hansen AJ, Willerslev E, Turner-Walker G, Collins M: Insights into the processes behind the contamination of degraded human teeth and bone samples with exogenous sources of DNA. Int J Osteoarchaeol. 2006, 16: 156-164. 10.1002/oa.832.
Pääbo S, Poinar H, Serre D, Jaenicke-Després V, Hebler J, Rohland N, Kuch M, Krause J, Vigilant L, Hofreiter M: Genetic analyses from ancient DNA. Annu Rev Genet. 2004, 38: 645-679. 10.1146/annurev.genet.37.110801.143214.
Drancourt M, Raoult DR: Palaeomicrobiology: current issues and perspectives. Nat Rev Microbiol. 2005, 3: 23-35. 10.1038/nrmicro1063.
Roberts C, Ingham S: Using ancient DNA analysis in palaeopathology: a critical analysis of published papers and recommendations for future work. Int J Osteoarchaeol. 2008, 18: 600-613. 10.1002/oa.966.
Tsangaras K, Greenwood AD: Museums and disease: using tissue archive and museum samples to study pathogens. Ann Anat. 2012, 194: 58-73. 10.1016/j.aanat.2011.04.003.
Thèves C, Senescau A, Vanin S, Keyser C, Ricaut FX, Alekseev AN, Dabernat H, Ludes B, Fabre R, Crubézy : Molecular identification of bacteria by total sequence screening: determining the cause of death in ancient human subjects. PLoS One. 2011, 6: e21733-10.1371/journal.pone.0021733.
Knapp M, Hofreiter M: Next generation sequencing of ancient DNA: requirements, strategies and perspectives. Genes. 2010, 1: 227-243. 10.3390/genes1020227.
Acuna-Soto R, Stahle DW, Therrell MD, Griffin RD, Cleaveland MK: When half of the population died: the epidemic of hemorrhagic fevers of 1576 in Mexico. FEMS Microbiol Lett. 2004, 240: 1-5. 10.1016/j.femsle.2004.09.011.
Warinner C, Robles García N, Spores R, Tuross N: Disease, demography, and diet in early colonial New Spain: investigation of a sixteenth-century Mixtec cemetery at Teposcolula Yucundaa. Lat Am Antiq. 2012, 23: 467-489. 10.7183/1045-66184.108.40.2067.
Marr JS, Kiracofe JB: Was the huey cocoliztli a haemorrhagic fever?. Med Hist. 2000, 44: 341-362.
Warinner CG: PhD thesis. Life and Death at Teposcolula Yucundaa: Mortuary, Archaeogenetic, and Isotopic Investigations of the Early Colonial Period in Mexico. 2010, Harvard University, Human Evolutionary Biology Department
Cooper A, Poinar HN: Ancient DNA: do it right or not at all. Science. 2000, 289: 1139-
Debruyne R, Schwarz C, Poinar H: Comment on ‘Whole-genome shotgun sequencing of mitochondria from ancient hair shafts”. Science. 2008, 322: 857a-10.1126/science.1158417.
Orlando L, Ginolhac A, Raghavan M, Vilstrup J, Rasmussen M, Magnussen K, Steinmann KE, Kapranov P, Thompson JF, Zazula G, Froese D, Moltke I, Shapiro B, Hofreiter M, Al-Rasheid KAS, Gilbert MTP, Willerslev E: True single-molecule DNA sequencing of a pleistocene horse bone. Genome Res. 2011, 21: 1705-1719. 10.1101/gr.122747.111.
Ginolhac A, Vilstrup J, Stenderup J, Rasmussen M, Stiller M, Shapiro B, Zazula G, Froese D, Steinmann KE, Thompson JF, Al-Rasheid KAS, Gilbert TMP, Willerslev E, Orlando L: Improving the performance of true single molecule sequencing for ancient DNA. BMC Genomics. 2012, 13: 177-10.1186/1471-2164-13-177.
Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD: PANDAseq: PAired-eND Assembler for Illumina sequences. BMC Bioinformatics. 2012, 13: 31-10.1186/1471-2105-13-31.
Lassman T, Hayashizaki Y, Daub CO: TagDust—a program to eliminate artifacts from next generation sequencing data. Bioinformatics. 2009, 25: 2839-2840. 10.1093/bioinformatics/btp527.
Li W, Fu L, Niu B, Wu S, Wooley J: Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012, 13: 656-668. 10.1093/bib/bbs035.
Andrews S: Fastqc: A Quality Control Tool For High Throughput Sequence Data.http://www.bioinformatics.babraham.ac.uk/projects/fastqc/,
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics. 2078–2079, 2009: 25-
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
Li H, Durbin R: Fast and accurate long read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26: 589-595. 10.1093/bioinformatics/btp698.
Schubert M, Ginolhac A, Lindgreen S, Thompson JF, Al-Rasheid KAS, Willerslev E, Krogh A, Orlando L: Improving ancient DNA read mapping against modern reference genomes. BMC Genomics. 2012, 13: 178-10.1186/1471-2164-13-178.
Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.
Huson D, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC: Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011, 2011 (21): 1552-1560.
R Core Team: R: A Language And Environment For Statistical Computing. 2013, Vienna: R Foundation for Statistical Computing
Hansen MA, Oey H, Fernandez-Valverde S, Jung C-H, Mattick JS: Biopieces: A Bioinformatics Toolset and Framework.http://www.biopieces.org,
Jónsson H, Ginolhac A, Schubert M, Johnson P, Orlando L: mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics. 2013, 29 (13): 1682-1684. 10.1093/bioinformatics/btt193.
Briggs AW, Stenzel U, Johnson PLF, Green RE, Kelso J, Prüfer K, Meyer M, Krause J, Ronan MT, Lachmann M, Pääbo S: Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA. 2007, 104: 14616-14621. 10.1073/pnas.0704665104.
Acuna-Soto R, Calderon Romero L, Maguire JH: Large epidemics of hemorrhagic fevers in Mexico 1545–1815. Am J Trop Med Hyg. 2000, 62: 733-739.
Acuna-Soto R, Stahle DW, Cleaveland MK, Therrell MD: Megadrought and megadeath in 16th century Mexico. Emerg Infect Dis. 2002, 8: 360-362. 10.3201/eid0804.010175.
Gilbert MTP, Rambaut A, Wlasiuk G, Spira TJ, Pitchenik AE, Worobey M: The emergence of HIV/AIDS in the Americas and beyond. Proc Natl Acad Sci USA. 2007, 104: 18566-18570. 10.1073/pnas.0705329104.
Fordyce SL, Ávila-Arcos MC, Rasmussen M, Cappellini E, Romero-Navarro JA, Wales N, Alquezar-Planas DE, Penfield S, Brown TA, Vielle-Calzada J-P, Montiel R, Jørgensen T, Odegaard N, Jacobs M, Arriaza B, Higham TFG, Bronk Ramsey C, Willerslev E, Gilbert MTP: Deep sequencing of RNA from ancient maize kernels. PLoS One. 2013, 8: e50961-10.1371/journal.pone.0050961.
Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, Bunce M, Muyembe J-J, Kabongo J-MM, Kalengayi RM, Van Marck E, Gilbert MTP, Wolinksy SM: Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature. 2008, 455: 661-664. 10.1038/nature07390.
Khairat R, Ball M, Chang C-CH, Bianucci R, Nerlich AG, Trautmann M, Ismail S, Shanab GML, Karim AM, Gad YZ, Pusch CM: First insights into the metagenome of Egyptian mummies using next-generation sequencing. J Appl Genetics. 2013, 54: 309-325. 10.1007/s13353-013-0145-1.
Chan JZ-M, Sergeant MJ, Lee OY-C, Minnikin DE, Besra GS, Pap I, Spigelman M, Donoghue HD: Metagenomic analysis of tuberculosis in a mummy. N Engl J Med. 2013, 369: 289-290. 10.1056/NEJMc1302295.
Adler CJ, Dobney K, Weyrich LS, Kaidonis J, Walker AW, Haak W, Bradshaw CJA, Townsend G, Sołtysiak A, Alt KW, Parkhill J, Cooper A: Sequencing ancient calcified dental plaque shows changes in oral microbiota with dietary shifts of the Neolithic and Industrial revolutions. Nat Genet. 2013, 45: 450-455. 10.1038/ng.2536.
Malmström H, Svensson EM, Gilbert MTP, Willerslev E, Götherström A, Holmlund G: More on contamination: the use of asymmetric molecular behavior to identify authentic ancient human DNA. Mol Biol Evol. 2007, 24: 998-1004. 10.1093/molbev/msm015.
Costello M, Pugh TJ, Fennell TJ, Stewart C, Lichtenstein L, Meldrim JC, Fostel JF, Friedrich DC, Perrin D, Dionne D, Kim S, Gabriel SB, Lander ES, Fisher S, Getz G: Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013, 41: e67-10.1093/nar/gks1443.
The Instituto Nacional de Antropología e Historia graciously supplied specimens for analysis. Christina Warinner collected bone and soil specimens in the field. Aurelien Ginolhac and Ludovic Orlando made an unpublished version of their manuscript available to us. Dan Jones (Helicos BioSciences) provided technical expertise. The Harvard Initiative for the Science of the Human Past (Harvard University) funded the Illumina sequencing. Linda Reynard helpfully commented on the manuscript. The Bauer Core (Faculty of Arts and Sciences Center for Systems Biology, Harvard University) and Research Computing (Faculty of Arts and Sciences, Harvard University) provided facilities and technical expertise. The Mäxi Foundation and the Faculty of Arts and Sciences (Harvard University) supported MGC. NRG was funded by the Instituto Nacional de Antropología e Historia. The Mäxi Foundation supported FJR. The Faculty of Arts and Sciences (Harvard University) and the David Rockefeller Center for Latin American Studies supported NT.
The authors declare that they have no competing interests.
MGC conducted the molecular genetic analyses, performed the bioinformatic analysis and drafted the manuscript. NRG provided samples and conceived the project. FJR participated in the coordination and design of the project. NT performed molecular experiments and conceived, coordinated and led the project. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 3:Concordance of the mitochondrial haplotypes between our Illumina data sets and those reported in Warinner.(DOCX 64 KB)
Additional file 4:MEGAN analysis of the reads mapped to the Yersinia pestis genome. The reads are non-specific for the pathogen. (PDF 7 KB)
Additional file 5:MEGAN analysis of the reads mapped to Rickettsia genomes. The reads are non-specific for the pathogens. Instead, they correspond to common environmental organisms. (PDF 4 KB)
Additional file 6:MEGAN comparison of the BLAST hits between the Grand Plaza (blue) and Churchyard (red) populations. Significant differences in relative species prevalence are highlighted in black. The vast majority of the DNA sequences are derived from the environment, obscuring any ancient pathogen signal. (PNG 13 KB)
About this article
Cite this article
Campana, M.G., Robles García, N., Rühli, F.J. et al. False positives complicate ancient pathogen identifications using high-throughput shotgun sequencing. BMC Res Notes 7, 111 (2014). https://doi.org/10.1186/1756-0500-7-111
- True single molecule sequencing
- High-throughput sequencing
- Ancient DNA
- False positive