False positives complicate ancient pathogen identifications using high-throughput shotgun sequencing

Background Identification of historic pathogens is challenging since false positives and negatives are a serious risk. Environmental non-pathogenic contaminants are ubiquitous. Furthermore, public genetic databases contain limited information regarding these species. High-throughput sequencing may help reliably detect and identify historic pathogens. Results We shotgun-sequenced 8 16th-century Mixtec individuals from the site of Teposcolula Yucundaa (Oaxaca, Mexico) who are reported to have died from the huey cocoliztli (‘Great Pestilence’ in Nahautl), an unknown disease that decimated native Mexican populations during the Spanish colonial period, in order to identify the pathogen. Comparison of these sequences with those deriving from the surrounding soil and from 4 precontact individuals from the site found a wide variety of contaminant organisms that confounded analyses. Without the comparative sequence data from the precontact individuals and soil, false positives for Yersinia pestis and rickettsiosis could have been reported. Conclusions False positives and negatives remain problematic in ancient DNA analyses despite the application of high-throughput sequencing. Our results suggest that several studies claiming the discovery of ancient pathogens may need further verification. Additionally, true single molecule sequencing’s short read lengths, inability to sequence through DNA lesions, and limited ancient-DNA-specific technical development hinder its application to palaeopathology.


Background
Diseases have ravaged human populations throughout history. The pathogens responsible for many historic epidemics remain either unknown or speculative (e.g. the Plague of Athens [1,2]). Identification of these historic pathogens is critical to understanding disease evolution, which in turn has direct impacts on the development of effective medical treatments for these conditions. The discovery that ancient pathogen-diagnostic biomolecules may survive in archaeological bones and the development of high-throughput DNA sequencing technologies have opened new possibilities for determining past disease [3][4][5][6][7][8][9][10][11][12][13][14].
Furthermore, PCR-based methods are limited in their ability to identify unknown pathogens since they require candidate disease agents of known sequence in advance. Although there are a few studies that attempt such de novo identification (e.g. [5,24]), the vast majority of investigations have been restricted to targeted, likely diseases. While DNA capture methods, including in-solution [e.g. 13] and array-hybridization [e.g. 14] capture, have become the gold standard for palaeopathogenomics, no such technology is yet available for palaeopathogen diagnosis. In theory, it is possible to identify unknown diseases by high-throughput DNA sequencing of known afflicted individuals and comparing these pools with suitable controls (such as the surrounding soil and uninfected archaeologically related individuals). Organisms found only in the infected population would thus be candidates for the responsible pathogens. Nevertheless, since the majority of sequences in most ancient DNA samples are derived from the environment (particularly soil bacteria), false positives could remain problematic in high-throughput analyses [23,25]. Moreover, the public genetic databases are both biased towards pathogenic cellular organisms and deficient in virus sequences, thereby increasing the false positive rate for cellular pathogens while simultaneously raising the false negative rate for viral infections in de novo identifications.
Here we show that false positives are a significant error source in palaeopathological analyses using highthroughput sequencing. Using two platforms (the Helicos BioSciences HeliScope and the Illumina HiSeq 2500), we attempted to identify de novo the pathogen responsible for the huey cocoliztli ('Great Pestilence' in Nahautl), a disease that ravaged the native Mexican population during the Spanish colonial period [26].

DNA extraction
In order to identify the huey cocoliztli pathogen, we analyzed the site of Teposcolula Yucundaa (Oaxaca, Mexico), a large Mixtec city at the time of Spanish colonization [27]. The site was abandoned in 1552 following a major outbreak of the huey cocoliztli in the 1540 s. Excavations at Teposcolula Yucundaa conducted between 2004 and 2006 revealed two Mixtec cemeteries: a colonial period graveyard (the Grand Plaza) and a smaller precolonial one (the Churchyard) [27]. Warinner and colleagues [27] identified the Grand Plaza graveyard as a plague pit corresponding to the 1540 s pandemic of the huey cocoliztli. The precontact Churchyard population is assumed to have been uninfected with the huey cocoliztli since the native Mexican populations were unfamiliar with the disease at the time of its first appearance during the colonial period [28]. DNA preservation at Teposcolula Yucundaa is exceptional, probably due to the high-altitude, cool environment and relatively recent date [27].
Twelve femoral cortical bone specimens (each representing a unique individual; Table 1) were collected in the field and ground to a fine power (described in [29]). DNA extractions were performed in a dedicated ancient DNA laboratory in the Department of Human Evolutionary Biology, Harvard University according to standard anticontamination protocols [30]. Approximately 1 g of bone powder per individual was decalcified in 0.5 M EDTA, pH 8.0. Raw DNA extract was passed through a vacuum filter to remove residual protein and powder and then concentrated to~500 μl via ultrafiltration using Vivaspin® 20's (Vivaproducts) with a 10 kDa molecular weight cut-off. Concentrated extracts were treated with proteinase K. Protein-digestion completion was verified using a Qubit® 2.0 fluorometer (Life Technologies). Digested extracts were then purified using QIAquick PCR Purification Kits (Qiagen). Bulk DNA extracts consisting of 4 individuals per bulk were constructed for the Grand Plaza  (individuals TP02, TP10, TP15, TP26) and Churchyard  (individuals TP32, TP42, TP45, TP48) populations. Bulking the samples increased the likelihood of detecting the pathogen since only a fraction of the infected individuals are expected to have endogenous disease DNA preserved [e.g. 13,14].
Three samples of soil from the Grand Plaza and the Churchyard cemeteries were also collected. Soil was collected from the burial contexts (within a few centimeters of the skeletons) by trowelling samples directly into collection bags while wearing gloves to limit DNA contamination. DNA was extracted from 5-8 g of soil per sample using the PowerMax® Soil DNA Isolation Kit (Mo-Bio Laboratories). Purified extracts from each burial were then combined into a bulk sample.
Finally, subsamples of all three bulk extracts (Churchyard, Grand Plaza and soil) were sheared to 150 bp average length using a S220 Focused-ultrasonicator (Covaris, Inc.) for subsequent true single molecule sequencing (tSMS). Although shearing DNA extracts is typically denigrated in high- Sample details are from [27,29].
throughput ancient DNA analyses (e.g. [31]), the sequences generated by tSMS are shorter (typically <40 bp) than the endogenous DNA molecules known to be present in the Teposcolula Yucundaa individuals by PCR (~100 bp) [27] and Bioanalyzer assays (see below). Since tSMS only generates one sequence per molecule and has limited ability to sequence through DNA lesions, shearing might make more of the endogenous DNA sequenceable (but at the cost of a possible increase in microbial contamination) by producing multiple 3′-termini per original molecule. Additionally, six extracts representing single individuals (TP04, TP09, TP10, TP18, TP37, TP48) and a subsample of the soil bulk extract were sheared to 150-200 bp average length for subsequent Illumina library construction using the automated Apollo 324 (IntegenX) platform. DNA concentrations for all individual extracts, bulks and sheared bulks were calculated using a high-specificity DNA kit on a Qubit® fluorometer. Length distributions of DNA molecules in these samples were calculated using a high-specificity DNA chip on an Agilent 2100 Bioanalyzer (Agilent Biotechnologies).

Helicos HeliScope sequencing
A total of 12 channels of tSMS was performed on a Helicos HeliScope sequencer at Helicos BioSciences (Cambridge, MA). Eight μl per channel of each bulk and sheared bulk were prepared for tSMS according to the standard protocol for ancient samples [32]. Two channels of each bulk and one channel of each sheared bulk were sequenced. Additionally, samples of the Churchyard and Grand Plaza bulk extracts were treated with Antarctic Phosphatase (New England Biolabs), diluted to the equivalent concentration of their untreated counterparts and prepared for HeliScope sequencing as described in [33]. Phosphatase-treatment may increase HeliScope sequence yield [33]. Two channels of each phosphatase-treated bulk were sequenced. The bulks and phosphatase-treated bulks were sequenced on one chip, while the soil and sheared bulks were sequenced on another. For the unsheared Churchyard and Grand Plaza bulks (both phosphatasetreated and untreated samples), 23-31 million qualitycontrolled sequences were generated per channel ( Table 2). There were technical issues with the sheared and soil samples (see below), so only between 4,000 and 840,000 reads per channel were obtained for these four channels ( Table 2).

Illumina HiSeq 2500 sequencing
Initial tSMS results (see below) revealed very high microbial species diversity in the bone and soil samples. The very short sequence length hampered accurate identification of these species. Therefore, we constructed Illumina libraries from six bone extracts representing single "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. The "No. Filtered Sequences" is the number of sequences after filtering for sequencing artifacts (e.g. PCR duplicates). "Molecular Length" is the mean length (± standard deviation) of the generated reads.
individuals and the bulk soil sample using the PrepX Illumina Kit (IntegenX) on the Apollo 324 according to manufacturer's instructions using NEXTflex™ DNA barcodes (BioO Scientific). The indexed libraries were then enriched by 13 cycles of PCR using the NEXTflex™ kit. Library qualities were confirmed via fluorometric quantitation (Qubit®), analysis on a high-specificity DNA chip (Agilent 2100) and quantitative PCR using the KAPA Library Quantification Kit -Illumina/Universal (KAPA Biosystems). The libraries were then pooled in equimolar ratios, and paired-end 150 bp reads were generated on the Illumina HiSeq 2500 platform. Initial quality control and demultiplexing was performed using CASAVA 1.8.2. Paired-end reads were merged using PANDAseq 2.4.0 [34]. Adapter artifacts and PCR duplicates were removed using TagDust 1.12 [35] and CD-HIT 4.6 [36]. The library sequencing qualities were checked using FastQC 1.32 [37]. The final data sets contained 3.7-9.6 million quality-controlled reads per library ( Table 2).

Bioinformatic analysis
Sequence file formats were manipulated using SAMtools 0.1.18 [38]. Sequence data sets were aligned against reference genomes of interest (Additional file 1, see Results and discussion below) using BWA 0.6.2 [39,40] according to the recommended settings in [41]. Sequences were also aligned against the GenBank® non-redundant nucleotide database using megaBLAST (BLAST 2.2.25+) [42] and analyzed in MEGAN 4.70.4 [43]. Statistical analyses were performed in R 2.15.3 [44] or using Biopieces [45] and custom scripts.

Results and discussion
Human DNA and authenticity Alignment of the samples against the human (Homo sapiens) reference genome (GRCh37.p11) revealed that 0.66%-6.40% of the HeliScope reads in each channel and 0.00%-0.42% of Illumina sequences in each library corresponded to human DNA (Table 3, Additional file 1). The HeliScope endogenous human DNA quantifications are probably an underestimate of the true human DNA content since short sequences with low information contents are unlikely to map against the reference genome with sufficient confidence to be assigned (See 'Soil Complexity' below). Furthermore, the mean lengths (~25 bp) of the sequences mapped to the human genome were shorter than those (~34 bp) of all the sequences (mapped and unmapped included) in each channel (one-tailed t-test, p < 0.00001, Tables 2 and 3). As in [32], analysis of the mapped reads' substitution artifacts using mapDamage2.0 [46] showed a "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto the human genome. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
uniform distribution of C → T and G → A transitions along the molecules. These substitutions typically accumulate at the 5′-and 3′-termini of ancient DNA molecules respectively [47]. These results suggest that the HeliScope chemistry cannot sequence through uracil lesions effectively, thereby shortening the recovered sequences of the endogenous ancient molecules and producing a uniform transition profile. Due to the HeliScope sequences' short read lengths, false positives due to the presence of contamination from closely related species (e.g. rodents) were a concern. Therefore, we also aligned the HeliScope data sets against the rat (Rattus norvegicus) genome (Rnor_5.0) and the chimpanzee (Pan troglodytes) genome (Pan_troglodytes-2.1.4) (Additional file 1). This ad-hoc test revealed a significant gradient of increasing numbers of hits against the genomes more closely related to humans (Kruskal-Wallis test, p = 0.0004609) with a concomitant increase in mean hit length in the more closely related species (Kruskal-Wallis test, p = 0.007629), suggesting that the HeliScope human-aligned reads derived from humans (Additional file 2).
Human DNA (1.12%-2.35% of all reads) was also identified in the soil HeliScope data sets, raising the possibility of laboratory contamination affecting our data. Nevertheless, analysis of the same soil bulk sample via Illumina sequencing revealed only 37 sequences (0.00%) that matched humans in the soil at the longer lengths (150-298 bp) obtained in the Illumina data. Comparatively, the individual bones had 300-27,430 human DNA sequences (0.01-0.42%) at equivalent lengths (Table 3). This indicates that the human DNA in the soil has undergone more degradation than that in the bone, a result inconsistent with the laboratory contamination hypothesis [30]. Moreover, although mitochondrial haplotype profiles were incomplete due to absence of a mitochondrial DNA enrichment step, the individual haplotypes (based on the Illumina sequences) generally agreed with those found in [29], although a high rate of nucleotide misincorporations was observed (Additional file 3). Analysis of these substitutions using mapDamage2.0 showed the characteristic ancient DNA pattern, with higher rates of C → T and G → A transitions at the 5′-and 3′-termini respectively [47]. This suggests that the human DNA in the bone samples is endogenous. It is possible that the soil human DNA derives from archaeologists in the field, although this is relatively unlikely since the bones and soil were collected while wearing gloves. Since the site was abandoned shortly after the huey cocoliztli pandemic [27], the most likely source of human DNA in the soil is not from later occupants of the site, but rather the numerous decomposing dead (estimated to at least 800 individuals) interred nearly simultaneously in the Grand Plaza cemetery. Detection of leached DNA is especially likely since the soil samples were collected immediately adjacent to the interred skeletons. If endogenous human DNA leached into the soil, it is possible that DNA from the huey cocoliztli pathogen may also have. However, since pathogen DNA is typically much rarer than host DNA [e.g. 13], this level of DNA leaching would probably be undetectable using current methodologies.

Candidate diseases
Based on historical descriptions of the huey cocoliztli, likely candidate pathogens include pneumonic plague (Yersinia pestis), typhus and other forms of rickettsiosis (Rickettsia spp.), smallpox or alastrim (Variola spp.), and viral hemorrhagic fevers (VHF), such as Dengue and Yellow Fever [26,28,48,49]. Measles (Morbillivirus Measles virus), dysentery, influenza (Influenzavirus spp.), pneumonia and pleurisy have also been suggested, although these are less likely diagnoses since the historical descriptions of the huey cocolitzli symptoms do not match these diseases' extant presentations [28]. Since dysentery, pneumonia and pleurisy can be caused by a wide range of organisms and are unlikely diagnoses for the huey cocoliztli, we did not analyze them further via the candidate genome approach described here. Additionally, VHFs, measles and influenza are caused by RNA viruses, which are unlikely to survive in the archaeological record due to RNA's instability and the ubiquity of RNases [50]. Nevertheless, some have reported the successful amplification and sequencing of RNA viruses from medical archival preserved tissue dating back several decades [51,52], and Fordyce and colleagues [50] reported high-throughput sequencing of RNA transcripts in 700-year-old maize kernels. Moreover, the HeliScope will sequence RNA directly (even using DNA settings for the machine as conducted here albeit with reduced efficiency), so there remains a finite detection possibility for RNA viruses.
A substantial proportion of the unsheared HeliScope reads (0.01%, 2198-3880 reads per channel) matched the Yersinia pestis genome in both the Grand Plaza and Churchyard populations (Additional file 1). This also matched the proportion of putative Y. pestis reads found in the HeliScope-sequenced soil data set (0.01%, 63 reads) ( Table 4). BLAST analysis of the mapped reads in MEGAN showed that these identifications were nonspecific, with the sequences matching a wide variety of organisms, suggesting these mapped reads are false positives (Additional file 4). According to the BLAST identifications, only 0.29% and 2.0% of the reads mapped to the Yerinia pestis genome by BWA matched the Enterobacteriaceae and Gammaproteobacteria respectively, Yersinia's family and class.
Alignment of the unsheared Churchyard and Grand Plaza bulk HeliScope data against typhus (Rickettsia prowazekii and Rickettsia typhi) genomes found that 0.00%-0.01% (1,210-2,089 reads) corresponded to these species (Table 5, Additional file 1). The identification of Rickettsia in bone matches its prevalence in the HeliScope-sequenced soil sample (0.00%, 30 reads). It is possible that the putative Rickettsia DNA in the soil derives from leaching from the bone. However, this explanation is improbable since the quantity of leached pathogen DNA is expected to be too small to detect given the low amounts of soil human DNA [e.g. 13]. Moreover, there were no differences between the Grand Plaza and Churchyard graveyards in terms of these species' prevalences or lengths of the aligned sequences (30 ± 6 bp). Only one read corresponding to Rickettsia was found in the Illumina data sets. BLAST analysis of the Rickettsia hits showed that these sequences derived from various soil organisms rather than bona fide pathogens (Additional file 5). According to BLAST analyses, no reads matched the Rickettsiales and only 1.3% belonged to the Alphaproteobacteria order. We thus find no evidence that rickettsiosis is responsible for the huey cocoliztli.
A negligible number (11-30 reads per channel in the unsheared HeliScope bone data sets, none in the Illumina data sets) of reads mapped onto the Variola genome, the one candidate DNA virus ( Table 6, Additional file 1). Alignment against the RNA virus candidates, including all four primary families of VHF (Arenaviridae, Bunyaviridae, Filoviridae, and Flaviviridae), measles and influenza, also yielded negative results (<120 reads per channel per genome, Tables 7,8,9,10,11 and 12, Additional file 1).
In conclusion, alignment of the HeliScope and Illumina data sets compared against the candidate disease reference genomes produced inconclusive results. While it is possible that the huey cocoliztli is not preserved in femoral cortical bone, historical documents describe a systemic infection with symptoms including hemorrhage and ulceration in multiple organs [48]. It is therefore likely that the pathogen would be preserved in all vascularized tissues. Additionally, due to the instability of the RNA molecule and the paucity of the virus sequence databases, negative viral results are currently difficult to evaluate. Our results demonstrate that false positives are a serious problem for analyses identifying molecules via alignment against reference genomes and for analyses that omit sequencing archaeological controls. Nevertheless, capture of complete species-specific diagnostic sequences and genomes (e.g. [13,14]) may be a viable method for isolating and verifying ancient pathogen DNA in the absence of these controls.
Currently, few high-throughput palaeopathogen DNA analyses have been conducted. The majority of studies have been PCR-based, an approach whose limitations are well documented [e.g. [21][22][23]. Although it is difficult to "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto the Yersinia pestis genome. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
"Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto the typhus genome. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
Soil Bulk 0 (0.00%) -0 "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto the Variola genome. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
"Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto Arenaviridae genomes. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
Soil Bulk 0 (0.00%) -0 "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto Bunyaviridae genomes. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
"Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto Filoviridae genomes. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
Soil Bulk 0 (0.00%) -0 "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto Flaviviridae genomes. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.
"Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto influenza genomes. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.  compare different archaeological collections and data sets in terms of the likelihood of ancient pathogen recovery, our data are illustrative of the challenges encountered in high-throughput metagenomic pathogen studies. They emphasize the level of analytic rigor and proof required to authenticate ancient pathogen analyses. In light of our findings, some previous metagenomic ancient DNA research claiming the discovery of pathogen-derived DNA may need further verification. For instance, although Thèves and colleagues [22] and Khairat and colleagues [53] report identifying Bordetella sp., Streptococcus pneumoniae and Shigella dystenteriae [22] and Plasmodium falciparum and Toxoplasma gondii [53] respectively, neither group sequenced archaeological controls such as soil or mummy wrappings. Thèves and colleagues amplified and sequenced microbial barcode 16S genes, increasing their results' reliability since these genes are well characterized across a wide variety of species. Khairat and colleagues, however, based their apicomplexan identifications on MEGAN analysis, with no attempt to evaluate the species-specificity of the sequenced molecules. These identifications are therefore suspect since the genetic databases are skewed towards pathogenic members of this lineage. Additionally, a recent study by Chan and colleagues [54] claiming the identification of multiple strains of pathogenic tuberculosis (Mycobacterium tuberculosis) through non-targeted metagenomic sequencing has demonstrated insufficient analytical rigor to support their conclusions. The authors aligned their sequences against a single strain of pathogenic tuberculosis, but did not account for misalignments or environmental contamination with ubiquitous soil mycobacteria. Chan and colleagues' data merit reanalysis with appropriate environmental controls. We recommend that the authors of these three studies demonstrate the veracity of their findings using a targeted capture approach and further bioinformatic analysis.

Soil complexity
Examination of the Teposcolula soil DNA in MEGAN revealed that the microenvironment is complex, making identification of species-of-interest difficult. While there were significant differences in the relative frequencies of microorganisms between the Grand Plaza and Churchyard samples (Additional file 6), these differences corresponded to variation in the distribution of environment-derived organisms across the site rather than pathogen-related frequency variation. For instance, we found significant differences in the prevalence of Viridiplantae and the Rhizobiaceae, which are almost certainly environmental contaminants. Moreover, the vast majority of HeliScope sequences (~95%) are unknown due to both limitations in the databases and the non-specificity of the short sequences. Due to the limitations of the genetic databases, some authors have amplified 16S genes (the most studied microbial barcode genes) before high-throughput sequencing in order to derive more species-informative ancient DNA metagenomic data sets (e.g. [24,55]). We have not conducted this PCR procedure here since it negates the advantages of single-molecule sequencing. Instead, since the majority of sequences in the soil Illumina data set are unknown, we attempted to use the Illumina data set as a pseudo-'reference metagenome' to identify environmental contaminants in our HeliScope data. However, only 1.13%-9.62% of the sequences in the Helicos data sets (including the HeliScope soil sequence pools) mapped onto the Illumina soil sequences (Table 13). Given our low estimate of endogenous human DNA, we would expect >99% of the HeliScope reads (or 100% in the case of the soil reads) to map onto the Illumina soil data set. This indicates that the HeliScope is generating sequences with too low an information content to align against reference genomes with sufficient precision to be considered likely matches. "Bulk" samples are the unsheared bulks. "Phos." samples are the phosphatase-treated unsheared bulks. "Sheared" samples are the sheared bulks. "% Mapped" indicates the percentage of the total number of reads comprised by the reads mapped onto influenza genomes. "Molecular Length" is the mean length (± standard deviation) of the mapped reads.

DNA length, GC content and sequencing yields
Agilent 2100 Bioanalyzer profiles of the unsheared DNA extracts revealed the typical bimodal distribution of molecular lengths, with a large primary peak between 1000 and 10,000 bp and a small secondary peak around 70 bp. These peaks have previously been determined to correspond primarily to microbial contamination (large peak) and a mixture of fragmented contaminant and authentic ancient DNA (small peak) respectively [20,32,56]. Shearing the extracts produced unimodal distributions with modal lengths between 150 and 200 bp. Shearing of the DNA extracts reduced the total sequence yield drastically (5400-6800-fold reduction in the bone and 6.8-fold reduction in the soil), but yielded a concomitant 8.0-10.6-fold enrichment in the proportion of sequences mapping to the human genome in the bone samples and a 2.3-fold enrichment in human sequences in the soil. The cause of this yield reduction remains unclear. Variation between runs cannot completely explain it since the sheared and unsheared soil samples were sequenced on the same chip and samples from unrelated projects on the same chip had typical HeliScope sequencing yields [D. Jones, Pers. Comm. 2013]. Hypothetically, oxidative products (such as 8-oxoguanine and abasic sites) near the 3′-terminus of the sheared DNA molecule could disrupt the downstream poly (A)-tailing reaction and reduce the overall sequencing yield. Costello and colleagues [57] noted that shearing DNA samples to 150 bp lengths using the Covaris instrument produced 8oxoguanine lesions yielding C → A transversion artifacts in downstream Illumina library production. Nevertheless, 8-oxoguanine lesions were rare and were only observable in libraries constructed from small initial DNA inputs. 8oxoguanine lesions thus seem an implausible explanation for our results, although the production of abasic sites at the 3′-terminus cannot be ruled out.
An alternative explanation is that the shearing caused a decrease in the percentage of denaturable DNA molecules. DNA molecules that cannot be rendered single-stranded are unsequenceable on the HeliScope platform, thus reducing the effective input DNA concentration. One mechanism could be a relative increase in GC-rich sequences in the sheared samples, thus increasing the energy required for denaturation of these molecules. Shearing longer GC-rich DNAs into smaller and more numerous molecules would increase the apparent GC content since this would raise the relative number of available 3′-termini from the GC-rich sequences while the shorter AT-rich sequences would not shear into as many sequenceable fragments. Although the HeliScope platform is noted for its improved performance in both GC-and AT-rich regions in comparison to other high-throughput technologies, it is not impervious to these biochemical effects. This would cause an asymmetric denaturation step with only shorter and/or AT-rich molecules becoming single-stranded in the pool. An alternative, similar mechanism is that the sheared DNA molecules (which mostly derive from the high-molecular-weight peak) are more likely to be cross-linked to proteins and themselves, thus preventing denaturation and sequencing. Since the shearing raised the average length of the molecules (in comparison to the smaller 'ancient DNA' peak), this indicates that a far greater percentage of the sequenced molecules in the sheared pools were derived from the high-molecularweight 'microbial contaminant' peak than from the 'ancient DNA' peak. Many microbial metagenomes have been noted to be particularly GC-rich. Our samples have a high GC content at baseline (mean 62.4%, standard deviation 1.1% in the unsheared bone and mean 62.1% in the soil). Shearing reduced the apparent GC content to 46.4% (standard deviation 1.0%) in the bones and a 58.1% in the soil. Moreover, there is a decrease in molecular length in comparison to their unsheared counterparts (median lengths of 26-30 bp in the sheared samples versus 32-34 bp in the unsheared samples). The AT enrichment, reduction in molecular length and increased percentage of human molecules are consistent with the denaturation hypothesis since the sequenced sheared molecules would on average have lower GC contents and be shorter, thus increasing the likelihood of sequencing endogenous molecules.

Antarctic phosphatase treatment
Ginolhac and colleagues [32] reported a 7.5-9.7-fold enrichment in the total sequence yield in their tSMS data after treatment of samples with Antarctic phosphatase. Conversely, we observed no such enrichment in our data; phosphatase-treated samples had 80%-104% relative yield of their untreated counterparts. They also documented a decrease in the proportion of endogenous sequences in the phosphatase-treated samples (34%-50% relative yield compared to untreated samples). In general agreement with Ginolhac and colleagues' results, we observed a decrease in endogenous human sequences after phosphatase treatment (70%-90% relative yield). The discrepancy in overall yields may be due to differences in the concentrations of sequenceable molecules in the DNA extracts, such as those possibly caused by varying burial environments (Pleistocene permafrost versus Mexican highlands), GC contents between the sample (~40% in their data versus~60% in our individuals) and extraction procedures. If our samples had greater concentrations of sequenceable molecules (which is likely since our untreated sample yields exceeded those of Ginolhac and colleagues by~100-fold), it is possible that we saturated the channel and thus limited the effects of phosphatase treatment. The discrepancy between Ginolhac and colleagues' and our phosphatase treatment results could also be a statistical artifact. Due to the large number (typically 10-20 million) of sequences generated in each HeliScope channel and the significant variation between replicate runs (relative standard deviations ranged between 0.66% and 6.5%), the balance of effect sizes versus discriminatory power is difficult to optimize.

Conclusions
Environmental contamination is a critical issue in ancient DNA investigations of diseases. Both false positives and false negatives are a serious problem. Previous investigations of ancient pathogens using both PCR and highthroughput sequencing may need to be re-evaluated because pathogen identifications require extensive verification procedures due to the high risk of false positives. Currently, molecular enrichment to isolate molecules of interest from the complex background (via either hybridization to probes or amplification of 16S regions) is the most promising route for ancient pathogen studies.
Furthermore, although the benefits of single-molecule sequencing are promising, methodological challenges remain in its application to ancient DNA research. We found that tSMS is not immune to GC-content-related biases and that the benefits of phosphatase-treatment are not universal. Analysis of these data remains complex due to the short lengths of the sequenced molecules and limitations of public databases. Bioinformatic methods developed for modern DNA analysis and for longer sequences are not appropriate for HeliScope ancient DNA data.
Finally, further methodological development is required for the identification of ancient pathogens to become routine and reliable. Until these techniques become available, the burden-of-proof is on the researchers reporting the discovery of these disease agents to demonstrate their results' authenticity.