Effect of diversity and missing data on genetic assignment with RAD-Seq markers
© Chattopadhyay et al.; licensee BioMed Central Ltd. 2014
Received: 22 July 2014
Accepted: 18 November 2014
Published: 25 November 2014
Reduced representation libraries are being used as a preferred source of markers to address population genetic questions. However, libraries of RAD-Seq variants often suffer from significant percentage of missing data. In addition, algorithms used to mine SNPs from the raw data may also underscore biological variation. We investigate the effect of biological diversity in mining SNPs from the program STACKS and the effect of missing data on individual assignment implemented in STRUCTURE. We observed that changing diversity parameters in STACKS significantly alters the number of SNPs discovered and allowing for higher percentage of missing data retrieves more loci and possibly more power for individual assignment.
RAD-Seq and missing data
Reduced representation genomic libraries are increasingly used to answer diverse questions in evolutionary biology, which remained unresolved otherwise. Various restriction-site based genome scans have become standard tools for both population genetic and phylogenetic analyses . Although extremely useful, these techniques are cost prohibitive and generation of data from hundreds of individuals across multiple runs may not really be an option for many research labs. Besides, technical issues related to data generation result in the requirement of multiple runs to troubleshoot, thereby increasing the cost manifold.
RAD-Seq (restriction site-based reduced representation genomic libraries) generates tens of thousands of loci per individual, but overlapping loci across all individuals are much fewer, resulting in significant missing data. Since missing data could impact inference, it is important to test its effect on the analysis. While missing data may not significantly impact phylogenetic inference , other forms of population genetic inferences remain untested.
We compared genetic assignment and group membership of individuals as inferred by nuclear autosomal microsatellite loci and genome-wide single nucleotide polymorphisms (SNPs) sites obtained from RAD-Seq for ten individuals of two Cynopterine fruit bat species Cynopterus sphinx and C. brachyotis (Additional file 1: Table S1). We assessed the effect of missing data on group assignment in RAD-Seq. Because software tools are used to ascertain SNPs in RAD data, SNP calling is impacted by assumptions of the software tools. Thus the number and quality of markers obtained is highly influenced by the software. We also investigated how such assumptions impact results, specifically the effect of change in diversity parameters (mismatch between loci) within a widely used tool for mining SNPs with RAD-Seq data, STACKS 1.09 .
Library preparation and analyses
As an ongoing project to estimate gene flow and understand its dynamics between these two bat species in sympatry, we used data from ten individuals from a pool of 387 genotyped individuals (Additional file 1). We prepared standard RAD libraries with individual barcodes (Additional file 1: Table S1). We performed a standard paired end run (single lane) in HiSeq 1000. Data output was 91 million reads, but paired-end quality was poor and was not considered in further analyses (Additional file 1). We subsequently analyzed ~47 million forward reads. We observed more than 90% data loss due to ambiguous barcode (Additional file 1). The average number of reads per individual was 468,612.3 (range: 366,389 to 731,138, Additional file 1: Table S1). Within the STACKS pipeline, the basic algorithm for arranging reads into stacks depends upon absolute nucleotide matches and has often been regarded as conservative . Additionally, there is evidence that when sequence diversity is high (as in our case where we are examining sequences from two distinctly diverged genomes) stacks may remove a majority of the loci from its analysis, or may separate single locus into two .
We assessed the sensitivity of STACKS to these deviations in differentiating between these well-defined taxa. First we obtained SNPs from different run parameters, with 50% missing data. Further, for the parameter combination that provided meaningful number of loci, we obtained SNPs with different extent of missing data (10%, 30%, 50%, 70% and 90%). For all these parameter sets we performed independent STRUCTURE runs (50,000 burnin and 100,000 mcmc) considering two genotypic clusters (K =2) . Each run was replicated five times and the mean of ancestry coefficient across all these replicates was used to obtain trends.
The sampling were approved by the institutional ethics committees (Internal Research Review Board (IRB), Ethical Clearance (EC), Biosafety and Animal Welfare committee approval to BC dated 21-11-2005 Madurai Kamaraj University and Institutional Animal Ethics Committee (IACE) to UR id UR-3/2009, National Centre for Biological Sciences).
Raw sequence reads have been deposited in the Sequence Read Archive (SRA) (accession no. SRP042963).
UR acknowledges funding from DST grant (AS-65) for RAD sequencing. BC was supported by DAE grant ‘Biogeography of the Indian subcontinent’ to UR. The authors thank Arjun Sivasundar, Julian Catchen, Paul Etter and Scott Edwards for technical advice; Meghana Natesh for her comments on the previous version of the manuscript; Sripathi Kandula for samples and cCAMP (Project ID: BT/PR3481/INF/22/140/2011) for the Illumina Hi-Seq run.
- Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML: Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet. 2011, 12: 499-510. 10.1038/nrg3012.PubMedView ArticleGoogle Scholar
- Hovmoller R, Lacey Knowles L, Kubatko LS: Effects of missing data on species tree estimation under the coalescent. Mol Phylogenet Evol. 2013, 69: 1057-1062. 10.1016/j.ympev.2013.06.004.PubMedView ArticleGoogle Scholar
- Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA: Stacks: an analysis tool set for population genomics. Mol Ecol. 2013, 22: 3124-3140. 10.1111/mec.12354.PubMedPubMed CentralView ArticleGoogle Scholar
- Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE: Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One. 2012, 7: e37135-10.1371/journal.pone.0037135.PubMedPubMed CentralView ArticleGoogle Scholar
- Eaton DAR: PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics. 2014, btu121Google Scholar
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.PubMedPubMed CentralGoogle Scholar
- Huang H, Knowles LL: Unforeseen consequences of excluding missing data from next-generation sequences: simulation study of RAD sequences. Syst Biol. 2014, syu046-Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.