Simulating a population genomics data set using FlowSim
© Malde; licensee BioMed Central Ltd. 2014
Received: 21 November 2013
Accepted: 21 January 2014
Published: 31 January 2014
The field of population genetics use the genetic composition of populations to study the effects of ecological and evolutionary factors, including selection, genetic drift, mating structure, and migration. Until recently, these studies were usually based upon the analysis of relatively few (typically 10–20) DNA markers on samples from multiple populations. In contrast, high-throughput sequencing provides large amounts of data and consequently very high resolution genetic information. Recent technological developments are rapidly making this a cost-effective alternative. In addition, sequencing allows both the direct study of genomic differences between population, and the discovery of single nucleotide polymorphism marker that can be subsequently used in high-throughput genotyping. Much of the analysis in population genetics was developed before large scale sequencing became feasible. Methods often do not take into account the characteristics of the different sequencing technologies, and consequently, may not always be well suited to this kind of data.
Although the FlowSim suite of tools originally targeted simulation of de novo 454 genomics data, recent developments and enhancements makes it suitable also for simulating other kinds of data. We examine its application to population genomics, and provide examples and supplementary scripts and utilities to aid in this task.
Simulation is an important tool to study and develop methods in many fields, and here we demonstrate how to simulate a high-throughput sequencing dataset for population genomics.
Simulation is an important tool for developing and experimenting with methods for analysis of sequencing data. Several simulators exist, usually targeting specific data types or analyses. For instance, MetaSim  targets metagenomic samples, and SimSeq (St. John, unpublished) and Wgsim  target Illumina sequences.
As implied by the name, FlowSim  was originally developed for simulation of de novo genomics data on the 454 platform. Since its inception, it has grown into a flexible suite of tools that can be applied to a number of different uses, and here we demonstrate how it can simulate a population genomics data set consisting of Illumina reads.
A sequencing dataset for population genomics typically consists of reads from pools of individuals from a species, where each pool is taken from a specific populations or subpopulation of interest. By identifying and quantifying variants in the different pools, one can calculate the degree of divergence and population structure between the populations. In turn, this information can be used to study evolution [4, 5], quantitative traits , and also constitutes an important tool for estimating biological diversity.
The FlowSim suite
Methods and results
Under the current simulations, a population consists of a number of individuals with specific genetic variations. For simplicity, we will consider our populations as a sets of genome sequences, each similar to a reference genome, but differing in a set of locations with unique substitutions. We will refer to these genomes as the haplotypes of the population. Each haplotype (and thus its specific genomic variants) occurs with a specific frequency in the population as a whole.
Starting with a single haplotype (i.e., a reference genome or chromosome), we generate the new haplotypes by introducing random mutations. The mutations are identified, and noted separately. The resulting haplotypes are then concatenated in desired multiplicities into a combined genome representing each population, and sets of simulated reads are generated by selecting fragments randomly from the the population genomes. Finally, to simulate sequencing errors, artifacts , and the occurrence of rare variants , the reads have additional variations introduced. Also, a random selection of reads are output multiple times in order to simulate the occurrence of artificial duplicates [10, 11].
Although here we generate intermediate files, each step can also read from standard input and write to standard output. Thus, intermediate files can be omitted using UNIX pipes.
In step three, we can use clonesim to generate reads by extracting 20 M (-c 2000000) random fragments of exactly 100 bp length (using the -l option to set the length distribution to Uniform 100 100). The generated reads are exact copies of fragments of the reference genome, and in order to simulate sequencing errors and rare variants, in step four we again apply mutator, this time allowing indels as well as substitutions. Finally, we randomly duplicate some of the reads, using the duplicator tool.
FlowSim provides the basic building blocks for simulating the sequencing process, but analysis often depends on additional information, and sometimes requires intermediate steps to adapt the data.
A natural step in the analysis of sequence reads, simulated or otherwise, is to map them to a reference genome. This is also useful to verify that the data exhibits the expected properties, like coverage distribution or error rates. The simulation here produces FASTA sequences, but most short read mapping software accept FASTQ as input. Converting from FASTA to FASTQ is a simple task, here a small tool (called fasta2fastq) was written to perform this conversion.
Discussion and conclusion
As FlowSim is primarily targeted at accurate simulation of 454 sequencing, in the present study, we have applied a simplistic model for Illumina sequences. For instance, the probability of error is uniform along each read, and independent of base, and factors that can cause sequencing bias, like e.g. the read’s GC content  or strand , are not taken into account. Sometimes a simple model suffices, and it can also make analysis simpler. However, the individual components of FlowSim can easily be replaced by custom tools, and if a more accurate sequencing model is required, it can be implemented separately, and integrated into the simulation pipeline.
Similarly, we could conceive of a more realistic model for the reference genome, in order to explore properties likely to affect our analysis. For instance, repeats caused by recent duplications (common in many plants and teleosts), transposons, or low complexity regions could have dramatic impacts on analysis. Also artifacts of the reference assembly, where chimeric contigs, collapsed repeats, and contamination could have substantial effects on the result. Again, the user is free to implement appropriate designs and insert them as separate stages in the simulation pipeline.
Here we have explored the use of FlowSim for a population genetics study. A similar approach would also allow it to be used for shotgun metagenomics. In that case, the populations would consist of genomes (haplotypes) from different species, instead of originating in a single reference genome. One might also consider mutations of haplotypes in more complex arrangements to emulate evolution of closely related species.
Availability and requirements
Thanks to Kevin Glover for helpful comments on the manuscript. This article was funded in part by the Research Council of Norway through the SALMAT project (HAVBRUK 226221).
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH: Metasim–a sequencing simulator for genomics and metagenomics. PLoS ONE. 2008, 3 (10): 3373-10.1371/journal.pone.0003373. doi:10.1371/journal.pone.0003373View ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAM tools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMedPubMed CentralView ArticleGoogle Scholar
- Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data - enabling realistic simulation with flowsim. Bioinformatics. 2010, 26 (18): i420-i425. 10.1093/bioinformatics/btq365.PubMedPubMed CentralView ArticleGoogle Scholar
- Tajima F: Evolutionary relationship of dna sequences in finite populations. Genetics. 1983, 105 (2): 437-460.PubMedPubMed CentralGoogle Scholar
- Turner TL, Bourne EC, Von Wettberg EJ, Hu TT, Nuzhdin SV: Population resequencing reveals local adaptation of arabidopsis lyrata to serpentine soils. Nat Genet. 2010, 42 (3): 260-263. 10.1038/ng.515.PubMedView ArticleGoogle Scholar
- Calvo SE, Tucker EJ, Compton AG, Kirby DM, Crawford G, Burtt NP, Rivas M, Guiducci C, Bruno DL, Goldberger OA, Redman MC, Wiltshire E, Wilson CJ, Altshuler D, Gabriel SB, Daly MJ, Thorburn DR, Mootha VK: High-throughput, pooled sequencing identifies mutations in nubpl and foxred1 in human complex i deficiency. Nat Genet. 2010, 42 (10): 851-858. 10.1038/ng.659.PubMedPubMed CentralView ArticleGoogle Scholar
- Malde K: Flower: extracting information from pyrosequencing data. Bioinformatics. 2011, 27 (7): 1041-1042. 10.1093/bioinformatics/btr063.PubMedPubMed CentralView ArticleGoogle Scholar
- Balzer S, Malde K, Jonassen I: Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics. 2011, 27 (13): 304-309. 10.1093/bioinformatics/btr251.View ArticleGoogle Scholar
- Bhatia G, Patterson N, Sankararaman S, Price AL: Estimating and interpreting fst: the impact of rare variants. Genome Res. 2013, 23 (9): 1514-1521. 10.1101/gr.154831.113.PubMedPubMed CentralView ArticleGoogle Scholar
- Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3: 1314-1317. 10.1038/ismej.2009.72.PubMedView ArticleGoogle Scholar
- Balzer S, Malde K, Grohme M, Jonassen I: Filtering duplicate reads from 454 pyrosequencing data. Bioinformatics. 2013, 29 (7): 830-836. 10.1093/bioinformatics/btt047.PubMedPubMed CentralView ArticleGoogle Scholar
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group: The variant call format and vcftools. Bioinformatics. 2011, 27 (15): 2156-2158. 10.1093/bioinformatics/btr330.PubMedPubMed CentralView ArticleGoogle Scholar
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB: Characterizing and measuring bias in sequence data. Genome Biol. 2013, 14: R51-10.1186/gb-2013-14-5-r51.PubMedPubMed CentralView ArticleGoogle Scholar
- Guo1 Y, Li J, Li C-I, Long J, Samuels DC, Shyr Y: The effect of strand bias in illumina short-read sequencing data. BMC Genomics. 2012, 13: 666-10.1186/1471-2164-13-666.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.