Simulating a population genomics data set using FlowSim

Malde, Ketil

doi:10.1186/1756-0500-7-68

Research article
Open access
Published: 31 January 2014

Simulating a population genomics data set using FlowSim

Ketil Malde¹

BMC Research Notes volume 7, Article number: 68 (2014) Cite this article

1500 Accesses
2 Citations
1 Altmetric
Metrics details

Abstract

Background

The field of population genetics use the genetic composition of populations to study the effects of ecological and evolutionary factors, including selection, genetic drift, mating structure, and migration. Until recently, these studies were usually based upon the analysis of relatively few (typically 10–20) DNA markers on samples from multiple populations. In contrast, high-throughput sequencing provides large amounts of data and consequently very high resolution genetic information. Recent technological developments are rapidly making this a cost-effective alternative. In addition, sequencing allows both the direct study of genomic differences between population, and the discovery of single nucleotide polymorphism marker that can be subsequently used in high-throughput genotyping. Much of the analysis in population genetics was developed before large scale sequencing became feasible. Methods often do not take into account the characteristics of the different sequencing technologies, and consequently, may not always be well suited to this kind of data.

Results

Although the FlowSim suite of tools originally targeted simulation of de novo 454 genomics data, recent developments and enhancements makes it suitable also for simulating other kinds of data. We examine its application to population genomics, and provide examples and supplementary scripts and utilities to aid in this task.

Conclusions

Simulation is an important tool to study and develop methods in many fields, and here we demonstrate how to simulate a high-throughput sequencing dataset for population genomics.

Background

Simulation is an important tool for developing and experimenting with methods for analysis of sequencing data. Several simulators exist, usually targeting specific data types or analyses. For instance, MetaSim [1] targets metagenomic samples, and SimSeq (St. John, unpublished) and Wgsim [2] target Illumina sequences.

As implied by the name, FlowSim [3] was originally developed for simulation of de novo genomics data on the 454 platform. Since its inception, it has grown into a flexible suite of tools that can be applied to a number of different uses, and here we demonstrate how it can simulate a population genomics data set consisting of Illumina reads.

A sequencing dataset for population genomics typically consists of reads from pools of individuals from a species, where each pool is taken from a specific populations or subpopulation of interest. By identifying and quantifying variants in the different pools, one can calculate the degree of divergence and population structure between the populations. In turn, this information can be used to study evolution [4, 5], quantitative traits [6], and also constitutes an important tool for estimating biological diversity.

The FlowSim suite

The current version of FlowSim (0.3.5) consists of several independent components, as illustrated in Figure 1. Each component is implemented as a separate tool, using FASTA-formatted sequences for input and output. (The exception is flowsim, which outputs the native SFF file format. FASTA-formatted sequence can be trivially extracted, e.g. using the flower[7] tool). This makes it easy for the user to construct a custom simulation pipeline tailored to his or her needs. Here, we will make use of clonesim to generate sets of reads, mutator to simulate genetic divergence in the form of SNPs as well as simulating sequencing errors in the simulated reads, and duplicator to introduce artificial duplicates.

Methods and results

Under the current simulations, a population consists of a number of individuals with specific genetic variations. For simplicity, we will consider our populations as a sets of genome sequences, each similar to a reference genome, but differing in a set of locations with unique substitutions. We will refer to these genomes as the haplotypes of the population. Each haplotype (and thus its specific genomic variants) occurs with a specific frequency in the population as a whole.

Starting with a single haplotype (i.e., a reference genome or chromosome), we generate the new haplotypes by introducing random mutations. The mutations are identified, and noted separately. The resulting haplotypes are then concatenated in desired multiplicities into a combined genome representing each population, and sets of simulated reads are generated by selecting fragments randomly from the the population genomes. Finally, to simulate sequencing errors, artifacts [8], and the occurrence of rare variants [9], the reads have additional variations introduced. Also, a random selection of reads are output multiple times in order to simulate the occurrence of artificial duplicates [10, 11].

Implementation

We will presume that our reference genome exists in a file called genome.fasta. First the set of haplotypes are generated by using mutator to randomly introduce on average five mutations per kilobase, using the option -s 0.005. To simplify analysis, we do not introduce indels (-i 0). The following script implements the analysis.

Although here we generate intermediate files, each step can also read from standard input and write to standard output. Thus, intermediate files can be omitted using UNIX pipes.

The next step simply concatenates the haplotypes in different proportions to construct the population genomes, p1.fasta and p2.fasta. Here, we combined the three haplotypes H₁, H₂ and H₃ in proportions of 1:2:3 in population P₁, and 3:2:1 in population P₂, as shown in Figure 2. As a result, an allele present in H₁ (i.e., H1.fasta) will have a minor allele frequency of 0.167 in population P₁, and 0.5 in P₂, giving it an a priori F_ST of 0.125, while variant alleles in H₂ will occur with an equal minor allele frequency (of 0.333) in either population, resulting in an F_st of 0.

In step three, we can use clonesim to generate reads by extracting 20 M (-c 2000000) random fragments of exactly 100 bp length (using the -l option to set the length distribution to Uniform 100 100). The generated reads are exact copies of fragments of the reference genome, and in order to simulate sequencing errors and rare variants, in step four we again apply mutator, this time allowing indels as well as substitutions. Finally, we randomly duplicate some of the reads, using the duplicator tool.

Additional analysis

FlowSim provides the basic building blocks for simulating the sequencing process, but analysis often depends on additional information, and sometimes requires intermediate steps to adapt the data.

A natural step in the analysis of sequence reads, simulated or otherwise, is to map them to a reference genome. This is also useful to verify that the data exhibits the expected properties, like coverage distribution or error rates. The simulation here produces FASTA sequences, but most short read mapping software accept FASTQ as input. Converting from FASTA to FASTQ is a simple task, here a small tool (called fasta2fastq) was written to perform this conversion.

To separate the haplotype variants from simulated sequencing errors, another small tool (snplist) were written to generate the list of variants per haplotype. This compares each haplotype with the reference genome, and outputs a list of the variant positions with reference and alternative allele. To simplify this process, it is conveneint to add the variants identification to e.g. the output from VCFtools [12] or similar variant callers, the following script can be used for this purpose:

Discussion and conclusion

As FlowSim is primarily targeted at accurate simulation of 454 sequencing, in the present study, we have applied a simplistic model for Illumina sequences. For instance, the probability of error is uniform along each read, and independent of base, and factors that can cause sequencing bias, like e.g. the read’s GC content [13] or strand [14], are not taken into account. Sometimes a simple model suffices, and it can also make analysis simpler. However, the individual components of FlowSim can easily be replaced by custom tools, and if a more accurate sequencing model is required, it can be implemented separately, and integrated into the simulation pipeline.

Similarly, we could conceive of a more realistic model for the reference genome, in order to explore properties likely to affect our analysis. For instance, repeats caused by recent duplications (common in many plants and teleosts), transposons, or low complexity regions could have dramatic impacts on analysis. Also artifacts of the reference assembly, where chimeric contigs, collapsed repeats, and contamination could have substantial effects on the result. Again, the user is free to implement appropriate designs and insert them as separate stages in the simulation pipeline.

Here we have explored the use of FlowSim for a population genetics study. A similar approach would also allow it to be used for shotgun metagenomics. In that case, the populations would consist of genomes (haplotypes) from different species, instead of originating in a single reference genome. One might also consider mutations of haplotypes in more complex arrangements to emulate evolution of closely related species.

Availability and requirements

All materials are available on-line, see Table 1 for details. The scripts as well as other tools mentioned are released into the public domain. The documentation for the FlowSim pipeline is available from the Biohaskell Wiki. FlowSim itself is available as Open Source software under the General Public License (GPL) version 2.0.

Table 1 On-line resources and supporting materials

Full size table

References

Richter DC, Ott F, Auch AF, Schmid R, Huson DH: Metasim–a sequencing simulator for genomics and metagenomics. PLoS ONE. 2008, 3 (10): 3373-10.1371/journal.pone.0003373. doi:10.1371/journal.pone.0003373
Article Google Scholar
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAM tools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
Article PubMed PubMed Central Google Scholar
Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data - enabling realistic simulation with flowsim. Bioinformatics. 2010, 26 (18): i420-i425. 10.1093/bioinformatics/btq365.
Article PubMed CAS PubMed Central Google Scholar
Tajima F: Evolutionary relationship of dna sequences in finite populations. Genetics. 1983, 105 (2): 437-460.
PubMed CAS PubMed Central Google Scholar
Turner TL, Bourne EC, Von Wettberg EJ, Hu TT, Nuzhdin SV: Population resequencing reveals local adaptation of arabidopsis lyrata to serpentine soils. Nat Genet. 2010, 42 (3): 260-263. 10.1038/ng.515.
Article PubMed CAS Google Scholar
Calvo SE, Tucker EJ, Compton AG, Kirby DM, Crawford G, Burtt NP, Rivas M, Guiducci C, Bruno DL, Goldberger OA, Redman MC, Wiltshire E, Wilson CJ, Altshuler D, Gabriel SB, Daly MJ, Thorburn DR, Mootha VK: High-throughput, pooled sequencing identifies mutations in nubpl and foxred1 in human complex i deficiency. Nat Genet. 2010, 42 (10): 851-858. 10.1038/ng.659.
Article PubMed CAS PubMed Central Google Scholar
Malde K: Flower: extracting information from pyrosequencing data. Bioinformatics. 2011, 27 (7): 1041-1042. 10.1093/bioinformatics/btr063.
Article PubMed CAS PubMed Central Google Scholar
Balzer S, Malde K, Jonassen I: Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics. 2011, 27 (13): 304-309. 10.1093/bioinformatics/btr251.
Article Google Scholar
Bhatia G, Patterson N, Sankararaman S, Price AL: Estimating and interpreting fst: the impact of rare variants. Genome Res. 2013, 23 (9): 1514-1521. 10.1101/gr.154831.113.
Article PubMed CAS PubMed Central Google Scholar
Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3: 1314-1317. 10.1038/ismej.2009.72.
Article PubMed Google Scholar
Balzer S, Malde K, Grohme M, Jonassen I: Filtering duplicate reads from 454 pyrosequencing data. Bioinformatics. 2013, 29 (7): 830-836. 10.1093/bioinformatics/btt047.
Article PubMed CAS PubMed Central Google Scholar
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group: The variant call format and vcftools. Bioinformatics. 2011, 27 (15): 2156-2158. 10.1093/bioinformatics/btr330.
Article PubMed CAS PubMed Central Google Scholar
Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB: Characterizing and measuring bias in sequence data. Genome Biol. 2013, 14: R51-10.1186/gb-2013-14-5-r51.
Article PubMed PubMed Central Google Scholar
Guo1 Y, Li J, Li C-I, Long J, Samuels DC, Shyr Y: The effect of strand bias in illumina short-read sequencing data. BMC Genomics. 2012, 13: 666-10.1186/1471-2164-13-666.
Article Google Scholar

Download references

Acknowledgements

Thanks to Kevin Glover for helpful comments on the manuscript. This article was funded in part by the Research Council of Norway through the SALMAT project (HAVBRUK 226221).

Author information

Authors and Affiliations

Institute of Marine Research, Nordnesgaten 50, Bergen, Norway
Ketil Malde

Authors

Ketil Malde
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ketil Malde.

Additional information

Competing interests

The author declare that he has no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Malde, K. Simulating a population genomics data set using FlowSim. BMC Res Notes 7, 68 (2014). https://doi.org/10.1186/1756-0500-7-68

Download citation

Received: 21 November 2013
Accepted: 21 January 2014
Published: 31 January 2014
DOI: https://doi.org/10.1186/1756-0500-7-68

Simulating a population genomics data set using FlowSim