SPANDx description
The SPANDx workflow is shown in Figure 1. SPANDx is a shell package written for implementation in a Linux environment using Bash. SPANDx integrates multiple freely available Linux-based programs (BWA[3, 4], SAMTools[5], Picard, GATK[7, 8], VCFtools[9], BEDTools[6] and SnpEff[10]) into a single pipeline for alignment, variant identification, analysis and annotation from raw NGS data derived from haploid organisms. Using data generated from our prior WGS studies[13–16], we have tested the performance of SPANDx using paired-end Illumina (GA
IIx
, MiSeq and HiSeq2000) data, and single-end Ion PGM, Illumina, and 454 GS-FLX/FLX+ data. SPANDx is designed to run in a cluster environment and utilises parallel processing for the majority of the analysis pipeline. To facilitate parallelisation and appropriate resource allocation, SPANDx requires a Linux/UNIX system with PBS. The SPANDx user manual (available at: http://sourceforge.net/projects/spandx/) provides detailed information on installing, operating and where desired, customising this program.
Variant identification and phylogenetic analysis
Variant (i.e. SNP and indel) identification is a fundamental component of any haploid WGS analysis. For this study, default settings for SPANDx (as detailed in the user manual) were used to identify variants; optional settings were included as follows. The -m flag was used to construct core genome SNP matrices from the individual Escherichia coli or Haemophilus influenzae SNP .vcf files for phylogenetic reconstruction. The SPANDx-generated Ortho_SNP_matrix.nex file was directly imported into PAUP* 4.0b10[17] and used to construct maximum parsimony phylogenetic trees (Figures 2,3 and4). For the seven REL E. coli genomes, SnpEff was implemented (using the -a and -v flags) to annotate SNPs.
Presence/absence (P/A) analysis of E. coli and H. influenzae genomes
Defining the core (i.e. loci present in all taxa) vs. accessory (i.e. loci present in at least one taxon) genome is another fundamental application of haploid WGS. This information can be used for many purposes including pan-genome construction, strain-, species- or genus-level signature identification, or for observing patterns of genome reduction. The coverageBED module of BEDTools has been incorporated into the SPANDx pipeline for this purpose. BEDTools determines NGS read coverage depth and breadth across segments or ‘bins’ relative to the reference genome[6], thereby providing an efficient way of using raw NGS reads to identify both core and accessory genomic loci within a dataset compared with a reference genome. For the current study, a default 1 kb window size was used for P/A analysis. SPANDx automatically generates coverageBED genetic locus P/A outputs from all inputted genomes against the reference genome and combines individual outputs into a single human-readable matrix file (Bedcov_merge.txt). Additional file manipulation of P/A matrices was performed using basic features in MS Excel 2010 to create heat maps.
Example P/A matrices generated by SPANDx for the E. coli and H. influenzae datasets, which highlight the core and accessory genomes of these species compared with the reference genome, are respectively shown in Figures 3 and4. We have previously used these outputs to develop a novel speciation target for Burkholderia ubonensis[13] and to characterise genome reduction in Burkholderia pseudomallei[14].
Optimised variant calling for Illumina and Ion PGM data
Other NGS-based genomics tools such as Galaxy[11] require users to specify variant calling settings, which can be a subjective and time-consuming task, particularly for users unfamiliar with NGS data. To combat this issue, SPANDx includes pre-optimised variant calling for both single- and paired-end haploid NGS data across the Illumina and Ion PGM platforms. Although these settings have been optimised using our test datasets, they can be customised if desired by altering the filtering parameters in GATK.config, a file that comes with the SPANDx distribution.
Phylogenetic analysis of SPANDx SNP outputs
Using a combination of VCFtools, GATK and several quality control and filtering steps, SPANDx automatically generates error-corrected core genome SNP matrices for phylogenetic analysis that can be directly imported into the phylogenetic programs PAUP*, PHYLIP and RAxML, the latter two of which are open-source software. The extensive error checking, filtering and variant identification steps undertaken in SPANDx using GATK ensure that the identified SNPs are as accurate as possible using NGS data. Example maximum parsimony analyses of SPANDx-generated data for the E. coli and H. influenzae datasets are shown in Figures 2,3 and4.
Program comparison: SPANDx vs. BRESEQ
We performed an in-depth comparison of SPANDx with BRESEQ, a comparative genomics tool specifically designed for identifying SNPs, indels and large deletions in closely related microbial-sized genomes (http://barricklab.org/twiki/bin/view/Lab/ToolsBacterialGenomeResequencing). Due to limitations on BRESEQ data inputs, we only compared the six closely related E. coli REL genomes spanning 2 K to 40 K generations; REL607 was not included in the BRESEQ study[19] and was therefore excluded in this comparison. Default settings for SPANDx (as detailed in the SPANDx user manual) were used to identify variants. SnpEff was implemented using the -a and -v flags in SPANDx to annotate SNPs.
SNPs
SPANDx and BRESEQ identified identical SNPs for the 2 K, 5 K, 10 K, 15 K and 20 K mutants (3, 9, 16, 22 and 28 SNPs, respectively)[19]. One additional SNP in the 20 K strain, located at position 2129116 of insB-15 in REL606, was not identified by either SPANDx or BRESEQ and was only discovered by Sanger sequencing[19]. This SNP was not able to be identified from NGS read data due to the paralogous nature of the IS1 insertion sequence element in this genome. BLAST analysis of IS1 in REL606 identified 27 highly related copies (>99% match across 100% of bases), with up to three SNPs present among the paralogues. Using NGS data, especially data harbouring relatively small insert sizes (~80-170 bp with this dataset), such loci cannot be accurately mapped. Therefore, the exclusion of this SNP from both the SPANDx and BRESEQ pipelines demonstrates the inherent limitations of using short-read NGS data for variant calling in large paralogous loci.
SPANDx analysis of the 40 K strain identified only 608 SNPs separating the hypermutable strain REL10938 from its REL606 ancestor, compared with the 626 SNPs found using BRESEQ. Closer examination found that these 18 SNPs were either not identified by SPANDx or were excluded using the default filtering parameters due to non-polymorphic (n = 1) or ambiguous (n = 11) genotypes, or poor mapping quality and/or insufficient (<0.5× of average) coverage (n = 6). The default parameters for SNP calling in SPANDx have been optimised such that the ability to identify only ‘real’ variants is maximised; false-positives are not tolerated with these settings, in line with GATK recommendations. Loosening of these parameters results in additional SNPs being identified, some of which may turn out to be ‘real’ upon confirmation with e.g. Sanger sequencing; however, the trade-off is that false-positives begin plaguing the dataset (results not shown). Given the nature of NGS data and the behaviour of NGS alignment programs, neither variant calling method is incorrect per se, but these minor differences between programs highlight the need to verify questionable SNPs from NGS data using secondary methods including manual inspection of NGS read alignments in e.g. Tablet[22], or wet laboratory-based analyses such as Sanger sequencing or allele-specific PCR.
Indels and chromosomal rearrangements
Comparison of SPANDx and BRESEQ for identifying small (<20 bp) indels in the REL strain cohort demonstrated that both methods were identical (variants are detailed in Supplementary Table two from[19]). Neither method identified a known 1.49Mbp inversion[23].
Large deletions and insertions
Large insertions are not currently able to be detected using SPANDx. However, for highly related strains these signatures can be detected with BRESEQ, as exemplified by the identification of ten IS element insertions with BRESEQ that were not found by SPANDx. Identification of large deletions (>20 bp) showed that, on a gross level, there was good consistency between the programs. However, the size of the deletions varied between SPANDx and BRESEQ, with SPANDx overestimating deletion size for three of the five identified deletions by ~0.7 to 1.4 kb. BLAST analysis of these regions showed that the additional sequence called as ‘deleted’ by SPANDx corresponded with paralogous IS element loci (results not shown). This finding was expected, being consistent with inherent read mapping difficulties across paralogous loci using short-read NGS data.
Program comparison: SPANDx vs. Galaxy
Although we did not directly test Galaxy in this study, a previous study has used this program to compare E. coli strain REL607 with REL606[18]. SPANDx identified that REL607 is a dual-nucleotide variant of REL606 at the araA and recD loci (Figure 2); no indels were found by either program. Thus, SPANDx confirmed previous variant findings identified using Galaxy[18].
Cross-platform reproducibility of SPANDx
The performance of bioinformatics tools across multiple NGS platforms is an important consideration for analysis reproducibility and program utility. To address this question, we tested the performance of SPANDx using 19 H. influenzae strains subjected to two different NGS platforms: single-end Ion PGM and paired-end Illumina (MiSeq and HiSeq2000). SPANDx constructed almost identical core genome SNP phylogenies with these two datasets (Figure 4) despite being generated from platforms with inherently different error profiles and chemistries. In addition, P/A determination across these 19 genomes was essentially identical with these two platforms (Figure 4). These data demonstrate the robustness and accuracy of SPANDx across multiple NGS platforms.