Perl module and PISE wrappers for the integrated analysis of sequence data and SNP features
© Jayashree et al; licensee BioMed Central Ltd. 2009
Received: 02 March 2009
Accepted: 24 May 2009
Published: 24 May 2009
There is a need for software scripts and modules for format parsing, data manipulation, statistical analysis and annotation especially for tasks related to marker identification from sequence data and sequence diversity analysis.
Here we present several new Perl scripts and a module for sequence data diversity analysis. To enable the use of these software with other public domain tools, we also make available PISE (Pasteur Institute Software Environment) wrappers for these Perl scripts and module. This enables the user to generate pipelines for automated analysis, since PISE is a web interface generator for bioinformatics programmes.
A new set of modules and scripts for diversity statistic calculation, format parsing and data manipulation are available with PISE wrappers that enable pipelining of these scripts with commonly used contig assembly and sequence feature prediction software, to answer specific sequence diversity related questions.
Single Nucleotide Polymorphisms (SNPs) are commonly found throughout the genome and provide dense maps over small chromosomal regions. The recent advances in sequencing and genotyping have made large scale SNP diversity analysis possible in several crop species. This helps assess genome variation that can then be harnessed for crop improvement. Sequence diversity information may be desirable across defined groups of sequences, such as candidate gene transcripts from different genotypes, or assembled transcripts for a particular marker from more than one genotype. The grouping could be based on the objective of the study – across race, location, genes or regions within genes. Sequence data analysis usually involves steps such as clustering of sequence data, to determine redundancy levels. Sequence assembly is carried out to generate consensus sequences or contigs and singlets. The user then processes this output to determine presence of microsatellites or SNPs. Along with SNP identification it is also desirable to obtain other aspects from the alignment; such as SNP and indel (insertion-deletions) frequency, the type of variant and haplotypes, PIC value for the SNP and haplotype besides nucleotide diversity (π). Validation of predicted SNP(s) through wet lab experiments is the next step to convert the identified SNP into a genetic marker. Although more than 30 SNP genotyping platforms are currently available, these are both expensive and demand considerable expertise. One solution for validating the identified SNP(s) through cost effective SNP genotyping platform is development of CAPS (cleaved amplified polymorphic sequences) marker by predicting the restriction enzyme that can use the identified SNP as a recognition site.
There are several available software solutions for sequence clustering, and a few popular ones for assembly. The popular group of Clustal programmes , d2-cluster  for EST clustering and cap3 or PCAP [3, 4], the TIGR assembler  or Phrap http://www.phrap.org/phredphrapconsed.html are used for sequence assembly. Similarly, there are many freely available software programmes for the identification of SNPs[7–10]. DnaSP reports on nucleotide polymorphism features from aligned sequence data . None of them however automate group wise identification and reporting of polymorphism statistics and more importantly consider the presence of heterozygous loci in the sequence data. Many available programs read heterozygous SNPs as missing/bad quality sequence data and thus do not consider them for analysis. As a result features such as sequence diversity, PIC of SNP and haplotypes, etc. may be underestimated. The need for a module that could report SNP features for any number of user defined groups coupled with the need to be able to calculate statistics taking into consideration the presence of heterozygous loci led to the development of the SNP DIVersity ESTimator module (divest.pm).
Sequence analysis involves pipelining of data from one software to another and often also includes branched flows such as when annotation of sequences with putative function is also a requirement. Format conversion scripts to convert output of one program to input of another are needed when the user wants to pipeline several tools and modules. Along with output parsing scripts, some degree of automation can be achieved in data analysis tasks. The availability of software environments for pipelining and workflow management help the user to create custom analysis pipelines.
The PISE programme [12, 13] is a robust environment that has been around for many years now and allows integration of internal scripts/tools that are part of the target execution environment as well as external tools that a user runs. Ease of use is achieved through the creation of a graphical user interface (GUI) for all of the programmes/scripts available in the environment and the chaining together of scripts to facilitate automation and analysis. So rather than reinvent a workflow environment; we implemented PISE locally and provided PISE XML wrappers for the Perl scripts and modules developed by us, besides making them available as web services. The availability of the programs and wrapper sripts allow users to implement versatile pipelines either in the familiar browser environment or in the Taverna workbench. The modules and scripts are available to all interested users.
Results and discussion
Functionality of Perl modules and scripts
The module divest.pm is the diversity estimator module that reads an input file derived from assembly programmes like cap3 or the alignment file output from multiple sequence alignment programmes like ClustalW or BioEdit. The file contains the assemblies of reads for each user defined group. The user can choose to calculate diversity statistics for all groups or one particular group. The statistics calculated include nucleotide diversity, number of SNPs, and type of variant whether transition or transversion, SNP frequency, number of haplotypes, the PIC (polymorphism information content) of SNP and haplotype, haplotype frequency and also number and frequency of indels.
SNP and indel frequency are simply calculated as the total length of the sequence in base pairs/number of SNPs or indels. The count of number of haplotypes and haplotype frequency is based on the standard methods . An important addition in these calculations is the consideration of heterozygous loci. The user can replace an N with an H (in case of true heterozygote) based on two peaks in chromatograms obtained from sequencing machines. The algorithm assigns 'H' two alleles (A/T or C/G each with a value of 0.5) and this is used in haplotype analysis and sequence diversity calculations.
where k is the number of SNPs identified in an alignment of 'n' genotypes, L is the number of basepairs and .
where pi is the frequency of the i th allele at a given SNP locus.
where K is the number of haplotypes, pi is the frequency of haplotypes and n is the total number of reads.
The statistics generated is exported in Excel format, besides the script also generates input file to Network 4.502 http://www.fluxus-technology.com/sharenet.htm. Network helps generate evolutionary trees from various data including haplotype and haplotype frequency data.
Scripts that enable the conversion of a multiple sequence alignment into the diversity estimator input file have been currently written for the ClustalW and cap3 programmes. Another script that enables concatenating cap3 contig and singlets file is also available, the output of this file generates a 'unigene' file that can serve as input to the MISA microsatellite analysis programme. An example analysis pipeline is provided in Figure 1 along with the PISE interface to the diversity estimator programme and its output.
Accessibility of modules and scripts
The modules and scripts being provided through the PISE environment can be used for various purposes (a) identifying microsatellites in a set of unigenes, (b) calculating the diversity between two or more groups of data based on geographical location of the genotypes (e.g. cultivated, wild, landraces, etc.), regions of the gene (e.g. intronic and exonic regions) being sampled, (c) sequence diversity features in a candidate gene across several genotypes and (d) identifying SNPs and finding out how many of these are convertible to CAPS, (e) functionally annotate sequences. The ability to chain scripts to generate pipelines reduces the file management burden for the user. A part of the pipeline indicated in Figure 1 can also be executed through the Taverna workflow environment , which provides web service access besides local service access. The advantage to the user would be the accessibility and integration of databases through web services. Some of the scripts are available as soaplab web services at http://188.8.131.52:8080/soaplab2/; under the heading sequence analysis.
As the modules and scripts are being released as open source code, interested users can continue to use as well as make improvements to them.
Modules and scripts that facilitate sequence diversity analysis, format parsing and data manipulation are being made available along with source code. The module/scripts can be chained together with other freely available sequence analysis software to generate pipelines for automated data analysis. The pipelining has been currently made feasible through the PISE web interface generator and a part of it can also be executed on the Taverna workbench. The module and scripts are available to interested users.
Availability & requirements
Modules available from: http://www.icrisat.org/gt-bt/softwares_downloads.htm.
Can be accessed and used at http://hpc.icrisat.cgiar.org/Pise/5.a/
Soaplab service: http://184.108.40.206:8080/soaplab2/
Operating System: platform independent
Programming language: Perl
Any restrictions to use by non-academics: none
Ajax Command Definition
Single Nucleotide Polymorphism
Polymorphism Information Content
Extensible Markup Language
Pasteur Institute Software Environment
Cleaved Amplified Polymorphic markers
Expressed Sequence Tag
Parallel Contig Assembly Program
Comprehensive Perl Archive Network
MIcroSAtellite Identification tool
The authors gratefully acknowledge support through the Generation Challenge Programme (GCP). The authors gratefully acknowledge Martin Senger's(IRRI) help with implementing the pipeline in Taverna.
- Ramu C, Hideaki S, Tadashi K, Rodrigo L, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003, 31: 3497-3500.View ArticleGoogle Scholar
- Burke J, Davison D, Hide W: d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences. Genome Res. 1999, 9: 1135-1142.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9: 868-77.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang X, Wang J, Aluru S, Yang S-P, Hillier L: PCAP: A Whole-Genome Assembly Program. Genome Res. 2003, 13: 2164-2170.PubMed CentralView ArticlePubMedGoogle Scholar
- Sutton G, White O, Adams M, Kerlavage A: TIGR assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci Technol. 1995, 1: 9-19.View ArticleGoogle Scholar
- Phrap documentation. [http://www.phrap.org/phredphrap/phrap.html]
- Batley J, Jewell E, Edwards D: Automated discovery of single nucleotide polymorphism and simple sequence repeat molecular genetic markers. Methods Mol Biol. 2007, 406: 473-94.PubMedGoogle Scholar
- Savage D, Batley J, Erwin T, Logan E, Love CG, Lim GA, Mongin E, Barker G, Spangenberg GC, Edwards D: SNPServer: a real-time SNP discovery tool. Nucleic Acids Res. 2005, 33: W493-5.PubMed CentralView ArticlePubMedGoogle Scholar
- Tang J, Leunissen JA, Voorrips RE, Linden van der CG, Vosman B: HaploSNPer: a web-based allele and SNP detection tool. BMC Genet. 2008, 9: 23-PubMed CentralView ArticlePubMedGoogle Scholar
- Tang J, Vosman B, Voorrips RE, Linden van der CG, Leunissen JA: QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species. BMC Bioinformatics. 2006, 7: 438-PubMed CentralView ArticlePubMedGoogle Scholar
- Rozas J, Sánchez-DelBarrio JC, Messeguer X, Rozas R: DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics. 2003, 19: 2496-2497.View ArticlePubMedGoogle Scholar
- Letondal C: A web interface generator for molecular biology programs in Unix. Bioinformatics. 2001, 17: 73-82.View ArticlePubMedGoogle Scholar
- Gilbert D: Pise: Software for building bioinformatics webs. Brief Bioinform. 2002, 3: 405-409.View ArticlePubMedGoogle Scholar
- Rafalski A: Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol. 2002, 5: 94-100.View ArticlePubMedGoogle Scholar
- Halushka MA, Fan JB, Bentley K, Hsie L, Shen N, et al: Patterns of single nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat Genet. 1999, 22: 239-247.View ArticlePubMedGoogle Scholar
- Kota R, Rudd S, Facius A, Kolesov G, Thiel T, Zhang H, Sten N, Mayer K, Graner A: Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Mol Gen Genomics. 2003, 270: 24-33.View ArticleGoogle Scholar
- Beaty TH, Fallin MD, Hetmanski JB, McIntosh I, Chong SS, Ingersoll R, Sheng X, Chakraborty R, Scott F: Haplotype Diversity in 11 Candidate Genes Across Four Populations. Genetics. 2005, 171: 259-267.PubMed CentralView ArticlePubMedGoogle Scholar
- Sekino M, Hara M, Taniguchi N: Genetic diversity within and between hatchery strains of Japanese flounder Paralichthys olivaceus assessed by means of microsatellite and mitochondrial DNA sequencing analysis. Aquaculture. 2002, 213: 101-122.View ArticleGoogle Scholar
- Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006, W729-W732. 34 web Server
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, providedthe original work is properly cited.