Functionality of Perl modules and scripts
The module divest.pm is the diversity estimator module that reads an input file derived from assembly programmes like cap3 or the alignment file output from multiple sequence alignment programmes like ClustalW or BioEdit. The file contains the assemblies of reads for each user defined group. The user can choose to calculate diversity statistics for all groups or one particular group. The statistics calculated include nucleotide diversity, number of SNPs, and type of variant whether transition or transversion, SNP frequency, number of haplotypes, the PIC (polymorphism information content) of SNP and haplotype, haplotype frequency and also number and frequency of indels.
SNP and indel frequency are simply calculated as the total length of the sequence in base pairs/number of SNPs or indels. The count of number of haplotypes and haplotype frequency is based on the standard methods [14]. An important addition in these calculations is the consideration of heterozygous loci. The user can replace an N with an H (in case of true heterozygote) based on two peaks in chromatograms obtained from sequencing machines. The algorithm assigns 'H' two alleles (A/T or C/G each with a value of 0.5) and this is used in haplotype analysis and sequence diversity calculations.
The nucleotide diversity π is calculated using the formula [15]
where k is the number of SNPs identified in an alignment of 'n' genotypes, L is the number of basepairs and
.
The PIC of SNP is calculated using the formula [16]
where pi is the frequency of the i th allele at a given SNP locus.
The haplotype diversity is calculated using the formula [17, 18]
where K is the number of haplotypes, pi is the frequency of haplotypes and n is the total number of reads.
The statistics generated is exported in Excel format, besides the script also generates input file to Network 4.502 http://www.fluxus-technology.com/sharenet.htm. Network helps generate evolutionary trees from various data including haplotype and haplotype frequency data.
Scripts that enable the conversion of a multiple sequence alignment into the diversity estimator input file have been currently written for the ClustalW and cap3 programmes. Another script that enables concatenating cap3 contig and singlets file is also available, the output of this file generates a 'unigene' file that can serve as input to the MISA microsatellite analysis programme. An example analysis pipeline is provided in Figure 1 along with the PISE interface to the diversity estimator programme and its output.
Accessibility of modules and scripts
Figure 2 illustrates the XML specification for a module like divest.pm. Chaining together of programmes is made feasible through the parameter setting in the XML file, piping output files to input file of the next programme. In a PISE implementation available at http://hpc.icrisat.cgiar.org/Pise/5.a/, all mentioned scripts are available under assembly, format parsing, statistics and data manipulation directories. A sample ACD file for the divest module is also given, this provides meta-data regarding the input and output files for this module and can be used by soaplab for deploying divest.pm as a web service.
Intended use
The modules and scripts being provided through the PISE environment can be used for various purposes (a) identifying microsatellites in a set of unigenes, (b) calculating the diversity between two or more groups of data based on geographical location of the genotypes (e.g. cultivated, wild, landraces, etc.), regions of the gene (e.g. intronic and exonic regions) being sampled, (c) sequence diversity features in a candidate gene across several genotypes and (d) identifying SNPs and finding out how many of these are convertible to CAPS, (e) functionally annotate sequences. The ability to chain scripts to generate pipelines reduces the file management burden for the user. A part of the pipeline indicated in Figure 1 can also be executed through the Taverna workflow environment [19], which provides web service access besides local service access. The advantage to the user would be the accessibility and integration of databases through web services. Some of the scripts are available as soaplab web services at http://220.227.242.214:8080/soaplab2/; under the heading sequence analysis.
As the modules and scripts are being released as open source code, interested users can continue to use as well as make improvements to them.