Technical Note
SigWinR; the SigWin-detector updated and ported to R
BMC Research Notes volume 2, Article number: 205 (2009)
Our SigWin-detector discovers significantly enriched windows of (genomic) elements in any sequence of values (genes or other genomic elements in a DNA sequence) in a fast and reproducible way. However, since it is grid based, only (life) scientists with access to the grid can use this tool. Therefore, and at the request of users, we have developed the SigWinR package, which makes the SigWin-detector available to a much wider audience. At the same time, we have introduced several improvements to its algorithm as well as its functionality, based on the feedback of SigWin-detector end users.
To allow usage of the SigWin-detector on a desktop computer, we have rewritten it as a package for R: SigWinR. R is a free and widely used multi-platform software environment for statistical computing and graphics. The package can be installed and used on all platforms for which R is available. The improvements involve: a visualization of the input-sequence values that supports the interpretation of Ridgeograms; a visualization that allows easy interpretation of enriched or depleted regions in the sequence using windows of pre-defined size; an option for the analysis of circular sequences, which results in rectangular Ridgeograms; an application to identify regions of co-altered gene expression (ROCAGEs), with a real-life biological use case; and an adaptation of the algorithm that allows analysis of non-regularly sampled data using a constant window size in physical space without resampling the data. To achieve the latter, support for the analysis of windows with an even number of elements was added.
By porting the SigWin-detector as an R package, SigWinR, improving its algorithm and functionality combined with adequate performance, we have made SigWin-detector more useful as well as more easily accessible to scientists without a grid infrastructure.
For the detection of significantly enriched windows of elements in any sequence of values in a fast and reproducible way, we developed and published a workflow and grid-based tool: SigWin-detector. For instance, elements may be genes and a sequence may consist of values attributed to these genes. SigWin-detector is based on a moving median false discovery rate (mmFDR) procedure using an exact formula. SigWin-detector visualizes significantly enriched windows in Ridgeograms: the results for increasing window sizes, from 1 onward, are stacked on top of each other, thus forming a triangle. Enriched or depleted windows are marked by a color. Windows in the input sequence are considered to be significantly enriched if their median value deviates significantly from the value expected under random ordering of the values in the input sequence. The development of SigWin-detector was originally motivated by the need to identify regions of increased gene expression (RIDGEs) in the human transcriptome map (HTM) [1, 2] (see Additional file 1). However, the applicability of the tool is much wider, as we discovered that SigWin-detector can also be used to identify regions of co-altered gene expression (ROCAGEs).
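The significance criterion above can be illustrated with a small calculation. The sketch below is not the exact formula of the SigWin-detector publication, which is not reproduced in this note. It uses a standard combinatorial argument instead: under random ordering, a window of odd size w = 2k + 1 is a random subset of the sequence, so its median is at least a threshold exactly when at least k + 1 of its elements are at least that threshold. This gives a hypergeometric tail probability. The function name and parameters are hypothetical.

```python
from math import comb


def median_tail_probability(n_total, n_high, window):
    """P(window median >= threshold) under random ordering of the sequence.

    n_total: length N of the whole sequence
    n_high:  number of values in the whole sequence that are >= threshold
    window:  odd window size w = 2k + 1
    """
    if window % 2 == 0:
        raise ValueError("this sketch only covers odd window sizes")
    if not 0 <= n_high <= n_total or not 1 <= window <= n_total:
        raise ValueError("inconsistent arguments")
    k = window // 2
    # The median is >= threshold exactly when at least k + 1 of the
    # window's elements are >= threshold (hypergeometric upper tail).
    favourable = sum(
        comb(n_high, j) * comb(n_total - n_high, window - j)
        for j in range(k + 1, min(window, n_high) + 1)
    )
    return favourable / comb(n_total, window)


# Sequence of 5 values, 2 of which reach the threshold, window size 3.
p = median_tail_probability(5, 2, 3)
```

A small probability for an observed window median indicates enrichment. In practice such p-values would still have to be corrected for the many windows tested, which is the role of the false discovery rate step.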
Because SigWin-detector has been implemented on a grid platform, and many life scientists do not have access to grid resources, we have received requests from users for a SigWin-detector that operates in a non-grid environment. We have therefore ported our SigWin-detector to R, the most commonly used statistical language in omics research. At the same time, we have extended the underlying algorithm and the functionality of SigWin-detector. The R package is called SigWinR.
Porting SigWin-detector to R
The original SigWin-detector workflow was rewritten in the R language, except for the median calculation of the moving windows, which was programmed in C to achieve acceptable performance. SigWinR can produce a Ridgeogram for a sequence containing 10,000 elements in less than 1 minute on a desktop computer with an Intel® Core™ 2 Quad Q8200 processor running at 2.33 GHz with 2 GB of RAM. Hence, it is feasible to analyze whole eukaryotic genomes within a practical timeframe. SigWinR has been developed on R-2.8.0 and has been tested in Linux and Microsoft Windows environments. The package has been validated with the data sets used for the original SigWin-detector (results not shown). Help and documentation are available in the R package, which can be downloaded from the Comprehensive R Archive Network (CRAN, http://cran.r-project.org/).
Visualizing input-sequence values
Ridgeograms are the standard output of SigWinR. To support the interpretation of a Ridgeogram, we have added an XY-plot below it that contains the values of all elements in the sequence. All Figures show examples of this visualization.
Visualizing enriched windows of pre-defined size
Ridgeograms are visualizations of significant windows using all possible window sizes. However, it often occurs that the most interesting scale on which to analyze a sequence is known. For those cases, we have added an extra option in SigWinR that allows identification of significantly enriched windows for a pre-defined subset of window sizes. This makes the analysis considerably more efficient. In Figure 1 (upper and lower right panel) an example of this visualization is shown.
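The saving from restricting the analysis to a few window sizes can be seen in a minimal sketch. The function below is a plain Python illustration and not the SigWinR interface; its name and arguments are hypothetical. It computes the moving medians only for the requested window sizes, one Ridgeogram row per size. It uses a naive median per window, whereas the package computes medians in C for speed.

```python
from statistics import median


def moving_medians(values, window_sizes):
    """Return {window size: medians of all full windows of that size}."""
    result = {}
    for w in window_sizes:
        if not 1 <= w <= len(values):
            raise ValueError(f"window size {w} does not fit the sequence")
        # A linear sequence of length n has n - w + 1 full windows of size w.
        result[w] = [median(values[i:i + w]) for i in range(len(values) - w + 1)]
    return result


values = [3, 1, 4, 1, 5, 9, 2, 6]      # made-up example data
rows = moving_medians(values, [1, 3, 5])
# rows[3] == [3, 1, 4, 5, 5, 6] and rows[5] == [3, 4, 4, 5]
```

The number of windows per row shrinks as the window grows (8, 6 and 4 here), which is the origin of the triangular Ridgeogram for linear sequences.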
Analyzing circular sequences
Since SigWin-detector originates from the life sciences, an obvious extension of SigWinR is the possibility to analyze circular sequences, such as bacterial genomes. As in a linear sequence, the largest meaningful size of the moving window in a circular sequence equals the total length of that sequence. However, whereas in linear sequences the moving windows decrease in number as they increase in size (hence the triangular shape of the Ridgeogram), in a circular sequence windows of every size, including the largest, can still travel across the entire sequence. Therefore, the Ridgeogram is rectangular. An example of an analysis of a circular sequence is shown in Figure 2. Relevant observations could be missed if circular nucleotide sequences are analyzed as artificially linearized sequences based on an arbitrary cut, such as the origin of replication in bacterial genomes.
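Why the circular Ridgeogram is rectangular can be shown with a short sketch. It assumes that windows simply wrap around the end of the sequence; the function name is hypothetical and the code does not reflect how SigWinR implements this option.

```python
from statistics import median


def circular_moving_medians(values, window):
    """Medians of all windows of one size on a circular sequence."""
    values = list(values)
    n = len(values)
    if not 1 <= window <= n:
        raise ValueError("window size must be between 1 and the sequence length")
    doubled = values + values  # lets a slice run past the end and wrap around
    # Every position starts a full window, so each row has exactly n entries.
    return [median(doubled[i:i + window]) for i in range(n)]


genome = [3, 1, 4, 1, 5]                   # made-up example data
row = circular_moving_medians(genome, 3)   # [3, 1, 4, 3, 3]
```

The last two windows, (1, 5, 3) and (5, 3, 1), span the arbitrary cut point. They are exactly the windows that a linearized analysis would never evaluate.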
Identifying elements with altered values that are co-localized in a sequence
In the life sciences, it is often interesting to identify regions in a sequence in which many of the elements have an altered value (ROCAGEs) in the context of an experimental contrast. Thus, when using for instance gene-expression data, instead of identifying RIDGEs, which requires data from many experiments, we would like to be able to identify ROCAGEs within single experiments. To accomplish this, SigWin-detector can be fed an input sequence consisting of gene-expression log ratios or, in a replicated experiment, per-gene t-values or per-gene p-values. As an example, we investigated gene-expression data concerning Down syndrome, which is typified by a trisomy of chromosome 21 (Figure 3 and Additional files 2 and 3). In chromosomal regions that are duplicated, such as chromosome 21 in Down syndrome, one expects to find co-localized genes with altered gene expression. Indeed, Figure 3 and Additional files 2 and 3 show ROCAGEs for the Down syndrome chromosome 21 vs. control tissue using t-values and p-values as input.
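As an illustration of how such an input sequence might be assembled, the sketch below computes one t-value per gene from replicated measurements and keeps the genes in chromosomal order. The choice of Welch's t-statistic, the gene names and all expression values are assumptions made for this example; the statistic used for the Down syndrome analysis is not specified in this note.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch's t-statistic for two groups of replicate measurements."""
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(case) - mean(control)) / se


# Hypothetical genes, listed in chromosomal order: (case, control) replicates.
expression = [
    ("geneA", [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]),
    ("geneB", [5.0, 5.5, 6.0], [5.1, 5.4, 6.1]),
]
# The input sequence is the list of t-values in chromosomal order.
t_sequence = [welch_t(case, control) for _, case, control in expression]
```

The resulting list can then be analyzed in the same way as any other sequence of values.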
Considering the spatial distribution of sequence elements
The spatial distribution of genes on chromosomes is not uniform. In the previous implementation, this was solved by re-sampling the input sequence, which distorts the data. Here we have implemented a method in which the distribution is taken into account by representing the data as a non-regularly sampled sequence, that is, a sequence of (position, value) pairs. In SigWinR, a new function (PosRidgeogram) is available, which calculates a Ridgeogram that uses windows based on the physical location of the elements in the underlying sequence. Medians are calculated for a series of overlapping windows that are regularly spaced in physical space and may therefore contain a variable number of elements. P-values for the median are calculated using the presented exact formula based on the number of elements in the window. Figure 1 and Additional file 4 show results of calculations with the HTM that were obtained using the PosRidgeogram function of SigWinR. Because gene density is not equally distributed along the chromosome, as illustrated by the region near the centromere with low sample density in the lower Ridgeogram in Figure 1, the PosRidgeogram differs from the Ridgeogram. Because a position on the x-axis represents a position on the chromosome, the PosRidgeogram can be interpreted in terms of position.
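The idea of windows defined in physical space can be sketched as follows. This is not the PosRidgeogram implementation: the function name, the half-open window convention, the alignment of the first window with the first element and the handling of empty windows are all assumptions made for illustration.

```python
from bisect import bisect_left
from statistics import median


def position_window_medians(positions, values, width, step):
    """Medians over windows of fixed physical width.

    Returns (window start, number of elements, median or None) per window.
    """
    if not positions or len(positions) != len(values):
        raise ValueError("positions and values must be non-empty and equally long")
    if list(positions) != sorted(positions):
        raise ValueError("positions must be sorted in ascending order")
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    results = []
    start = positions[0]
    while start <= positions[-1]:
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, start + width)  # half-open: [start, start + width)
        members = values[lo:hi]
        results.append((start, len(members), median(members) if members else None))
        start += step
    return results


# Made-up data with a sparse region between positions 20 and 100.
positions = [0, 10, 20, 100, 110, 120, 130]
values = [5, 1, 3, 8, 2, 7, 4]
windows = position_window_medians(positions, values, width=50, step=25)
```

In this example the windows contain 3, 0, 0, 3, 4 and 1 elements. The window holding 4 elements shows why a treatment of even window sizes becomes necessary as soon as windows are defined by position rather than by element count.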
Extending the algorithm
A consequence of the approach that takes physical position into account is that windows may contain an even number of elements. The SigWin-detector algorithm avoids even window sizes, because the previously presented exact formula is only suited for windows with an odd number of elements and the use of odd windows alone already results in a Ridgeogram with sufficient resolution. To handle even windows, we derived a formula that calculates the p-value associated with a certain median, given the sequence length and the window size. This formula is presented in Additional file 5.
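The formula for even windows is given in Additional file 5 and is not reproduced here. For small inputs, however, any such closed-form expression can be checked against brute-force enumeration. The sketch below does this, assuming that the median of an even window is the mean of its two middle values (the convention of Python's statistics.median) and that, under random ordering, every subset of the sequence is equally likely to form the window. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from statistics import median


def median_tail_by_enumeration(values, window, threshold):
    """Exact P(median of a random window >= threshold), by enumeration."""
    if not 1 <= window <= len(values):
        raise ValueError("window size must be between 1 and the sequence length")
    hits = total = 0
    # Enumerate every possible window content; feasible for small inputs only.
    for subset in combinations(values, window):
        total += 1
        hits += median(subset) >= threshold
    return Fraction(hits, total)


# Six values, even window of size 4, threshold 4.
p_even = median_tail_by_enumeration([1, 2, 3, 4, 5, 6], 4, 4)
```

Here 5 of the 15 possible windows have a median of at least 4, so the probability is 1/3. The number of subsets grows combinatorially, so this approach only serves as a validation aid and not as a replacement for an exact formula.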
SigWinR is an R implementation of the grid-based SigWin-detector application for a desktop computer. It performs adequately and makes the SigWin algorithm available to a much wider audience than just the grid community. In addition, SigWinR introduces a number of improvements, both in the algorithms used and in the visualizations that can be produced. For future developments, we are considering parallelization of the SigWinR package.
Availability and requirements
Project name: SigWinR
Project home page: http://mad-db.science.uva.nl/projects/SigWinR/
Programming language: R
Other requirements: -
Inda MA, van Batenburg MF, Roos M, Belloum AS, Vasunin D, Wibisono A, van Kampen AH, Breit TM: SigWin-detector: a Grid-enabled workflow for discovering enriched windows of genomic features related to DNA sequences. BMC Res Notes. 2008, 1: 63. 10.1186/1756-0500-1-63.
Versteeg R, van Schaik BD, van Batenburg MF, Roos M, Monajemi R, Caron H, Bussemaker HJ, van Kampen AH: The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 2003, 13: 1998-2004. 10.1101/gr.1649303.
Ihaka R, Gentleman R: R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314. 10.2307/1390807.
Lockstone HE, Harris LW, Swatton JE, Wayland MT, Holland AJ, Bahn S: Gene expression profiling in the adult Down syndrome brain. Genomics. 2007, 90: 647-660. 10.1016/j.ygeno.2007.08.005.
Audit B, Ouzounis CA: From genes to genomes: universal scale-invariant properties of microbial chromosome organisation. J Mol Biol. 2003, 332: 617-633. 10.1016/S0022-2836(03)00811-8.
Hu J, Zhao X, Yu J: Replication-associated purine asymmetry may contribute to strand-biased gene distribution. Genomics. 2007, 90: 186-194. 10.1016/j.ygeno.2007.04.002.
This work was carried out in the context of: the Virtual Laboratory e-Science project http://www.vl-e.nl supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and the ICT innovation program of the Ministry of Economic Affairs (EZ); and BioRange program of the Netherlands Bioinformatics Centre (NBIC) supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).
The authors declare that they have no competing interests.
WdL specified and implemented the SigWinR package.
HR, MAI and TB all worked on the specification of the SigWinR package and adapted it after discussing its applicability with life scientists.
OB analyzed the Down syndrome gene expression data.