SigWin-detector: a Grid-enabled workflow for discovering enriched windows of genomic features related to DNA sequences
© Inda et al; licensee BioMed Central Ltd. 2008
Received: 25 April 2008
Accepted: 08 August 2008
Published: 08 August 2008
Chromosome location is often used as a scaffold to organize genomic information in both the living cell and molecular biological research. Thus, ever-increasing amounts of data about genomic features are stored in public databases and can be readily visualized by genome browsers. To perform in silico experimentation conveniently with these genomics data, biologists need tools to process and compare datasets routinely and explore the obtained results interactively. The complexity of such experimentation requires these tools to be based on an e-Science approach, hence generic, modular, and reusable. A virtual laboratory environment with workflows, workflow management systems, and Grid computation is therefore essential.
Here we apply an e-Science approach to develop SigWin-detector, a workflow-based tool that can detect significantly enriched windows of (genomic) features in a (DNA) sequence in a fast and reproducible way. For proof-of-principle, we utilize a biological use case to detect regions of increased and decreased gene expression (RIDGEs and anti-RIDGEs) in human transcriptome maps. We improved the original method for RIDGE detection by replacing the costly step of estimation by random sampling with a faster analytical formula for computing the distribution of the null hypothesis being tested and by developing a new algorithm for computing moving medians. SigWin-detector was developed using the WS-VLAM workflow management system and consists of several reusable modules that are linked together in a basic workflow. The configuration of this basic workflow can be adapted to satisfy the requirements of the specific in silico experiment.
As we show with the results from analyses in the biological use case on RIDGEs, SigWin-detector is an efficient and reusable Grid-based tool for discovering windows enriched for features of a particular type in any sequence of values. Thus, SigWin-detector provides the proof-of-principle for the modular e-Science based concept of integrative bioinformatics experimentation.
Genomic information is encoded in DNA and as such retained in a fairly steady configuration. In contrast to RNA, proteins, and metabolites, DNA is organized by a limited number of large chromosomes with relatively stable DNA sequences. Therefore, position in the DNA sequence, i.e., chromosome location, provides a convenient and essential scaffold for both the living cell and molecular biological research. In cells, for example, chromosomal organization is important for gene-transcription processes. Expression-profiling studies showed that gene expression is not only controlled at the level of individual genes, but also via autonomous regulation of chromosomal domains [1–5]. This suggests the existence of higher-order transcriptional regulatory mechanisms related to DNA organization or structures. The use of chromosomal organization in the life sciences is exemplified by the popularity of genome browsers that use chromosome location to map many genomic features, such as genes and their products, regulatory elements, gene expression, and epigenetic markers. The search for connections between genomic features is important in unraveling cellular mechanisms.
Omics experiments keep producing large amounts of data about genomic features for an increasing number of sequenced genomes. This pace creates a need for new high-throughput methods for identifying correlations between DNA-related features [6–12]. Biologists would therefore benefit from tools that can quickly identify enriched regions of genomic features. This would allow extensive, yet convenient in silico experimentation based on routinely processing and comparing multiple datasets. However, such tools must be implemented in a way that handles the many steps involved in this kind of experimentation. These include: acquiring the data from local or remote data repositories, converting it to the desired format, using it with the actual application that searches for the desired enrichment (possibly using Grid computation), visualizing the results, and comparing and/or integrating multiple datasets. Therefore, such a tool should be developed applying an e-Science approach [13–17]: it should be generic with respect to which data it can analyze, easy to adapt, and its parts should be reusable.
In an e-Science approach, a computational environment that provides transparent access to distributed data, adequate computational resources, as well as the necessary interfacing tools, is called a virtual laboratory (VL). Workflow management systems (WMSs, [18–21]) are an example of interfacing tooling that takes care of scheduling, keeps track of task executions, and provides the management framework necessary to develop applications inside a VL. WMSs can be used to design scientific workflows that automate in silico experimentation by providing a pipeline for streaming large quantities of data through various algorithms, applications and services.
This paper describes an e-Science based data integration and analysis tool: SigWin-detector. This application can detect clusters with increased (or decreased) density of a genomic feature in a DNA-related sequence in a fast and reproducible way. In the context of the development of a VL, our tool was implemented as a workflow running under WS-VLAM [20, 21], a Grid-enabled WMS. A biological use case shows its relevance for biological research. SigWin-detector is based on a method previously used by Versteeg and coworkers [4] to detect regions of increased and decreased gene expression (RIDGEs and anti-RIDGEs) in human transcriptome maps (HTM). We improved the original method by i) deriving an analytical formula for computing the null hypothesis probability distribution, which replaces the costly step of estimation by random sampling, and ii) developing a new algorithm for computing moving medians. While these improvements radically increase the intrinsic efficiency of the method, implementing SigWin-detector using a generic e-Science approach with access to Grid resources broadens its applicability and makes it amenable to a wide spectrum of experiments on genomic features or in fact on any sequence of values.
Significant windows and the mmFDR procedure
Avoiding permutations in the mmFDR procedure
Computationally, the most expensive step in the original mmFDR procedure is the repeated determination of medians over sliding windows of permutations of the input data to estimate the probability function corresponding to the null hypothesis. Our first improvement to the original method was to derive an exact formula for this distribution, f(m), the probability that the value m is the median of a sliding window of size S (see definitions and derivation in Additional file 1).
This exact formula reduces the number of cycles of computing moving medians of an input sequence of approximately 25,000 entries from at least 5,000 to 1. This gives SigWin-detector the efficiency it needs to be used routinely for processing and comparing multiple datasets within minutes to hours, instead of days. Such efficiency could not be achieved if f(m) were estimated by sampling the permutation space E_π and counting the number of times m was the median value in any sliding window of size S.
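To illustrate why a closed form can replace permutation sampling, the following minimal sketch uses the standard order-statistics result for sampling without replacement. It assumes that all N input values are distinct, that the window size S = 2k + 1 is odd, and that under the null hypothesis every window is a uniformly random S-subset of the input. The function names are hypothetical, and the formula in Additional file 1 may differ in notation and in its treatment of tied values.

```python
from math import comb


def median_rank_probability(rank, n, window):
    """Probability that the value with 1-based `rank` among `n` distinct values
    is the median of a window of odd size `window` in a random permutation."""
    if window % 2 == 0 or not 1 <= window <= n:
        raise ValueError("window must be odd and between 1 and n")
    if not 1 <= rank <= n:
        raise ValueError("rank must be between 1 and n")
    k = window // 2
    # Choose k window members among the rank-1 smaller values and
    # k members among the n-rank larger values; the median itself is fixed.
    return comb(rank - 1, k) * comb(n - rank, k) / comb(n, window)


def null_distribution(n, window):
    """Null probability of being a window median, for every rank 1..n."""
    return [median_rank_probability(r, n, window) for r in range(1, n + 1)]


if __name__ == "__main__":
    # Five distinct values, windows of size 3: only the middle ranks can be medians.
    print(null_distribution(5, 3))
```

With tied values, the probability for a given value m would be the sum of these rank probabilities over all ranks that share the value m. A single evaluation of this kind costs far less than thousands of moving-median passes over permuted data.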
Speeding up the computation of moving medians
Additional Figure A1 (Additional file 2) shows a graph comparing our moving medians algorithm with the commonly used algorithm of Hardle and Steiger [22]. While the execution time of their algorithm increases with window size (for a fixed sequence size), the execution time of our algorithm decreases with window size (Figure A1, upper panel). Because SigWin-detector needs to compute moving medians for many window sizes, our algorithm has a clear advantage over Hardle and Steiger's algorithm. In Figure A1, the break-even point of the cumulative computation is at S_max around 400. The efficiency of our method can be further improved by using a mixed algorithm that uses Hardle and Steiger's algorithm for small window sizes and our algorithm for large window sizes, or by employing a divide-and-conquer approach. For example, a two-phase algorithm would start by dividing the input sequence into chunks of size 2M, with M ≥ 2S_max, and applying the original algorithm to each chunk separately. This first phase misses the sliding windows that cross a chunk boundary. The second phase computes the medians for these missing windows by dividing the sequence into chunks of the same size, but now using an offset M. This two-phase algorithm is also suitable for parallelization.
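The two-phase chunking idea can be sketched as follows. This is an illustration only: the per-chunk routine is a simple sorted-list moving median standing in for the authors' algorithm, which is not reproduced here, and all function and parameter names are hypothetical. The sketch only requires m to be at least the window size, which is enough for every window to fall inside some chunk; the text's condition M ≥ 2S_max is stricter and also satisfies it.

```python
from bisect import bisect_left, insort


def moving_medians(values, window):
    """Medians of all sliding windows of odd size `window` (stand-in routine)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    if len(values) < window:
        return []
    half = window // 2
    current = sorted(values[:window])
    medians = [current[half]]
    for i in range(window, len(values)):
        # Drop the value leaving the window, add the value entering it.
        del current[bisect_left(current, values[i - window])]
        insort(current, values[i])
        medians.append(current[half])
    return medians


def two_phase_moving_medians(values, window, m):
    """Chunked moving medians: chunks of size 2*m at offset 0, then at offset m."""
    if m < window:
        raise ValueError("m must be at least the window size")
    n = len(values)
    result = [None] * max(n - window + 1, 0)
    for offset in (0, m):  # phase 1, then phase 2
        for start in range(offset, n, 2 * m):
            chunk = values[start:start + 2 * m]
            # Chunks are independent, so this loop could run in parallel.
            for j, median in enumerate(moving_medians(chunk, window)):
                result[start + j] = median
    return result


if __name__ == "__main__":
    data = [(i * 37) % 101 for i in range(200)]
    assert two_phase_moving_medians(data, 5, 10) == moving_medians(data, 5)
```

A window that crosses a phase-one boundary at position 2Mj lies entirely within the phase-two chunk that starts at 2Mj − M, so every window is covered by at least one phase. Windows away from the boundaries are computed twice in this sketch; a tuned version would skip them in the second phase.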
Designing a Grid-enabled generic workflow
To broaden the applicability of the mmFDR procedure, we implemented SigWin-detector following an e-Science approach, as a general, reusable, and adaptable tool with access to Grid resources, using the WS-VLAM workflow management system [20, 21].
The SigWin-detector Config-Basic1 workflow was tested on a Grid computer cluster composed of geographically distributed computational nodes: the Distributed ASCI Supercomputer 3 (DAS-3 [23]). Additional Figure A2 (Additional file 2) presents wall clock execution times of the SigWin-detector Config-Basic1 workflow (Figure 3) for input sequences of various sizes.
The basic workflow can be altered by substituting, deleting, or adding modules. For example, we can extend the workflow to get the input sequence from a remote uniform resource identifier (URI) and then write the resulting SigWin-map back to it. We can modify the workflow to generate one SigWin-map per logical subsequence of the input sequence, instead of a single SigWin-map for the complete sequence. We can also expand our workflow by computing significant windows for high median values (e.g., RIDGEs) and significant windows for low median values (e.g., anti-RIDGEs) simultaneously. The SigWin-detector workflow itself can be made into a "composite module" for more complex workflows. Furthermore, interconnection of WS-VLAM with the Taverna workbench [19] will permit the use of existing Taverna components in connection with SigWin-detector. At the moment, Grid authentication prevents WS-VLAM workflows from being used outside the Grid without the extra step of Grid certification. However, we are working on a Taverna workflow that encapsulates SigWin-detector, to be made available through the myExperiment webpage [24].
Biological application: finding RIDGEs in a human transcriptome map
[Table: HTM statistical data, given for all window sizes and for window sizes 19–59]
The RIDGEOGRAMS shown in Figures 4 and 5 only take the ordering of the genes into account, and not their actual physical position in the chromosome. However, from a biological perspective it is likely that the higher-order gene-expression mechanisms that underlie RIDGEs relate to an actual section of the chromosome rather than to a cluster of genes merely ordered by their chromosome location. We therefore used SigWin-detector to take the physical gene position into account by subdividing the chromosomes into stretches of constant length (250 kb). If a stretch contains the beginning of one or more genes, their average expression value is assigned to that stretch of DNA. For this analysis we used the SigWin-detector Config-Sub2 with preprocessed HTM data and adapted parameters. The resulting RIDGEOGRAMS are proportional to the chromosome's size (Additional Figure A3, Additional file 2). The anti-RIDGEs show a lower cut-off caused by the many 0 values in the HTM. The results from the SigWin-detector analysis using chromosome position are substantially different from those using chromosome ordering. This application demonstrated that SigWin-detector is an e-Science tool that allows convenient in silico experimentation. To prove that this tool is generic, we used our workflow to examine a simple sequential data set: an extended time series of hourly ground level ozone concentration measurements (Additional file 4).
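The position-based preprocessing described above (250 kb stretches, each holding the average expression of the genes that begin in it) can be sketched as follows. The function name, the input layout of (start position, expression) pairs, and the choice of 0 for stretches without any gene start are assumptions made for illustration; the preprocessing actually used for the HTM data may differ.

```python
from collections import defaultdict


def bin_expression_by_position(genes, chromosome_length, bin_size=250_000,
                               empty_value=0.0):
    """Average expression per fixed-length stretch of a chromosome.

    genes: iterable of (start_position_bp, expression) pairs, 0-based positions.
    Stretches without a gene start receive `empty_value` (an assumption).
    """
    if chromosome_length <= 0 or bin_size <= 0:
        raise ValueError("chromosome_length and bin_size must be positive")
    n_bins = -(-chromosome_length // bin_size)  # ceiling division
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start, expression in genes:
        if not 0 <= start < chromosome_length:
            raise ValueError(f"gene start {start} lies outside the chromosome")
        index = start // bin_size  # stretch containing the beginning of the gene
        sums[index] += expression
        counts[index] += 1
    return [sums[i] / counts[i] if counts[i] else empty_value
            for i in range(n_bins)]


if __name__ == "__main__":
    genes = [(10_000, 2.0), (200_000, 4.0), (600_000, 5.0)]
    print(bin_expression_by_position(genes, 1_000_000))
```

The resulting list has one entry per stretch, so its length scales with chromosome size, and it can be passed to a sliding-window analysis in place of the gene-ordered expression sequence.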
Availability and requirements
Project name: SigWin-detector
Project home page: http://mad-db.science.uva.nl/projects/sigwin/
Programming language: C++
Other requirements: SigWin-detector needs the WS-VLAM workflow management system. WS-VLAM has a client distribution and site distribution.
i. WS-VLAM client distribution: The WS-VLAM composer, a graphical interface used for creating, modifying, and submitting workflows. It needs a Java virtual machine (version 1.5 or higher).
ii. WS-VLAM site distribution: The WS-VLAM engine, which is needed for running the workflows in a Grid. The WS-VLAM engine needs a GLOBUS GT4 (4.0.3) installation.
We thank R. Monajemi for assistance with the HTM data sets, R. H. Bisseling for checking the mathematics, L. O. Hertzberger for his constant support, and J. Batson for proofreading the paper. This work was carried out in the context of the Virtual Laboratory e-Science project http://www.vl-e.nl and BioRange program of the Netherlands Bioinformatics Centre (NBIC). VL-e is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and the ICT innovation program of the Ministry of Economic Affairs (EZ). BioRange is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).
- Spellman PT, Rubin GM: Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002, 1: 5. 10.1186/1475-4924-1-5.
- Boutanaev AM, Kalmykova AI, Shevelyov YY, Nurminsky DI: Large clusters of co-expressed genes in the Drosophila genome. Nature. 2002, 420: 666-669. 10.1038/nature01216.
- Roy PJ, Stuart JM, Lund J, Kim SK: Chromosomal clustering of muscle-expressed genes in Caenorhabditis elegans. Nature. 2002, 418: 975-979.
- Versteeg R, van Schaik BD, van Batenburg MF, Roos M, Monajemi R, Caron H, Bussemaker HJ, van Kampen AH: The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 2003, 13: 1998-2004. 10.1101/gr.1649303.
- Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Hua C, Man Y, Rosenzweig E, Goldy J, Haydock A, Weaver M, Shafer A, Lee K, Neri F, Humbert R, Singer MA, Richmond TA, O Dorschner M, McArthur M, Hawrylycz M, Green RD, Navas PA, Noble WS, Stamatoyannopoulos JA: Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nature Methods. 2006, 3: 511-518. 10.1038/nmeth890.
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng ZP, Snyder M, Dermitzakis ET, Stamatoyannopoulos JA, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Dutta A, Guigo R, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng DY, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Flicek P, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermuller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu JQ, Lian Z, Lian J, Newburger P, Zhang XQ, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Dermitzakis ET, Margulies EH, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan YJ, Snyder M, Birney E, Struhl K, Gerstein M, Antonarakis SE, Gingeras TR, Brown JB, Flicek P, Fu YT, Keefe D, Birney E, Denoeud F, Gerstein M, Green ED, Kapranov P, Karaoz U, Myers RM, Noble WS, Reymond A, Rozowsky J, Struhl K, Siepel A, Stamatoyannopoulos JA, Taylor CM, Taylor J, Thurman RE, Tullius TD, Washietl S, Zheng DY, Liefer LA, 
Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Collins FS, Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Birney E, Keefe D, Hou MM, Taylor J, Nikolaev S, Montoya-Burgos JI, Loytynoja A, Whelan S, Pardi F, Massingham T, Brown JB, Huang HY, Zhang NR, Bickel P, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, Gerstein M, Antonarakis SE, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Pachter L, Green ED, Sidow A, Weng ZP, Trinklein ND, Fu YT, Zhang ZDD, Karaoz U, Barrera L, Stuart R, Zheng DY, Ghosh S, Flicek P, King DC, Taylor J, Ameur A, Enroth S, Bieda MC, Koch CM, Hirsch HA, Wei CL, Cheng J, Kim J, Bhinge AA, Giresi PG, Jiang N, Liu J, Yao F, Sung WK, Chiu KP, Vega VB, Lee CWH, Ng P, Shahab A, Sekinger EA, Yang A, Moqtaderi Z, Zhu Z, Xu XQ, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Clelland GK, Wilcox S, Dillon SC, Andrews RM, Fowler JC, Couttet P, James KD, Lefebvre GC, Bruce AW, Dovey OM, Ellis PD, Dhami P, Langford CF, Carter NP, Vetrie D, Kapranov P, Nix DA, Bell I, Patel S, Rozowsky J, Euskirchen G, Hartman S, Lian J, Wu JQ, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu CX, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang XL, Xu MS, Haidar JNS, Yu Y, Birney E, Weissman S, Ruan YJ, Lieb JD, Iyer VR, Green RD, Gingeras TR, Wadelius C, Dunham I, Struhl K, Hardison RC, Gerstein M, Farnham PJ, Myers RM, Ren B, Snyder M, Thomas DJ, Rosenbloom K, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Haussler D, Kent WJ, Dermitzakis ET, Armengol L, Bird CP, Clark TG, Cooper GM, de Bakker PIW, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Thomas DJ, Woodroffe A, Batzoglou S, Davydov E, Dimas A, Eyras E, Hallgrimsdottir IB, Hardison RC, Huppert J, Sidow A, Taylor J, 
Trumbower H, Zody MC, Guigo R, Mullikin JC, Abecasis GR, Estivill X, Birney E, Bouffard GG, Guan XB, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang HY, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu BL, de Jong PJ: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.
- Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, Haefliger C, Horton R, Howe K, Jackson DK, Kunde J, Koenig C, Liddle J, Niblett D, Otto T, Pettett R, Seemann S, Thompson C, West T, Rogers J, Olek A, Berlin K, Beck S: DNA methylation profiling of human chromosomes 6, 20 and 22. Nature Genetics. 2006, 38: 1378-1385. 10.1038/ng1909.
- van Steensel B: Mapping of genetic and epigenetic regulatory networks using microarrays. Nature Genetics. 2005, 37: S18-S24. 10.1038/ng1559.
- Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics. 1999, 23: 41-46. 10.1038/14385.
- Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB: Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences of the United States of America. 2002, 99: 4465-4470. 10.1073/pnas.012025199.
- Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, Lancet D, Shmueli O: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005, 21: 650-659. 10.1093/bioinformatics/bti042.
- Halasz G, van Batenburg MF, Perusse J, Hua S, Lu XJ, White KP, Bussemaker HJ: Detecting transcriptionally active regions using genomic tiling arrays. Genome Biology. 2006, 7:
- Rauwerda H, Roos M, Hertzberger BO, Breit TM: The promise of a virtual lab in drug discovery. Drug Discovery Today. 2006, 11: 228-236. 10.1016/S1359-6446(05)03680-9.
- Goble C: The low down on e-science and grids for biology. Comparative and Functional Genomics. 2001, 2: 365-370. 10.1002/cfg.115.
- Oehmen CS, Straatsma TP, Anderson GA, Orr G, Webb-Robertson BJM, Taylor RC, Mooney RW, Baxter DJ, Jones DR, Dixon DA: New challenges facing integrative biological science in the post-genomic era. Journal of Biological Systems. 2006, 14: 275-293. 10.1142/S0218339006001805.
- Inda MA, Belloum ASZ, Roos M, Vasunin D, de Laat C, Hertzberger LO, Breit TM: Interactive Workflows in a Virtual Laboratory for e-Bioscience: the SigWin-Detector Tool for Gene Expression Analysis. Proceedings of the e-Science 2006; Amsterdam. 2006, IEEE CS Press.
- Post LJG, Roos M, Marshall MS, van Driel R, Breit TM: A semantic web approach applied to integrative bioinformatics experimentation: a biological use case with genomics data. Bioinformatics. 2007, 23: 3080-3087. 10.1093/bioinformatics/btm461.
- Ludascher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y: Scientific workflow management and the Kepler system. Concurrency and Computation-Practice & Experience. 2006, 18: 1039-1065. 10.1002/cpe.994.
- Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Research. 2006, 34: W729-W732. 10.1093/nar/gkl320.
- Korkhov V, Vasunin D, Wibisono A, Belloum ASZ, Inda MA, Roos M, Breit T, Hertzberger BLO: VLAM-G: Interactive Dataflow Driven Engine for Grid-enabled Resources. Scientific Programming. 2007, 15: 173-188.
- WS-VLAM. [http://www.science.uva.nl/~gvlam/wsvlam]
- Hardle W, Steiger W: Optimal Median Smoothing. Applied Statistics-Journal of the Royal Statistical Society Series C. 1995, 44: 258-264.
- DAS3, The Distributed ASCI Supercomputer 3. [http://www.cs.vu.nl/das3]
- Goble C, Roure DCD: myExperiment: social networking for workflow-using e-scientists. Proceedings of the 2nd workshop on Workflows in support of large-scale science; June 25, 2007; Monterey, California, USA. 2007, ACM Press, 1-2.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.