Using R in Taverna: RShell v1.2
© Rauwerda et al; licensee BioMed Central Ltd. 2009
Received: 19 February 2009
Accepted: 16 July 2009
Published: 16 July 2009
R is the statistical language commonly used by many life scientists in (omics) data analysis. At the same time, these complex analyses benefit from a workflow approach, such as used by the open source workflow management system Taverna. However, Taverna had limited support for R, because it supported just a few data types and only a single output. Also, there was no support for graphical output and persistent sessions. Altogether this made using R in Taverna impractical.
We have developed an R plugin for Taverna: RShell, which provides R functionality within workflows designed in Taverna. In order to fully support the R language, our RShell plugin directly uses the R interpreter. The RShell plugin consists of a Taverna processor for R scripts and an RShell Session Manager that communicates with the R server. We made the RShell processor highly configurable allowing the user to define multiple inputs and outputs. Also, various data types are supported, such as strings, numeric data and images. To limit data transport between multiple RShell processors, the RShell plugin also supports persistent sessions. Here, we will describe the architecture of RShell and the new features that are introduced in version 1.2, i.e.: i) Support for R up to and including R version 2.9; ii) Support for persistent sessions to limit data transfer; iii) Support for vector graphics output through PDF; iv)Syntax highlighting of the R code; v) Improved usability through fewer port types.
Our new RShell processor is backwards compatible with workflows that use older versions of the RShell processor. We demonstrate the value of the RShell processor by a use-case workflow that maps oligonucleotide probes designed with DNA sequence information from Vega onto the Ensembl genome assembly.
Our RShell plugin enables Taverna users to employ R scripts within their workflows in a highly configurable way.
The open source workflow system Taverna  provides access to and integration of many life science web services. However, not all desired data-analysis procedures are available as web services. Therefore, support for scripting is essential and by default, Taverna provides scripting in Java. Many scientists are not familiar with Java and prefer other languages, such as R, a statistical language commonly used by many life scientists . Taverna had limited support for R. We developed an R plugin for Taverna: RShell, which removes the limitations of R support in Taverna workflows. RShell can be incorporated in a Taverna workflow as a processor for executing R scripts. Although RShell was already sketchily introduced in Li et al. , here we present a detailed description of the RShell functionality and architecture. This was also motivated by several requests from Taverna and R users. Additionally, we introduce new functionality of our current RShell version 1.2.
Configuring the RShell processor
The RShell processor is highly configurable: the user can define the script to be executed, the input ports to feed the relevant data, and the output ports to extract the created data. The user can configure the RShell processor by invocating the RShell processor in Taverna's "advanced model explorer". This will show a configuration dialog containing four tabs for the script, the input ports, the output ports and the connection settings.
Syntax highlighting is available in the scripting tab, to help the user write correct R code. The keywords are highlighted in blue; the inputs and outputs of the RShell are highlighted in pink.
RShell supports multiple input and output ports. Each port has a name, corresponding to the variable name it represents, as well as a type, corresponding to the type of data it holds. RShell uses the data type to determine how to exchange data with Taverna. RShell supports booleans, numeric values (both integers and floating point numbers), strings, and, vector of numerics and strings. Inputs and outputs of these data types can directly be used as variables in the R script. RShell provides the text-file data type to handle large data. Ports of this type can be read and written as normal R tables. The PNG and, since version 1.2, the PDF data type are available for graphical output. PNG can be used bitmap graphics; PDF for vector graphics. PNG and PDF outputs are handled as graphics devices in the standard R fashion. Input ports and output ports can be defined using the input ports and output ports tab.
RShell can execute any R script as long as the required libraries are installed in the R server. By default, the RShell processor is configured to use a locally installed R server serving at address localhost, port 6311. It can be configured to use a remote installation of R instead, using the connections tab. This can be useful when, the user is not able to install R, a central installation of R with a specific set of installed libraries is used, or R is installed in a grid environment. Multiple R servers can be accessed within the same workflow.
RShell supports persistent sessions to prevent unnecessary data transfer. The user can enable persistent sessions in the connection tab. When these are enabled, all input data, output data and script variables will be kept in memory of the R-interpreter until the whole workflow execution is finished. Multiple RShell processors in the same workflow are able to use the data provided by previously used processors, without requiring data links between these processors, as long as they use the same R server, Although persistent sessions can be very useful, they have some limitations: i) sessions can only be used among RShell processors, ii) RShell processors involved in a single persistent session have to access the same R server (but RShells devoted to a different session can access other R servers), and iii) it takes extra effort to keep a provenance log of data kept at the R interpreter. Taverna manages all data consumed and produced by processors. By using persistent sessions, Taverna is not aware of the data generated by one RShell processor and used by another RShell processor. If the user wants to record data generated by an RShell processor, he/she can do that by defining an output port for that variable.
Use-case: mapping oligonucleotides to a genome DNA-sequence
With the introduction of our RShell plugin, scripting in R is now available for Taverna. RShell has a configurable processor that is able to execute any R script that can be executed by the available installation of the R-interpreter. The client-server architecture enables a centralised installation of the R-interpreter containing all frequently used libraries used by an organisation. The support for persistent sessions in RShell 1.2 helps to prevent data transfer overload.
RShell 1.0 comes with the standard installation of Taverna. RShell 1.2 can be downloaded as a plugin for Taverna and contains several improvements, such as support for vector graphics and persistent sessions
Availability and requirements
Project name: RShell v1.2
Project home page: http://ewi.utwente.nl/~biorange/rshell/
Operating system(s): Any (Java)
Programming language: Java
Other requirements: Java 1.6, Taverna 1.3+ and R with the Rserve library installed. Taverna, R and Rserve are all open source and freely available.
Licence: GNU GPL
This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).
- Oinn T, Li P, Kell D, Goble C, Goderis A, Greenwood M, Hull D, Stevens R, Turi D, Zhao J: Workflows for e-Science. Berlin, DE: Springer Verlag 2007 chap. Taverna/myGrid: Aligning a Workflow System with the Life Sciences Community, 300-319.Google Scholar
- Ihaka R, Gentleman R: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5 (3): 299-314. 10.2307/1390807.Google Scholar
- Li P, Castrillo JI, Velarde G, Wassink I, Soiland-Reyes S, Owen S, Withers D, Oinn T, Pocock MR, Goble CA, Oliver SG, Kell DB: Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially-expressed genes from microarray data. BMC Bioinformatics. 2008, 9 (334): 1-38.Google Scholar
- Urbanek S: Rserve – A Fast Way to Provide R Functionality to Applications. Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria. Edited by: Hornik KZA, Leisch F. 2003, 20-22.Google Scholar
- He Z, Wu L, Li X, Fields M, Zhou J: Empirical establishment of oligonucleotide probe design criteria. Applied and Environmental Microbiology. 2005, 71 (7): 3753-3760. 10.1128/AEM.71.7.3753-3760.2005.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.