A case study for efficient management of high throughput primary lab data
© Colmsee et al; licensee BioMed Central Ltd. 2011
Received: 28 July 2011
Accepted: 17 October 2011
Published: 17 October 2011
In modern life science research, efficient management of high throughput primary lab data is essential. To realise it, four main aspects have to be handled: (I) long term storage, (II) security, (III) upload and (IV) retrieval.
In this paper we define central requirements for primary lab data management and discuss best practices for realising these requirements. As a proof of concept, we introduce a pipeline that has been implemented to manage primary lab data at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK). It comprises: (I) a data storage implementation including a Hierarchical Storage Management system, a relational Oracle Database Management System and a BFiler package to store primary lab data and their meta information, (II) the Virtual Private Database (VPD) implementation for the realisation of data security and the LIMS Light application to (III) upload and (IV) retrieve stored primary lab data.
With the LIMS Light system we have developed a primary data management system which provides an efficient storage system with a Hierarchical Storage Management System and an Oracle relational database. With our VPD Access Control Method we can guarantee the security of the stored primary data. Furthermore the system provides high performance upload and download and efficient retrieval of data.
The data domain is manifold and ranges from DNA mapping and sequences through gene expression and proteomics to phenotyping. Over the past several decades, many of the life sciences have been transformed into high-throughput areas of study. In a number of cases, the rate at which data can now be generated has increased by several orders of magnitude. Such dramatic expansions in data throughput have largely been enabled by engineering innovation, e.g. hardware advancements and automation. In particular, laboratory tasks that were once performed manually are now carried out by robotic fixtures. The growing trend towards automation continues to drive the urgent need for proper IT support. Here, we use the acronym IT to name those infrastructure services that (a) process and analyse data, (b) organise and store data and provide structured data handling capability and (c) support the reporting, editing, arrangement and visualisation of data. The aim is to reflect the biologist's data ⇒ information ⇒ knowledge paradigm and to meet the requirements of high-throughput data analysis and the resulting data volume of petabytes.
The Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Germany, is a research centre which primarily applies the concepts and technologies of modern biology to crop plants and hosts the German central ex situ gene bank. Many scientific fields at the IPK produce high throughput data, such as plant phenotyping (http://www.lemnatec.com) and 454-sequencing. One recurring task for bioinformatics at the IPK is the implementation of databases and information systems. To handle, manage and secure a huge amount of data properly, it is always a great advantage to have well trained bioinformatics and IT staff who take care of it and assume responsibility for the valuable electronic information. We learned which design principles and implementation techniques are useful and which have to be avoided. Subsequently, we abstracted a best practice and proven implementation concept from our experiences in several bioinformatics projects: SEMEDA (Semantic Meta Database), which provides semantically integrated access to databases and allows ontologies and controlled vocabularies to be edited and maintained; MetaCrop, a manually curated repository of high quality information about the metabolism of crop plants; and an integration and analysis pipeline for systems biology in crop plant metabolism. In this paper we summarise these experiences into best practice for primary data management. First, we present the requirements of primary data management. Afterwards, we give an overview of different solutions for primary data management. We then present our results and, finally, describe the methods used.
The major aspects of primary data management
There are four major aspects of handling primary data. First, today's automated techniques produce large amounts of data, so solutions for large storage capacity and efficient backup mechanisms are needed, preferably at affordable cost. Second, especially during the life span of a project, data access has to be restricted to project-related users, which requires a way to manage access rights for users and groups within the primary data management. Third, the system has to be efficient: the performance of storing files and the efficient retrieval of the stored data are among the most important properties of primary data management. The last important aspect is the usability of the system: the user interface has to be simple and intuitive for the different biological groups using it.
The central requirements of primary data management
From the four major aspects of primary data management and the architecture in Figure 2, several requirements for managing high throughput primary lab data can be derived:
storage of metadata from files for later retrieval
central storage of primary lab data for decentralised computation
versioning of data
upload and retrieval of data for authorised users
invariability of data has to be guaranteed
data upload and retrieval:
- fast upload of large amounts of data
- automation of data upload (including automated tagging of files)
- fast and efficient retrieval of stored files
Best practices for the above listed requirements are presented in the following sections.
Solutions for primary data management
A major aspect of scientific research is the citation of data in publications. Anderson et al. show that an increasing number of publications contain supplementary material and that data has to be archived for the long term. But often data is no longer available, especially for older publications. To make data citable, a persistent identifier linked with the data is needed. Such an identifier is provided by DOI (Digital Object Identifier, http://www.doi.org). One system working with DOIs is the Pangaea system (http://www.pangaea.de/), a publishing network for geoscientific and environmental data. It uses TIB, the German National Library of Science and Technology, which provides services for DOI registration. TIB is a member of DataCite, an organisation with currently 15 members and 4 associate members, whose main goal is to guarantee easy access to scientific data (http://www.datacite.org). Another motivation is shown by Piwowar et al.: there is a correlation between the publication of research data and the number of citations. Publications with shared data are cited more often than publications without any publicly available data.
The creation of central data repositories is a further task of scientific research. One example of such a repository is GabiPD. GabiPD stores data from high throughput experiments on different plant species and provides methods to analyse and visualise the stored data. The online archive PsychData is an open access central repository which provides documented long term data on all aspects of psychology. With this archive the scientific community has a platform to store data and exchange it with other scientists, which leads to better collaborative scientific use of the data. Another system to store scientific data is the LabKey Server. The main advantage of this platform is the ability to integrate, analyse and visualise data from diverse data sources, which is very useful in large research consortia. For digital long term storage, the nestor organisation was founded in Germany (http://www.longtermpreservation.de). It provides a guideline of standardisation activities regarding persistent identifiers, metadata, file formats, certification, reference models and records management. Nestor is also a partner in the European community APA (Alliance for Permanent Access, http://www.alliancepermanentaccess.org/).
Results and discussion
The LIMS Light system
Efficient storage system
We have implemented a system that balances long-term storage and fast data access by using HSM technology. The hard disk layer guarantees fast data access, and the tape layer provides a storage medium that is cheap in relation to hard disks. The HSM is configured to fulfil our storage policy. Every twenty minutes, new files are copied to two tapes. The second copy is regularly removed from the library and stored on an external shelf. Each file is kept in the disk cache for at least five minutes. If free disk space falls below five percent of capacity, the oldest files are released first, but the first 16K of every file is kept on disk. Additionally, we have an efficient way to back up our data. Instead of storing the data as BLOBs in the database, and thereby using expensive hard disks, we use Oracle BFILEs to link the metadata with the files stored on the HSM. Thus the metadata backup within the database is very efficient. Furthermore, the HSM provides an automatic backup of all files because they are stored twice on the tape layer.
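The release rule described above can be illustrated with a minimal Python sketch. It is not the SAM-FS implementation or its configuration; the class `CachedFile`, the function `release_oldest` and the constant names are hypothetical, and the sketch additionally assumes that only files already copied to tape may be released from the disk cache.

```python
from dataclasses import dataclass

HEADER_BYTES = 16 * 1024      # the first 16K of every file stays on disk
MIN_CACHE_SECONDS = 5 * 60    # minimum residence time in the disk cache
LOW_WATER_FRACTION = 0.05     # releasing starts when free space < 5 % of capacity


@dataclass
class CachedFile:
    name: str
    size: int           # file size in bytes
    cached_at: float    # time (seconds) at which the file entered the cache
    archived: bool      # assumption: True once the tape copies exist
    released: bool = False

    def bytes_on_disk(self) -> int:
        # A released file keeps only its leading 16K on disk.
        return min(self.size, HEADER_BYTES) if self.released else self.size


def release_oldest(files, capacity, now):
    """Release eligible files, oldest first, until free space is at least 5 %."""
    def free_space():
        return capacity - sum(f.bytes_on_disk() for f in files)

    released = []
    for f in sorted(files, key=lambda f: f.cached_at):
        if free_space() >= LOW_WATER_FRACTION * capacity:
            break
        old_enough = now - f.cached_at >= MIN_CACHE_SECONDS
        if f.archived and not f.released and old_enough:
            f.released = True
            released.append(f.name)
    return released
```

In this sketch a released file still occupies its first 16K on disk, so the free-space calculation counts that remainder rather than treating the file as gone.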
Proven data access control method
By using the row level security concept, we support fine-grained data access control that guarantees data privacy for all stored primary data. The row level security concept is an approach used in various database systems. Commercial DBMS offer row level security implementations, such as DB2 with Label Based Access Control (LBAC), as do open source systems such as Security Enhanced PostgreSQL (SE-PostgreSQL). The advantage of this concept is that a project manager has a method, independent of the application and of the database query language, to secure operations and table rows at the database level. This is achieved by the automated modification of database queries by the database system itself. The benefit is the possibility of granting direct SQL access without violating authorisation policies. With this system we are able to manage all users and workgroups from the IPK and can also grant external project partners access to the LIMS Light system without coding the authorisation policies in a middleware or frontend.
Efficient upload and retrieval of primary data
The LIMS Light data
In LIMS Light we are able to store primary data from many different data domains. Currently, more than 30,000 files are stored in the system, with a total data volume of approximately 900 GB. With 307 GB, sequence data consumes the most storage space, followed by image data (145 GB) and textual data (135 GB).
A functional overview of LIMS Light
As mentioned before, LIMS Light has three main components: the management, the upload and the retrieval of data. With the management component the user can create and modify all objects defined in the workflow. When creating a project, the user can specify the project name and description, the person responsible for the project and the project start and end dates. Additionally, the user can define which users and groups will have access to the project. The creation of an experiment includes information about the experiment name, description and date. The user can also assign material and associate ontologies with the experiment. The access rights for users and groups can be specified here in the same way as for projects. In the workset view the user can create new worksets including their description. Within the worksets, single or multiple files can be uploaded with the LIMS Light Uploader. The uploader is a wizard based tool which guides the user through the selection of project, experiment and workset. Projects and experiments can only be selected if access rights were granted to the user. Inside the selected workset the user can select files and directories for upload. The retrieval of data has two parts, a search component and a download tool. With the search function the user can search for data belonging to a specific project, experiment, workset or filename. Additionally, the user can search for data associated with a specific ontology, material or description. The Downloader retrieves the data stored in the LIMS Light system. The user can decide whether a single file, multiple files or even whole worksets including subworksets should be downloaded.
LIMS Light use case
The web interface supports the LIMS Light workflow, thus enabling the user to manage projects, experiments, worksets and files. This workflow makes it easy for all the different biological groups to upload their primary data. A user with sequence data can use it as well as a user storing image data. To show the principles of LIMS Light in a more practical way we have uploaded a use case for public access (http://limslight.ipk-gatersleben.de (user: limslight_public, password: limslight_public)).
With the LIMS Light system we have developed a primary data management system which provides an efficient storage system with a Hierarchical Storage Management System and an Oracle relational database. With our VPD Access Control Method we can guarantee the security of the stored primary data. The system further provides high performance upload and download and efficient retrieval of the data. With these first steps completed, we can now proceed to the next step, the integration of data citation, so that scientists using the LIMS Light system can cite their primary data within their publications.
Primary lab data storage
For handling the primary lab data, two kinds of data have to be stored: on the one hand the data files themselves and on the other hand the metadata. The files are stored on a file system (also recommended by Sears et al.). We used the hierarchical storage management (HSM) system SAM-FS (version 4.6.5) running under Sun Solaris 10.5. The meta information (e.g. taxonomy data, experiment conditions, genotypes, etc.) is stored in the relational database system Oracle (Oracle Enterprise Server 11g version 220.127.116.11.0 - 64bit). Additionally, the references to the files are managed within the database together with the metadata. For this purpose, Oracle BFILEs are used as a special data type within the relational tables.
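The separation between files on the file system and metadata plus file references in the database can be sketched as follows. This is only an illustration of the idea: it uses SQLite and a plain path column as a stand-in for Oracle and its BFILE data type, and the table name `primary_file`, its columns and all function names are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_catalogue():
    # In-memory SQLite database standing in for the metadata schema.
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE primary_file (
               id        INTEGER PRIMARY KEY,
               file_name TEXT NOT NULL,
               genotype  TEXT,
               locator   TEXT NOT NULL  -- reference to the file, not its content
           )"""
    )
    return db


def register_file(db, path, genotype):
    # Only metadata and the file reference are stored in the database.
    cur = db.execute(
        "INSERT INTO primary_file (file_name, genotype, locator) VALUES (?, ?, ?)",
        (path.name, genotype, str(path)),
    )
    return cur.lastrowid


def read_file(db, file_id):
    # The content is read from the file system through the stored reference.
    row = db.execute(
        "SELECT locator FROM primary_file WHERE id = ?", (file_id,)
    ).fetchone()
    return Path(row[0]).read_bytes()


def demo():
    with tempfile.TemporaryDirectory() as storage:
        data_file = Path(storage) / "reads.fasta"
        data_file.write_bytes(b">seq1\nACGT\n")
        db = create_catalogue()
        file_id = register_file(db, data_file, "hypothetical genotype")
        return read_file(db, file_id)
```

Because the table holds only a reference, a backup of the database covers the metadata alone, while the file content is protected by whatever mechanism the file system provides.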
Data security, an important aspect of primary lab data management, is realised independently of the application. The system has to guarantee that data is only accessible to specific users and groups. We used the row based concept of data protection called RLS (Row Level Security). It is a fine-grained access control that limits access at the row level for different operations (select, insert, update, delete) by defining a specific policy. RLS is implemented in Oracle as Virtual Private Database (VPD). The principle is to transparently add a where-clause to every statement issued against the data via SQL. This additional predicate is provided by a user defined database function, which runs in a privileged mode and can use metadata for its decision. The management of specific access rights is stored in a separate schema of the Oracle database used.
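The principle of appending a policy predicate to every statement can be mimicked in a few lines of Python. The sketch below is not Oracle VPD and does not use its API: SQLite serves as a stand-in, the predicate is appended by a helper function rather than by the database itself, and the tables `project_access` and `primary_file` as well as the function names are hypothetical.

```python
import sqlite3


def setup():
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE project_access (project_id INTEGER, user_name TEXT);
        CREATE TABLE primary_file (
            id INTEGER PRIMARY KEY, project_id INTEGER, file_name TEXT);
        INSERT INTO project_access VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO primary_file (project_id, file_name) VALUES
            (1, 'reads.fasta'), (2, 'leaf.png'), (1, 'notes.txt');
        """
    )
    return db


def access_predicate():
    # Plays the role of the policy function: it returns the predicate
    # that restricts rows to projects the session user may access.
    return (
        "project_id IN (SELECT project_id FROM project_access "
        "WHERE user_name = :session_user)"
    )


def secured_select(db, session_user, where="1 = 1", params=None):
    # The caller's own condition is kept and the policy predicate is
    # added with AND, so it cannot be bypassed by the caller's query.
    sql = (
        f"SELECT file_name FROM primary_file "
        f"WHERE ({where}) AND {access_predicate()} ORDER BY id"
    )
    bound = dict(params or {}, session_user=session_user)
    return [row[0] for row in db.execute(sql, bound)]
```

A call such as `secured_select(setup(), "alice", "file_name LIKE :pat", {"pat": "%.txt"})` returns only rows that match the caller's condition and belong to a project the user may access. In a real row level security implementation the rewriting happens inside the database, which is what makes direct SQL access safe.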
Data retrieval and upload
To fulfil the performance requirements for uploading primary lab data, a combined approach is used. A web based user interface was developed with the Oracle Application Express Technology (version 4.0.1.00.03). This interface supports the management of the metadata (e.g. file name, file type, author, file description, etc.). A connection to the Ontology Lookup Service (OLS) is integrated, which helps to classify the origin of a file with controlled vocabularies. Furthermore, data retrieval is supported by the web based user interface. The file upload is handled by a Java Web Start application (version 1.6), which can be started from the web based user interface to support a batch upload of selected files and/or folders.
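How a batch upload might gather file-level metadata for a selection of files and folders is sketched below in Python. This is not the LIMS Light Uploader, which is a Java application; the function names and the manifest layout are hypothetical, and the SHA-256 checksum is an added assumption that would allow a later check that a stored file is unchanged.

```python
import hashlib
import tempfile
from pathlib import Path


def collect_files(selection):
    # Expand selected folders recursively; keep selected files as they are.
    files = []
    for entry in selection:
        entry = Path(entry)
        if entry.is_dir():
            files.extend(p for p in sorted(entry.rglob("*")) if p.is_file())
        elif entry.is_file():
            files.append(entry)
    return files


def describe(path, author):
    # Metadata derived automatically from the file itself.
    # read_bytes() loads the whole file; large files would need chunked hashing.
    return {
        "file_name": path.name,
        "file_type": path.suffix.lstrip(".").lower() or "unknown",
        "author": author,
        "size_bytes": path.stat().st_size,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }


def build_manifest(selection, author):
    return [describe(p, author) for p in collect_files(selection)]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "run1").mkdir()
        (root / "run1" / "reads.FASTA").write_bytes(b"ACGT")
        (root / "plate.png").write_bytes(b"\x89PNG")
        return build_manifest(
            [root / "run1", root / "plate.png"], author="hypothetical user"
        )
```

Descriptive metadata such as ontology terms or a free-text description cannot be derived from the file and would still have to be supplied by the user.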
Availability and requirements
The source code of LIMS Light is available at http://dx.doi.org/10.5447/IPK/2011/0
Project name: LIMS Light
Project homepage: http://limslight.ipk-gatersleben.de (user: limslight_public, password: limslight_public, read only access on a test dataset)
Operating system: platform independent
Programming language: Java, PL/SQL
Other requirements: Oracle, Java 1.6
License: GPL 2.0
Any restrictions to use by non-academics: Please contact the authors before using the system.
We thank Dr. Swetlana Friedel for fruitful discussions and system testing. We also thank Burkhard Steuernagel for providing a dataset for the LIMS Light use case and Christian Friedrich for the implementation of the LIMS Light Downloader. Furthermore, we thank the BMBF for financial support.
- Köhl Karin, Basler Georg, Lüdemann Alexander, Selbig Joachim, Walther Dirk: A plant resource and experiment management system based on the Golm Plant Database as a basic tool for omics research. Plant Methods. 2008, 4 (11):
- Tolopko AN, Sullivan JP, Erickson SD, Wrobel D, Chiang SL, Rudnicki K, Rudnicki S, Nale J, Selfors LM, Greenhouse D, Muhlich JL, Shamu CE: Screensaver: an open source lab information management system (LIMS) for high throughput screening facilities. BMC Bioinformatics. 2010, 11: 260. doi:10.1186/1471-2105-11-260.
- Nix DA, Di Sera TL, Dalley BK, Milash BA, Cundick RM, Quinn KS: Next generation tools for genomic data generation, distribution, and visualization. BMC Bioinformatics. 2010, 11: 455. doi:10.1186/1471-2105-11-455.
- Wendl M, Smith S, Pohl C, Dooling D, Chinwalla A, Crouse K, Hepler T, Leong S, Carmichael L, Nhan M, Oberkfell B, Mardis E, Hillier L, Wilson R: Design and implementation of a generalized laboratory data model. BMC Bioinformatics. 2007, 8:
- McPherson JD: Next-generation gap. Nature Methods. 2009, 6 (11): S2-S5. doi:10.1038/nmeth.f.268.
- Gilbert GN: The Transformation of Research Findings into Scientific Knowledge. Social Studies of Science. 1976, 6 (3/4): 281-306.
- Cochrane G, Akhtar R, Bonfield J, Bower L, Demiralp F, Faruque N, Gibson R, Hoad G, Hubbard T, Hunter C, Jang M, Juhos S, Leinonen R, Leonard S, Lin Q, Lopez R, Lorenc D, McWilliam H, Mukherjee G, Plaister S, Radhakrishnan R, Robinson S, Sobhany S, Hoopen PTT, Vaughan R, Zalunin V, Birney E: Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Research. 2009, 37 (Database issue): D19-25.
- Steuernagel B, Taudien S, Gundlach H, Seidel M, Ariyadasa R, Schulte D, Petzold A, Felder M, Graner A, Scholz U, Mayer K, Platzer M, Stein N: De novo 454 sequencing of barcoded BAC pools for comprehensive gene survey and genome analysis in the complex genome of barley. BMC Genomics. 2009, 10: e547. doi:10.1186/1471-2164-10-547. [http://www.biomedcentral.com/1471-2164/10/547]
- Köhler J, Philippi S, Lange M: SEMEDA: ontology based semantic integration of biological databases. Bioinformatics. 2003, 19 (18): 2420-2427. doi:10.1093/bioinformatics/btg340.
- Grafahrend-Belau E, Weise S, Koschützki D, Scholz U, Junker BH, Schreiber F: MetaCrop: a detailed database of crop plant metabolism. Nucleic Acids Research. 2008, 36 (suppl_1): D954-D958.
- Weise S, Colmsee C, Grafahrend-Belau E, Junker BH, Klukas C, Lange M, Scholz U, Schreiber F: An integration and analysis pipeline for systems biology in crop plant metabolism. Data Integration in the Life Sciences; 6th International Workshop, DILS 2009, Manchester, UK, 20-22 July 2009, Volume 5647 of Lecture Notes in Bioinformatics. Edited by: Paton N, Missier P, Hedeler C, Berlin et al. 2009, Springer, 196-203.
- Anderson NR, Tarczy-Hornoch P, Bumgarner RE: On the persistence of supplementary resources in biomedical publications. BMC Bioinformatics. 2006, 7: 260. doi:10.1186/1471-2105-7-260.
- Piwowar HA, Day RS, Fridsma DB: Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE. 2007, 2 (3):
- Riaño-Pacón DM, Nagel A, Neigenfind J, Wagner R, Basekow R, Weber E, Mueller-Roeber B, Diel S, Kersten B: GabiPD: the GABI primary database - a plant integrative 'omics' database. Nucleic Acids Research. 2009, 37: D954-D959. doi:10.1093/nar/gkn611.
- Weichselgartner E, Winkler-Nees S: Daten für alle!. Forschung. 2010, 35: 19.
- Nelson EK, Piehler B, Eckels J, Rauch A, Bellew M, Hussey P, Ramsay S, Nathe C, Lum K, Krouse K, Stearns D, Connolly B, Skillman T, Igra M: LabKey Server: An open source platform for scientific data integration, analysis and collaboration. BMC Bioinformatics. 2011, 12: 71. doi:10.1186/1471-2105-12-71.
- Côté RG, Jones P, Apweiler R, Hermjakob H: The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics. 2006, 7: 97.
- Sears R, van Ingen C, Gray J: To BLOB or Not to BLOB: Large Object Storage in a Database or a Filesystem?. Technical Report MSR-TR-2006-45, Microsoft. 2006
- Oracle Database Security Guide 10g Release 2 (10.2). Oracle B14266-04, Oracle. 2008