CastorDB: a comprehensive knowledge base for Ricinus communis
© Chattoo et al; licensee BioMed Central Ltd. 2011
Received: 21 June 2011
Accepted: 13 September 2011
Published: 13 September 2011
Ricinus communis is an industrially important non-edible oil seed crop, native to tropical and subtropical regions of the world. Although the R. communis genome was assembled as a 4X draft by JCVI and is predicted to contain 31,221 proteins, the function of most of the genes remains to be elucidated. A large amount of information on different aspects of the biology of R. communis is available, but most of the data are scattered and not easily accessible. Therefore, a comprehensive resource on castor, CastorDB, is required to facilitate research on this important plant.
CastorDB is a specialized and comprehensive database for the oil seed plant R. communis, integrating information from several diverse resources. As primary data, CastorDB contains information on gene and protein sequences, gene expression and gene ontology annotation of protein sequences obtained from a variety of repositories. In addition, computational analysis was used to predict cellular localization, domains, pathways, protein-protein interactions, sumoylation sites and biochemical properties, and these predictions are included as derived data. The database has an intuitive user interface that prompts the user to explore the information resources available on a given gene or protein.
CastorDB provides a user-friendly, comprehensive resource on castor, with particular emphasis on its genome, transcriptome and proteome, and on protein domains, pathways, protein localization, presence of sumoylation sites, expression data and protein interacting partners.
Ricinus communis (family Euphorbiaceae) is an industrially important non-edible oil seed crop with several well-established applications in industry. The castor bean genome is around 350 Mb; it was sequenced and assembled as a 4X draft by Chan et al. using a whole-genome shotgun strategy and is predicted to contain 31,221 proteins, although the function of most of these proteins remains unknown. Thus, a comprehensive database has been developed to provide a useful resource by integrating information on the genome, transcriptome, and proteome of R. communis. Sequence data for the castor bean plant were obtained from resources such as the National Center for Biotechnology Information (NCBI) and the JCVI Castor Bean Genome Database. Programs were developed to connect to these databases and access their information through APIs. Important information extracted from the analyzed data was compiled in a back-end database using the MySQL database server for the construction of CastorDB. The information incorporated in CastorDB was generated by comparing the information extracted from the different resources; the result is a comprehensive resource for R. communis with information on protein domains, biosynthetic pathways, protein localization, presence of sumoylation sites, gene expression data, and interactions between proteins. CastorDB not only gives researchers an opportunity to extract detailed biological information on any specific gene or protein from a single resource but also prompts them to use that information to explore new data becoming available in plant genomics.
Sequence information on 31,221 proteins and genes of R. communis was downloaded from the JCVI Castor Genome database on January 12, 2009. Sequences from this database have unique locus identifiers, which were used during the analysis to distinguish sequences from each other. A large number of the sequences obtained were described as either hypothetical or predicted.
dbEST is a division of NCBI that contains EST data and "single-pass" cDNA sequences from various organisms. About 60,000 ESTs from different tissues of R. communis were obtained from dbEST. Each EST sequence was mapped to genes by performing nucleotide BLAST against mRNA sequences from R. communis with an e-value cutoff of 10^-6.
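The EST-to-gene mapping step can be illustrated with a minimal Python sketch. It assumes the BLAST hits are available in the standard 12-column tabular layout (query, subject, percent identity, ..., e-value, bit score); the function name, the sample rows and the identifiers are hypothetical and are not taken from the CastorDB pipeline.

```python
import csv
import io
from collections import defaultdict

EVALUE_CUTOFF = 1e-6  # the cutoff stated in the text


def map_ests_to_genes(blast_tabular, cutoff=EVALUE_CUTOFF):
    """Return {gene_id: sorted EST ids} for hits with e-value <= cutoff.

    `blast_tabular` is any iterable of tab-separated lines in the
    12-column BLAST tabular layout, where column 11 holds the e-value.
    """
    gene_to_ests = defaultdict(set)
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        est_id, gene_id, evalue = row[0], row[1], float(row[10])
        if evalue <= cutoff:
            gene_to_ests[gene_id].add(est_id)
    return {gene: sorted(ests) for gene, ests in gene_to_ests.items()}


# Hypothetical hits: two strong matches and one weak match that is dropped.
SAMPLE = (
    "EST001\tgene_A\t99.1\t450\t4\t0\t1\t450\t10\t459\t1e-120\t800\n"
    "EST002\tgene_A\t97.5\t300\t7\t0\t1\t300\t50\t349\t3e-80\t540\n"
    "EST003\tgene_B\t85.0\t40\t6\t0\t1\t40\t5\t44\t0.002\t38.2\n"
)

est_map = map_ests_to_genes(io.StringIO(SAMPLE))
print(est_map)
```

In this sketch an EST may support more than one gene if it passes the cutoff for several subjects; whether the original analysis kept only the top hit per EST is not stated in the text.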
R. communis proteins were mapped to gene ontology information on the basis of the GO annotation available for Pfam domains from the Gene Ontology database [7, 8]. The mapping of GO annotation to Pfam domains was generated from the InterPro2GO mapping data available from the InterPro database. In total, 11,847 R. communis proteins were mapped with probable GO annotation.
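As a rough illustration of domain-based GO transfer, the Python sketch below parses mapping lines written in the layout used by the publicly distributed Pfam-to-GO mapping files and attaches the resulting GO terms to proteins through their Pfam domains. The sample lines, the locus names and both function names are illustrative assumptions; the actual CastorDB scripts are not described at this level of detail.

```python
import re
from collections import defaultdict

# One mapping line looks like:
# Pfam:PF00069 Pkinase > GO:protein kinase activity ; GO:0004672
MAPPING_LINE = re.compile(r"^Pfam:(PF\d+)\s+\S+\s+>\s+GO:(.+?)\s+;\s+(GO:\d+)$")


def parse_pfam2go(lines):
    """Return {pfam_id: [(go_id, term_name), ...]} from mapping lines."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line.startswith("!"):
            continue  # header/comment lines start with '!'
        match = MAPPING_LINE.match(line)
        if match:
            pfam_id, term_name, go_id = match.groups()
            mapping[pfam_id].append((go_id, term_name))
    return mapping


def annotate_proteins(protein_domains, pfam2go):
    """Return {protein_id: sorted GO ids}; proteins without any term are omitted."""
    annotations = {}
    for protein_id, domains in protein_domains.items():
        go_ids = {go_id for d in domains for go_id, _ in pfam2go.get(d, [])}
        if go_ids:
            annotations[protein_id] = sorted(go_ids)
    return annotations


# Illustrative mapping lines and hypothetical locus identifiers.
SAMPLE_MAPPING = [
    "!example header line",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930",
    "Pfam:PF00001 7tm_1 > GO:integral component of membrane ; GO:0016021",
    "Pfam:PF00069 Pkinase > GO:protein kinase activity ; GO:0004672",
]
PROTEIN_DOMAINS = {
    "locus_A": ["PF00001"],
    "locus_B": ["PF00069"],
    "locus_C": ["PF99999"],  # domain with no GO mapping
}

go_annotations = annotate_proteins(PROTEIN_DOMAINS, parse_pfam2go(SAMPLE_MAPPING))
print(go_annotations)
```

Because the annotation is inherited from domains rather than assigned per protein, it is best read as "probable" GO annotation, which matches the wording used in the text.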
Putative pathways for the R. communis protein sequences were predicted using the KEGG Pathway database. KEGG PATHWAY is a collection of manually drawn pathway maps representing knowledge on molecular interaction and reaction networks; it incorporates approximately 146,590 pathway maps from different species belonging to 407 reference pathways. The R. communis proteins (31,221) were compared to the Swiss-Prot database [19, 20] using the BlastP API from DDBJ with an E-value cutoff of 10^-6. Each query protein sequence from R. communis was assigned probable pathways based on the pathway information available in the KEGG database for its homologous protein sequences in other species. A total of 112 probable pathways were predicted for 3785 castor bean proteins. All predicted pathways were manually checked to remove false positives from the prediction result.
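The homology-based pathway assignment described above can be sketched in a few lines of Python. The sketch assumes that each query protein has already been reduced to one best homologue with its e-value and that a lookup table from homologue accession to pathway identifiers is available; the accessions, locus names and the function name are hypothetical.

```python
def assign_pathways(best_hits, homolog_pathways, cutoff=1e-6):
    """Transfer pathway ids from homologues to query proteins.

    best_hits:        {protein_id: (homolog_accession, evalue)}
    homolog_pathways: {homolog_accession: set of pathway ids}
    Proteins whose hit exceeds the cutoff, or whose homologue has no
    pathway annotation, receive no assignment.
    """
    assigned = {}
    for protein_id, (homolog, evalue) in best_hits.items():
        if evalue > cutoff:
            continue
        pathways = homolog_pathways.get(homolog)
        if pathways:
            assigned[protein_id] = sorted(pathways)
    return assigned


# Hypothetical inputs; the pathway ids follow the KEGG reference-map style.
BEST_HITS = {
    "locus_A": ("HOMOLOG_1", 1e-50),
    "locus_B": ("HOMOLOG_2", 1e-3),   # fails the e-value cutoff
    "locus_C": ("HOMOLOG_3", 1e-20),  # homologue without pathway data
}
HOMOLOG_PATHWAYS = {
    "HOMOLOG_1": {"map00010", "map00061"},
    "HOMOLOG_2": {"map00020"},
}

predicted = assign_pathways(BEST_HITS, HOMOLOG_PATHWAYS)
print(predicted)
```

A transfer of this kind only produces candidates, which is why a manual check for false positives, as reported in the text, remains a necessary final step.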
Putative sumoylation sites in Ricinus communis proteins were predicted using SUMOsp 2.0, software for sumoylation site prediction from the Cuckoo workgroup. The non-redundant training data in the software contained 279 sumoylation sites from 166 distinct proteins. At a high cutoff value, SUMOsp 2.0 predicted sumoylation sites for 9755 protein sequences in R. communis.
Biochemical properties of the protein sequences were calculated using the Pepstats program from the European Molecular Biology Open Software Suite (EMBOSS) package. Pepstats was called programmatically to compute molecular weight, isoelectric point, charge, protein size, extinction coefficient and average residue weight for all R. communis proteins.
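To make two of these properties concrete, the following Python sketch computes molecular weight and average residue weight from a sequence. It is not the Pepstats implementation: the table of average residue masses is a standard one chosen here as an assumption, so values may differ slightly from Pepstats output, and the function names are hypothetical.

```python
# Average masses (Da) of amino acid residues, i.e. amino acid minus water.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free termini


def molecular_weight(sequence):
    """Sum the residue masses and add one water molecule.

    Raises KeyError for characters outside the 20 standard residues.
    """
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


def average_residue_weight(sequence):
    """Molecular weight divided by the number of residues."""
    return molecular_weight(sequence) / len(sequence)


peptide = "MKTAYIAK"  # arbitrary example peptide
print(round(molecular_weight(peptide), 2))
print(round(average_residue_weight(peptide), 2))
```

Properties such as isoelectric point and extinction coefficient need additional parameter tables (pK values, chromophore coefficients) and are left out of this sketch.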
Best NCBI and KEGG Homologue
In order to find the best homologue for each R. communis protein sequence in NCBI and KEGG, protein BLAST was performed at an e-value cutoff of 10^-10 against protein sequence datasets obtained from NCBI and KEGG using the keyword Ricinus communis. The hit with maximum identity and lowest e-value was selected as the best homologue.
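The selection rule can be sketched in Python as follows. The text does not say how hits are ranked when the highest identity and the lowest e-value point to different subjects, so this sketch assumes identity is compared first and e-value breaks ties; the `Hit` type, the function name and the sample values are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    subject_id: str
    percent_identity: float
    evalue: float


def best_homologue(hits: Sequence[Hit], cutoff: float = 1e-10) -> Optional[Hit]:
    """Return the hit with the highest identity among hits passing the cutoff.

    Ties on identity are broken by the lower e-value (an assumption,
    see the lead-in). Returns None when no hit passes the cutoff.
    """
    passing = [hit for hit in hits if hit.evalue <= cutoff]
    if not passing:
        return None
    return max(passing, key=lambda hit: (hit.percent_identity, -hit.evalue))


# Hypothetical hits for one query protein.
HITS = [
    Hit("subject_1", 92.0, 1e-80),
    Hit("subject_2", 98.5, 1e-60),
    Hit("subject_3", 98.5, 1e-95),
    Hit("subject_4", 99.9, 1e-4),  # high identity but fails the cutoff
]

best = best_homologue(HITS)
print(best)
```

Applying the cutoff before ranking keeps short, high-identity but statistically weak matches, such as `subject_4` here, from being reported as the best homologue.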
Architecture and Design of CastorDB
Tier 1: User Interface
The graphical interface gives the user access to CastorDB through various input queries and provides links to additional information pages that guide the user while browsing CastorDB. Query inputs from the user interface are sent to the programs and scripts in tier T2 via the HTTP POST method.
Tier 2: Programs for analysis
T2 consists of the Apache web server for the Windows platform and scripts written in Perl CGI. The Perl CGI scripts use BioPerl modules to run a local BLAST obtained from the NCBI FTP site and to parse the results so that the necessary information can be presented in the browser. The CGI scripts also use the MySQL Perl API to connect to the MySQL database in tier T3. The Perl DBI module, together with a DataBase Driver (DBD) for each type of server, provides a generic interface for database access. Complex queries that analyze a large variety of different types of data can therefore be realized in a fairly intuitive manner.
Tier 3: Database Schema
The relational database management system MySQL was used to store the data integrated in CastorDB. MySQL runs as a server and provides multi-user access to a number of different databases. The database schema was implemented using the MySQL Perl API, an Application Programming Interface (API) for accessing data in a heterogeneous environment of relational and non-relational database management systems from the Perl programming language.
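The interaction between the script tier and the database tier can be illustrated with a small Python sketch. CastorDB itself uses Perl DBI with MySQL; here Python's built-in sqlite3 module stands in so that the example runs without a server, and the table name, column names and rows are hypothetical because the actual schema is not given in the text.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database with a hypothetical protein table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "locus_id TEXT PRIMARY KEY, description TEXT, localization TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [
            ("locus_0001", "hypothetical protein", "nucleus"),
            ("locus_0002", "fatty acid hydroxylase (illustrative)", "endoplasmic reticulum"),
            ("locus_0003", "protein kinase (illustrative)", "cytoplasm"),
        ],
    )
    return conn


def search_by_keyword(conn, keyword):
    """Keyword search on the description column.

    The keyword is passed as a bound parameter, never interpolated into
    the SQL string, so form input cannot alter the query.
    """
    cursor = conn.execute(
        "SELECT locus_id, description, localization FROM protein "
        "WHERE description LIKE ? ORDER BY locus_id",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()


conn = build_demo_database()
results = search_by_keyword(conn, "hydroxylase")
print(results)
```

Placeholder binding of this kind is also what a generic database interface such as Perl DBI offers, and it matters for a tier that receives its query terms directly from a web form.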
Web Interface Access
CastorDB lets users explore the stored information through three kinds of search: (i) Simple Search, (ii) Advanced Search and (iii) BLAST Search using a protein or nucleotide sequence.
The BLAST-based search [6, 21] allows the user to browse CastorDB using a sequence in FASTA format. This option searches against the protein and nucleotide sequence databases of R. communis, which were generated using formatdb from the standalone BLAST package. The result table generated after running the program displays BLAST hits sorted by percent identity in descending order.
Representation of analysis results
This feature allows the user to download information as a text file, either for all genes appearing in a search result using the multiple export option or for individually selected genes.
Other web interfaces
Other web interfaces include a "Help" section, which describes each query option and the keyword inputs accepted by CastorDB. The "Literature and Links" section provides links to external literature databases such as PubMed and AGRICOLA, and to the web resources used during analysis of the castor genome. The "Feedback" section allows users to comment on the data and utilities incorporated in CastorDB.
The queries provided by CastorDB focus on retrieving the information available in various databases, along with the queried information, for a particular gene or protein in R. communis. Currently, information about this important oil seed plant is spread across different sources. Among the existing databases, (i) the JCVI Castor Genome Database and NCBI provide sequence information on R. communis genes and proteins; and (ii) information on ESTs expressed under different conditions is available from the dbEST division of the NCBI database.
CastorDB is designed to facilitate the analysis of information on R. communis obtained from various resources by bringing it together in one comprehensive database. CastorDB provides researchers with information not only on gene and protein sequences but also on possible GO annotation, domains present in a protein, predicted pathways, probable interacting partners, sub-cellular localization, protein sumoylation sites, gene expression and biochemical properties of a given protein. In addition to a common BLAST search, CastorDB offers keyword search by CastorDB ID, locus tag, gene name, domain name, pathway, localization or EST accession number. Also, some of the experimental data obtained from external resources are represented in a more interpretable form, which can give researchers a better understanding of the plant and help in designing critical experiments to gain deeper insights into its biology. In order to incorporate newer findings, the database will be updated every 6 months.
CastorDB was generated by correlating the information available on the genome, transcriptome, and proteome of R. communis, yielding a comprehensive resource on protein domains, pathways, protein localization, presence of sumoylation sites, expression data, protein interacting partners, etc. In addition to a common BLAST search and a simple keyword search, CastorDB offers an advanced search using different keywords and options. Also, some of the experimental data obtained from external resources are represented in a more interpretable form. Thus, CastorDB should be an important database, providing researchers with information to better understand the biology of this important plant.
Availability and requirements
Project Name: CastorDB: a comprehensive knowledge base for Ricinus communis
Project homepage: The database is currently available at http://CastorDB.msubiotech.ac.in
Operating system(s): Platform independent
License: Free for academic use; authorization is needed for commercial use (please contact the corresponding author for more details)
This work was supported by grants from Department of Biotechnology (DBT), Ministry of Science and Technology, Govt. of India.
- Chan AP, Crabtree J, et al.: Draft genome sequence of the oilseed species Ricinus communis. Nat Biotech. 2010, 28 (9): 951-956. 10.1038/nbt.1674.
- National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
- JCVI Castor Bean Genome Database. [http://castorbean.jcvi.org/index.shtml]
- MySQL database server. [http://www.mysql.com/]
- Boguski MS, Lowe TM, Tolstoshev CM: dbEST--database for expressed sequence tags. Nature Genetics. 1993, 4 (4): 332-3. 10.1038/ng0893-332.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S, AmiGO Hub, Web Presence Working Group: AmiGO: online access to ontology and annotation data. Bioinformatics. 2009, 25 (2): 288-9. 10.1093/bioinformatics/btn615.
- Ashburner M, et al.: Gene ontology: tool for the unification of biology. Nature Genetics. 2000, 25: 25-29. 10.1038/75556.
- Hunter, et al.: InterPro: the integrative protein signature database. Nucleic Acids Res. 2009, 37: D211-D215.
- Horton P, Park K-J, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K: WoLF PSORT: Protein Localization Predictor. Nucleic Acid Res. 2007, 35 (Web Server): W585-W587.
- Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot Engg. 1997, 10 (1): 1-6. 10.1093/protein/10.1.1.
- Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340: 783-795. 10.1016/j.jmb.2004.05.028.
- Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden Markov model: Application to complete genome. Journal of Molecular Biology. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.
- Sonnhammer ELL, von Heijne G, Krogh A: A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology. Edited by: Glasgow J, Littlejohn T, Major F, Lathrop R, Sankoff D, Sensen C. 1998, Menlo Park, CA, AAAI Press, 175-182.
- Nakai K, Horton P: PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trend Biochem Sci. 1999, 24 (1): 34-35. 10.1016/S0968-0004(98)01336-X.
- Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R: Pfam: clans, web tools and services. Nucl Acid Res. 2006, 34 (Database): D247-D251.
- Servant F, Bru C, Carrere S, Courcelle E, Gouzy J, Peyruc D, Kahn D: ProDom: Automated clustering of homologous domains. Brief Bioinform. 2002, 3 (3): 246-251. 10.1093/bib/3.3.246.
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl Acid Res. 1999, 27 (1): 29-34. 10.1093/nar/27.1.29.
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucl Acid Res. 1997, 25 (1): 31-36. 10.1093/nar/25.1.31.
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucl Acid Res. 1998, 26 (1): 38-42. 10.1093/nar/26.1.38.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- DNA Data Bank of Japan. [http://www.ddbj.nig.ac.jp/]
- Cui Jian, Li Peng, Li Guang, Xu Feng, Zhao Chen, Li Yuhua, Yang Zhongnan, Wang Guang, Yu Qingbo, Li Yixue, Shi Tieliu: AtPID: Arabidopsis thaliana protein interactome database an integrative platform for plant systems biology. Nucleic Acids Research. 2008, 36: D999-D1008.
- Swarbreck David, Wilks Christopher, Lamesch Philippe, et al.: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Research. 2008, 36: D1009-D1014.
- Xue Y, Zhou F, Fu C, Xu Y, Yao X: SUMOsp: a web server for sumoylation site prediction. Nucl Acid Res. 2006, 34 (Web Server): W254-W257.
- Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trend Genet. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
- JAVA. [http://www.sun.com/java/]
- Perl CGI Scripts. [http://www.activestate.com/Products/activeperl/index.mhtml]
- Apache web server. [http://httpd.apache.org/]
- Sigrist CJA, Cerutti L, de Castro E: PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 2010, 38 (Database): 161-6.
- Cytoscape. [http://www.cytoscape.org]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.