Skip to main content
  • Technical Note
  • Open access
  • Published:

iTAP: integrated transcriptomics and phenotype database for stress response of Escherichia coli and Saccharomyces cerevisiae

Abstract

Background

Organisms are subject to various stress conditions, which affect both the organism’s gene expression and phenotype. It is critical to understand microbial responses to stress conditions and uncover the underlying molecular mechanisms. To this end, it is necessary to build a database that collects transcriptomics and phenotypic data of microbes growing under various stress factors for in-depth systems biology analysis. Despite of numerous databases that collect gene expression profiles, to our best knowledge, there are few, if any, databases that collect both transcriptomics and phenotype data simultaneously. In light of this, we have developed an open source, web-based database, namely integrated transcriptomics and phenotype (iTAP) database, that records and links the transcriptomics and phenotype data for two model microorganisms, Escherichia coli and Saccharomyces cerevisiae in response to exposure of various stress conditions.

Results

To collect the data, we chose relevant research papers from the PubMed database containing all the necessary information for data curation including experimental conditions, transcriptomics data, and phenotype data. The transcriptomics data, including the p value and fold change, were obtained through the comparison of test strains against control strains using Gene Expression Omnibus’s GEO2R analyzer. The phenotype data, including the cell growth rate and the productivity, volumetric rate, and mass-based yield of byproducts, were calculated independently from charts or graphs within the reference papers. Since the phenotype data was never reported in a standardized format, the curation of correlated transcriptomics–phenotype datasets became extremely tedious and time-consuming. Despite the challenges, till now, we successfully correlated 57 and 143 datasets of transcriptomics and phenotype for E. coli and S. cerevisiae, respectively, and applied a regression model within the iTAP database to accurately predict over 93 and 73 % of the growth rates of E. coli and S. cerevisiae, respectively, directly from the transcriptomics data.

Conclusion

This is the first time that transcriptomics and phenotype data are categorized and correlated in an open-source database. This allows biologists to access the database and utilize it to predict the phenotype of microorganisms from their transcriptomics data. The iTAP database is freely available at https://sites.google.com/a/vt.edu/biomolecular-engineering-lab/software.

Background

Microorganisms face numerous stress conditions [13], such as oxidative stress [46], weak organic acid stress [79], nutrient limitation [10, 11], and environment fluctuation [12]. These stresses, both biotic and abiotic, occur throughout nature and comprise the ecology of the system [1, 13, 14]. Each stress condition elicits a microbial response to adapt to the unfavorable environmental conditions [1517]. The provoked responses of microorganisms alter the current ecosystem in which they live and affect the other organisms as well [16, 18]. Such microbial responses could be recreated in laboratories and allow for a deeper understanding of the correlation between gene expressions and phenotypes [4, 7, 12]. Of particular interests to systems biologists, uncovering the correlation between transcriptomics and phenotype could identify ‘genetic markers’ that are primarily responsible for the occurrence of a particular phenotype within a species [68]. This would help determine the genetic causations of certain phenotypes across strains. With the genetic markers identified, the phenotype of strains could possibly be predicted from its transcriptomics data directly, which has great potentials in biochemical, ecological, biomedical, and environmental applications [19].

The first step towards uncovering correlations between the transcriptomics and phenotype of various microorganisms is to collect curated and coupled transcriptomics–phenotype datasets for various microorganisms. Currently, there are multiple popular databases such as the Gene Expression Omnibus (GEO) [2022], the European Bioinformatics Institute (EBI) [23], and Many Microbe Microarrays Database (M3D) [24] that contain gene expression data. Data submitted to these databases is mostly meta-data on transcriptomics analysis, which include experimental conditions and the global gene expressions measured by either microarray [20, 25] or RNAseq analysis [26]. Comprised of over a million samples, these databases allow the analysis of large quantities of transcriptomics data; however, these databases lack the phenotypic data associated with these genotypes. Therefore, although thousands of data series and datasets are enabled for users to query for gene expression analysis, those datasets cannot provide the details about the phenotype such as cell growth rate, and hence, have limited applications in elucidating the correlations between transcriptomics and phenotype of microorganisms.

In this study, we developed an integrated transcriptomics and phenotype (iTAP) database that contained the correlated transcriptomics and phenotype datasets by collecting research articles that reported both types of data during its creation and curating the phenotype data with a standardized format. In general, we collected the transcriptomics data from GEO to provide p values and fold changes for each of the genes in Escherichia coli and Saccharomyces cerevisiae by comparing the gene expression of various strains against a reference strain as indicated in the corresponding publication. In parallel, we collected phenotype data associated with the transcriptomics data and numerically represented them as growth rates, productivity, volumetric rates, and mass-based yield of byproducts. The iTAP database also contained the experimental conditions and stress factors that the strains were subjected to. So far, we have collected, respectively, 143 and 57 datasets for S. cerevisiae and E. coli. Additionally, we demonstrated that it was feasible to use the correlated transcriptomics–phenotype datasets within the iTAP database to accurately predict cell growth rates for both S. cerevisiae and E. coli in a proof-of-concept study. Collecting this data proved to be strenuous and time-consuming, which limited the fast scale-up of the iTAP database. As the first of its kind, the iTAP database was able to identify the genetic markers, and potentially, guide synthetic biologists to rationally modify the microbial phenotypes by suppression or overexpression of genes of interests.

Implementation

Data collection and curation

As shown in Fig. 1, each of the dataseries in the database contained experimental conditions, transcriptomics data, and phenotype values of a microorganism, which were obtained from relevant research papers from the PubMed database. To ensure that both transcriptomics and phenotypic data were available in each of the dataseries, only papers that included all of the necessary details mentioned above were chosen to be included within the iTAP database.

Fig. 1
figure 1

Architecture of the iTAP database

In general, the experimental conditions were obtained directly from the chosen research papers, including information regarding the strain name, carbon source, culture medium, stress factor, and concentration of the stress factor. The transcriptomics data was collected as the gene expression levels of a test strain subjected to a particular stress condition when compared against a reference strain, and was recorded with p values and fold changes. Such data was obtained from the GEO database. The GEO database software tool, GEO2R analyzer [27], was used to compare the gene expression levels of strains facing stress conditions to the reference strain (Fig. 2). After choosing the test and reference strains and running the software, we obtained the corresponding gene expression levels, including p value and fold change. It is worth noticing that GEO, EBI and M3D share a lot of transcriptomics data that are exactly the same. Therefore, the same transcriptomics results can be generated when using EBI or M3D. The phenotype data was collected quantitatively as growth rates, productivity, volumetric rates, and mass-based yield of byproducts. Such data was obtained from different charts and graphs from the chosen research papers and was calculated independently. For example, all of the growth rates were calculated directly from the biomass data (e.g., dry cell weights at various time points) in the selected publications while the other phenotype data (e.g. the byproduct rates) was not used for calculating growth rates. Specifically, the phenotype data was calculated as:

$${\text{Productivity}}\, ( {\text{mmol/g/h)}}\,{ = }\frac{{{\text{Initial}}\,{\text{concentration}}\,{\text{of}}\,{\text{product}} - {\text{Final}}\,{\text{concentration}}\,{\text{of}}\,{\text{product }}\,\left( {{{\text{mmol}} \mathord{\left/ {\vphantom {{\text{mmol}} {\text{L}}}} \right. \kern-0pt} {\text{L}}}} \right)}}{{\left( {{\text{Final}}\,{\text{time}} - {\text{Initial}}\,{\text{time,}}\,{\text{h}}} \right) \, \times \,{\text{Initial biomass }}\left( {{{\text{g}} \mathord{\left/ {\vphantom {{\text{g}} {\text{L}}}} \right. \kern-0pt} {\text{L}}}} \right)}}$$
(1)

where biomass was assumed as: 1 OD = 0.4 g/L;

$${\text{Growth}}\,{\text{rate}}\,\left( {{\text{h}}^{ - 1} } \right) = \frac{{\ln \left( {\frac{{{\text{Final}}\,{\text{biomass}}}}{{{\text{Initial}}\,{\text{biomass}}}}} \right)}}{{{\text{Final}}\,{\text{time}} - {\text{Initial}}\,{\text{time}}\,\left( {\text{h}} \right)}}$$
(2)
$${\text{Volumetric rate}}\,\, ( {\text{g/L/h) = }}\frac{{ {\text{Initial concentration of product}} - \,{\text{Final concentration of product }}\, ( {\text{g/L)}}}}{{{\text{Final time }} - {\text{Initial time (h)}}}}$$
(3)
$${\text{Yield }}\,({\text{g/g}}) \;{ = }\;\frac{{{\text{Volumetric rate of product (e}} . {\text{g}} . , {\text{ethanol)}}}}{{{\text{Volumetreic rate of carbon source (e}} . {\text{g}} . , {\text{glucose)}}}}$$
(4)
Fig. 2
figure 2

Example of collecting the correlated transcriptomics–phenotype datasets

We directly used the data when the desired phenotype data was reported in the selected publications. Otherwise, we utilized Plot Digitizer software [28] for graphs within the publication to obtain values of specific points depicting the rate of consumption of glucose, rate of production of products, and growth curves of different strains. All of the data, both the raw data collected from the research papers and the standardized data we calculated, were reported in the iTAP database. It is worth noticing that the majority of the phenotype data collected in the iTAP database were growth related, which is one aspect of the composite of observable characteristics of E. coli or S. cerevisiae.

Data distribution

The iTAP database is an open source, web-based database that is freely available for use (https://sites.google.com/a/vt.edu/biomolecular-engineering-lab/software). It was developed based on Zoho Creator, an online database software that offers the data collection, cloud storage, data backup, and basic data analysis to present all the information in an efficient, user-friendly manner, as shown in Fig. 3. Advanced search with various logic symbols was available for each dataset, allowing users to find their required information efficiently. In addition, user could sort and group one or multiple dataset(s), and browse and print each dataset with different kinds of information by using the “Print” or “View Record” option to output the database. Users were also able to access the real-time data in mobile apps and download any datasets within iTAP as .csv files.

Fig. 3
figure 3

Screenshot of the iTAP database

Results and discussion

To explore the possibility of using iTAP to predict cell phenotype, we first calculated the Pearson’s correlation coefficient [29] of the expression levels of each gene and the corresponding cell growth rate in the entire iTAP database for S. cerevisiae and E. coli, respectively, then picked the top five genes whose expression levels were highly correlated to cell growth rate as the genetic markers for S. cerevisiae and E. coli respectively, and applied multi-variant linear regression model in MATLAB to correlate the expression levels of the genetic markers and the cell growth rates (Fig. 4). The genes whose expression levels in the test strain were not significantly different from those in the reference strain (i.e., p > 0.25) were set to have a fold change as zero. We found that the growth rates of both S. cerevisiae and E. coli could be accurately predicted, with R2 reaching 0.73 and 0.86, respectively. Also, the genetic markers we identified had high coverage of all the case studies collected in the iTAP database, reaching 72.7 and 93.0 % for S. cerevisiae and E. coli respectively. This indicated that in most of the transcriptomics studies on stress responses of S. cerevisiae and E. coli, the selected genetic markers were significantly regulated and could be generally used to predict cell phenotypes such as growth rate.

Fig. 4
figure 4

Application of the iTAP database to accurately predict growth rates of S. cerevisiae and E. coli

We next analyzed the effect of the p value, which was used to judge whether or not a gene expression level in the test strain was significantly different from that in the reference strain, on prediction accuracy of cell growth rates and the coverage of case studies in the iTAP database. We found that the prediction could maintain a high accuracy, with R2 ranging from 0.61 to 0.74 for S. cerevisiae and 0.82–0.86 for E. coli. With the increase of the p-value, the coverage of case studies in the iTAP strain increased accordingly, since the expression levels of a gene in the test strain would be more frequently recognized as significantly different from that in the reference strain with a loose threshold of the p-value. Overall, by using iTAP database, we successfully proved that it was indeed possible to predict cell phenotypes from the characterization of global gene expressions.

Conclusions

In this study, the iTAP database was constructed by utilizing research papers involving stress responses for two model organisms, E. coli and S. cerevisiae. To develop iTAP database, gene expression data, specifically the p values and fold changes, were obtained from the GEO database, while the phenotype data was calculated from numerical information provided in multiple research papers and standardized to a defined form of representation to ensure the uniformity of the data. Till now, we have successfully curated 57 and 143 datasets for E. coli and S. cerevisiae, respectively. This study also proved that with the “big data” of coupled transcriptomics–phenotype datasets, we could achieve accurate predictions of cell phenotypes, such as growth rates, directly from transcriptomic readouts and identify the genes that affect the occurrence of the phenotype most significantly. It is intended that this open-source, web-based database will be expanded to include not only more dataseries for the existing microorganisms by considering other stress conditions, but also to increase the number of microorganisms studied and include multi-omics data in future.

Availability and requirements

  • Project name: Integrated Transcriptomics and Phenotype Database (iTAP)

  • Project homepage: https://sites.google.com/a/vt.edu/biomolecular-engineering-lab/software

  • Operating systems: Platform independent

  • Programming language: Zoho Creator

  • License: iTAP is freely available for noncommercial purposes

  • Any restrictions to use by non-academics: none

References

  1. Storz G, Hengge R. Bacterial stress responses. Washington, DC: American Society for Microbiology Press; 2010.

    Google Scholar 

  2. Hecker M, Völker U. General stress response of Bacillus subtilis and other bacteria. In: Advances in microbial physiology. vol 44. New York: Academic Press; 2001. p. 35–91.

  3. Beales N. Adaptation of microorganisms to cold temperatures, weak acid preservatives, low pH, and osmotic stress: a review. Compr Rev Food Sci Food Saf. 2004;3(1):1–20.

    Article  CAS  Google Scholar 

  4. Jozefczuk S, Klie S, Catchpole G, Szymanski J, Cuadros-Inostroza A, Steinhauser D, Selbig J, Willmitzer L. Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol. 2010;6:364.

    Article  PubMed Central  PubMed  Google Scholar 

  5. Lushchak VI. Adaptive response to oxidative stress: bacteria, fungi, plants and animals. Comp Biochem Physiol C: Toxicol Pharmacol. 2011;153(2):175–90.

    Google Scholar 

  6. Ma M, Liu ZL. Comparative transcriptome profiling analyses during the lag phase uncover YAP1, PDR1, PDR3, RPN4, and HSF1 as key regulatory genes in genomic adaptation to the lignocellulose derived inhibitor HMF for Saccharomyces cerevisiae. BMC Genom. 2010;11:660.

    Article  CAS  Google Scholar 

  7. Abbott DA, Knijnenburg TA, de Poorter LMI, Reinders MJT, Pronk JT, van Maris AJA. Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae. FEMS Yeast Res. 2007;7:819–33.

    Article  CAS  PubMed  Google Scholar 

  8. An J, Kwon H, Kim E, Lee YM, Ko HJ, Park H, Choi I-G, Kim S, Kim KH, Kim W, et al. Tolerance to acetic acid is improved by mutations of the TATA-binding protein gene. Environ Microbiol. 2015;17(3):656–69.

    Article  CAS  PubMed  Google Scholar 

  9. Xia J-M, Yuan Y-J. Comparative lipidomics of four strains of Saccharomyces cerevisiae reveals different responses to furfural, phenol, and acetic acid. J Agric Food Chem. 2009;57(1):99–108.

    Article  CAS  PubMed  Google Scholar 

  10. Tai SL, Boer VM, Daran-Lapujade P, Walsh MC, de Winde JH, Daran J-M, Pronk JT. Two-dimensional transcriptome analysis in chemostat cultures: combinatorial effects of oxygen availability and macronutrient limitation in Saccharomyces cerevisiae. J Biol Chem. 2005;280(1):437–47.

    Article  CAS  PubMed  Google Scholar 

  11. Brauer MJ, Huttenhower C, Airoldi EM, Rosenstein R, Matese JC, Gresham D, Boer VM, Troyanskaya OG, Botstein D. coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast. Mol Biol Cell. 2008;19(1):352–67.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  12. Singh J, Kumar D, Ramakrishnan N, Singhal V, Jervis J, Garst JF, Slaughter SM, DeSantis AM, Potts M, Helm RF. Transcriptional response of Saccharomyces cerevisiae to desiccation and rehydration. Appl Environ Microbiol. 2005;71(12):8752–63.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  13. Schimel J, Balser TC, Wallenstein M. Microbial stress-response physiology and its implications for ecosystem function. Ecology. 2007;88(6):1386–94.

    Article  PubMed  Google Scholar 

  14. Çakar ZP, Seker UOS, Tamerler C, Sonderegger M, Sauer U. Evolutionary engineering of multiple-stress resistant Saccharomyces cerevisiae. FEMS Yeast Res. 2005;5(6–7):569–78.

    Article  PubMed  Google Scholar 

  15. Kültz D. Molecular and evolutionary basis of the cellular stress response. Annu Rev Physiol. 2005;67(1):225–57.

    Article  PubMed  Google Scholar 

  16. Abee T, Wouters JA. Microbial stress response in minimal processing. Int J Food Microbiol. 1999;50(1–2):65–91.

    Article  CAS  PubMed  Google Scholar 

  17. Ramos JL, Gallegos Ma-T, Marqués S, Ramos-González M-I, Espinosa-Urgel M, Segura A. Responses of gram-negative bacteria to certain environmental stressors. Curr Opin Microbiol. 2001;4(2):166–71.

    Article  CAS  PubMed  Google Scholar 

  18. Love N, Bott C. Evaluating the role of microbial stress response mechanisms in causing biological treatment system upset. Water Sci Technol. 2002;46(1–2):11–8.

    CAS  PubMed  Google Scholar 

  19. Heer D, Heine D, Sauer U. Resistance of Saccharomyces cerevisiae to high concentrations of furfural is based on NADPH-dependent reduction by at least two oxireductases. Appl Environ Microbiol. 2009;75(24):7631–8.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Edgar R, Barrett T. NCBI GEO standards and services for microarray data. Nat Biotechnol. 2006;24(12):1471–2.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009;37(suppl 1):D885–90.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  22. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, et al. NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic Acids Res. 2011;39(suppl 1):D1005–10.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  23. Stoesser G, Tuli MA, Lopez R, Sterk P. The EMBL nucleotide sequence database. Nucleic Acids Res. 1999;27(1):18–24.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  24. Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. Many microbe microarrays database: uniformly normalized affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008;36:D866–70.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  25. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270(5235):467–70.

    Article  CAS  PubMed  Google Scholar 

  26. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  27. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013;41(D1):D991–5.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  28. Huwaldt JA. Plot Digitizer: GPL. Available: http://plotdigitizer.sourceforge.net/. Accessed 2013 May 6.

  29. Stigler SM. Francis Galton’s account of the invention of correlation. Stat Sci. 1989;4(2):73–9.

    Article  Google Scholar 

Download references

Authors’ contributions

NS and XF initiated this study. NS, CA, WG, and JS collected the datasets. RB assisted with the development of iTAP database using Zoho Creator. NS, CA, and WG performed the computational analysis. NS, CA, WG and XF wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank the writing center in Virginia Tech for improving the language of the paper. This study was supported by start-up fund (#175323) from Virginia Tech and NSF (DBI 1356669).

Competing interests

The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xueyang Feng.

Additional information

Niveda Sundararaman, Christine Ash and Weihua Guo contributed equally to this work

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sundararaman, N., Ash, C., Guo, W. et al. iTAP: integrated transcriptomics and phenotype database for stress response of Escherichia coli and Saccharomyces cerevisiae . BMC Res Notes 8, 771 (2015). https://doi.org/10.1186/s13104-015-1759-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13104-015-1759-7

Keywords