MetabR: an R script for linear model analysis of quantitative metabolomic data
© Ernest et al.; licensee BioMed Central Ltd. 2012
Received: 19 June 2012
Accepted: 8 October 2012
Published: 30 October 2012
Skip to main content
© Ernest et al.; licensee BioMed Central Ltd. 2012
Received: 19 June 2012
Accepted: 8 October 2012
Published: 30 October 2012
Metabolomics is an emerging high-throughput approach to systems biology, but data analysis tools are lacking compared to other systems level disciplines such as transcriptomics and proteomics. Metabolomic data analysis requires a normalization step to remove systematic effects of confounding variables on metabolite measurements. Current tools may not correctly normalize every metabolite when the relationships between each metabolite quantity and fixed-effect confounding variables are different, or for the effects of random-effect confounding variables. Linear mixed models, an established methodology in the microarray literature, offer a standardized and flexible approach for removing the effects of fixed- and random-effect confounding variables from metabolomic data.
Here we present a simple menu-driven program, “MetabR”, designed to aid researchers with no programming background in statistical analysis of metabolomic data. Written in the open-source statistical programming language R, MetabR implements linear mixed models to normalize metabolomic data and analysis of variance (ANOVA) to test treatment differences. MetabR exports normalized data, checks statistical model assumptions, identifies differentially abundant metabolites, and produces output files to help with data interpretation. Example data are provided to illustrate normalization for common confounding variables and to demonstrate the utility of the MetabR program.
We developed MetabR as a simple and user-friendly tool for implementing linear mixed model-based normalization and statistical analysis of targeted metabolomic data, which helps to fill a lack of available data analysis tools in this field. The program, user guide, example data, and any future news or updates related to the program may be found athttp://metabr.r-forge.r-project.org/.
Quantitative metabolomics is a high-throughput approach to systems biology in which many small molecules (metabolites) from a biological sample are simultaneously measured, commonly using nuclear magnetic resonance spectroscopy (NMR), gas chromatography—mass spectrometry (GC-MS), or liquid chromatography—mass spectrometry (LC-MS). While transcriptomics and proteomics are established approaches for characterizing the effects of experimental conditions on metabolism, gene and protein expression changes merely indicate the potential for changes in metabolic endpoints. Metabolic changes are “real-world” endpoints, so metabolomics can connect these functional genomics platforms with actual physiology.
LC-MS metabolomic approaches fall into two categories: those that attempt to measure every metabolite in the sample (untargeted approaches) and those that attempt to measure only a subset of the metabolites (targeted approaches). A key benefit of targeted approaches is that the detected metabolites can also be readily quantified. Like other approaches to systems biology that rely on the analysis of multiple samples to generate large datasets, two important issues hold true in targeted metabolomics. First, experiments frequently are carried out in multiple “blocks”. For example, targeted LC-MS metabolomic platforms involve lengthy instrumental runs and may rely on multiple runs to enhance metabolite coverage[3, 4], often necessitating multiple run days to analyze all samples. Each run day represents a different block, which introduces technical variability in metabolite detection signals from day-to-day variances in factors related to the instrument’s operation, such as injection volume and ionization efficiency. Second, sampling and measurement variables introduce technical variability in metabolite detection signals, including tissue mass (for multicellular organisms), cell number and size (for microorganisms), sample matrix effects, and mass spectrometer variability (measured by the signal from an internal standard present in the metabolite extraction solvent in our experiments). Clearly, the metabolite signal variability due to block and sampling/measurement variables needs to be distinguished from variability due to experimental treatment factors, which calls for a normalization step to remove the effects of such confounding variables.
Conventional LC-MS metabolomic data normalization is carried out by expressing each metabolite signal relative to values of sampling/measurement variables[3, 4]. Statistical tests for mean differences between treatment groups are performed on normalized metabolite values, with metabolite means averaged across the levels of any block factors (i.e., run day).
There are limitations to this conventional normalization approach, however. First, often many metabolites are normalized to one internal standard (i.e., one for all positive ions and one for all negative ions). This would introduce additional bias if there were low or negative correlation between the internal standard signal and a metabolite signal (i.e., for metabolites with different chemical properties from the internal standard), or if the internal standard signal differed significantly between treatment groups. Second, while ignoring block factors (i.e., comparing metabolite means averaged across samples analyzed on different days) increases sample size, significant block effects on metabolite signals may widen confidence intervals, which may preclude identification of “significant” metabolites and conceal statistical outliers. Block effects may dramatically bias the data, especially if they are not balanced across treatment groups.
Currently available software packages provide powerful tools for pre-processing (i.e., peak selection and integration and retention time alignment), visualization (i.e., biochemical pathway mapping), and/or interpretation of targeted and untargeted metabolomic data[5–10]. However, these packages have limitations because they either 1) do not provide normalization tools for removing confounding effects of experimental variables[7–9]; 2) use the conventional normalization approach; or 3) require the researcher to manually determine a normalization factor for each experimental sample.
A flexible and standardized normalization approach that improves on current limitations would improve metabolomic analyses. An efficient and intuitive approach to control for confounding variables is to estimate their effects on metabolite signals using linear models. Rather than assuming similar relationships between each metabolite signal and confounding variables, a linear model fit for each metabolite can be used to estimate and partition the effects of each experimental variable, including treatment factor, on each metabolite signal. Further, experimental variables can be modeled as having either a fixed or random effect on metabolite signals, with important implications. Fixed-effect variables are assumed to have a constant effect on metabolite signals, influence metabolite signals in an anticipated direction, and have a similar influence in replicate experiments. Common fixed-effect variables are number of cells, tissue mass, and ionization efficiency. By comparison, the effects of random-effect variables cannot be anticipated a priori, and they create variation, but overall do not influence metabolite signals. Typical examples are specimen gender, species or line, experiment day, instrument, and technician, although some of these could be treated as hypothesis-driven experimental factors in some experiments.
Mixed models can be used to estimate the effects of fixed- and random-effect variables on a response variable and are an established approach for normalizing microarray data[12–21]. For two primary reasons, however, currently available microarray data normalization tools are not suitable for metabolomic data. First, microarray normalization tools adjust data for systematic effects specific to microarray technology, such as “dye bias” of different fluors, spatial position effects on the microarray chip, background signals, and biases due to probe binding strengths. Second, microarray normalization tools are often platform specific, designed to carry out pre-processing and quality control only for Illumina BeadArray or for Affymetrix GeneChip platforms, for example.
Given the limitations of current metabolomic data normalization approaches, we developed MetabR, a simple, user-friendly, and stand-alone tool that researchers with no programming background can use to implement linear model-based normalization and statistical analysis of targeted metabolomic data downstream of pre-processing. While MetabR is stand-alone, software with pre-processing tools[5, 6, 8] can be used to generate the input data for MetabR. Further, MetabR generates output files that may be used in subsequent analysis, including normalized data, a heat map and dendrogram, and a comma-separated values (CSV) file formatted for direct upload into Pathway Projector, a web-based biochemical pathway visualization tool.
where μ = group mean,
Group = treatment factor,
Quantity = a measured, continuous value of the amount of tissue used to produce each sample,
IS = a measured, continuous value of the detection signal from an internal standard present in the metabolite extraction solvent,
Day = a normalization factor accounting for the effects of different run days on metabolite signals,
and e = residual error.
The residuals and treatment group means from the fitted model are added together to yield normalized data, which adjusts for effects of sample quantity, ionization efficiency, and run day, as appropriate for the experimental design of the study.
Output files produced by the MetabR program
Normalized data with technical replicates averaged
A plot of the model residuals for each metabolite vs. each metabolite’s overall mean signal
A plot of the model residuals for each metabolite vs. each metabolite’s overall mean signal, expanded to accommodate metabolite labels
Mean plots for all significant metabolites
Tukey HSD p-values for all treatment group comparisons for every metabolite
q-values for all treatment group comparisons for every metabolite
Mean fold-changes between all treatment group comparisons for every metabolite
Plots of all confounding variables vs. all metabolite measurements, pre- and post-normalization
Heat map and dendrogram of the normalized data
Spreadsheet for direct upload to Pathway Projector
Two experimental datasets were generated in our lab to illustrate the utility of MetabR. In both experiments, adipose tissue samples were flash frozen in liquid nitrogen and powdered with a mortar and pestle before metabolite extraction, which followed a previously described procedure. The extracted metabolomes were then analyzed by liquid chromatography—tandem mass spectrometry (LC-MS/MS) via a slightly modified version of the methods of Rabinowitz and co-workers[30–32] that scans for approximately 350 total metabolites in positive and negative ionization modes. The Quan Browser function in the Xcalibur software package (Thermo Scientific, Waltham, MA) was used to assess the presence of each metabolite based on standard detection parameters, such as retention time, signal-to-noise ratio, and peak shape. Signal intensity in ion counts was then determined using Xcalibur to manually integrate each peak, and these data were exported into a Microsoft Excel spreadsheet for statistical analysis.
The first experiment was designed to examine the effects of dietary restriction and insulin immunoneutralization on adipose tissue metabolism in chickens. A total of 127 metabolites were detected in abdominal adipose tissue from 16- or 17-day-old male broiler chicks that were fed ad libitum (“Control”), fasted for 5 hours (“Fast”), or immunoneutralized against the effects of endogenous insulin (“InsNeut”), as we previously described[33, 34]. This study included two factors, Treatment and Day (day 1, day 2, or day 3). Fourteen metabolite measurements from this experiment are provided in Additional files3 (“Chicken example data 1”) and4 (“Chicken example data 2”), corresponding to metabolites detected in positive and negative ionization modes, respectively.
The second experiment was designed to examine the effects of Bisphenol A (BPA) on adipose tissue metabolism in mice. A total of 93 metabolites were detected in abdominal adipose tissue from 32 16-week-old inbred male mice which, from weaning, were fed ad libitum and given drinking water spiked with 0, 0.05, 0.5, or 5 μM BPA. Sixteen mice from each of the inbred strains C57BL/6J and DBA/2J were used in this study. A few missing values arose when a metabolite was not detected in a subset of the samples. Using a zero value for these measurements would bias the results, so they were set to missing (“NA”) which excludes that measurement from analysis. This study included three factors, Treatment, Strain (C57BL/6J or DBA/2J), and Day (day 1, day 2, day 3, or day 4). Twelve metabolite measurements from this experiment are provided in Additional files5 (“Mouse example data 1”) and6 (“Mouse example data 2”), corresponding to metabolites detected in positive and negative ionization modes, respectively.
In our chicken example, Group, Quantity, and IS were modeled as fixed-effect variables, while Day was modeled as a random-effect variable. To illustrate the difference, if Day is defined as a fixed-effect variable, the estimated treatment group mean includes the average Day effects, and the variance and corresponding confidence intervals are based only on residual error and sample size. Inferences about treatment effects refer only to the days used in the experiment. If Day is defined as a random-effect variable, the estimated mean no longer includes Day. Instead, the Day effect becomes a source of random variation that is added to the variance of the estimated mean. The variance and confidence intervals are larger than those when Day is treated as a fixed-effect variable, but experimental results can now be correctly extrapolated to all possible days.
For the chicken data, Quantity (tissue mass) and IS (internal standard measurement, Tris in positive ionization mode and Benzoic Acid in negative ionization mode) were selected as fixed-effect regression variables, and Day (run day) as a random-effect factor.
Chicken experiment fold-changes
Table 2 contains all between-group mean fold-changes for the metabolites, with differences tested by Tukey’s HSD at the 5% significance level. We produced this table by combining the mean fold-changes and p-values exported automatically by MetabR. As shown, the experiment had sufficient power to detect a fold-change as low as 1.25 for Citrate between Fast and Control groups. In general, the differences between the Control and InsNeut groups were smaller than other treatment group comparisons. The program exports q-values automatically, and the researcher may select p-value, q-value, mean fold-change, or a combination of either p-value or q-value and mean fold-change as a significance threshold. As technological improvements continue to allow more metabolites to be detected, the chance of false discoveries will increase, making false discovery corrections (q-value) increasingly necessary.
Mouse experiment fold-changes
The open-source statistical computing software R provides a convenient environment for statistical analysis of metabolomic and other -omic data. We developed a user-friendly R program that normalizes metabolomic data using linear mixed-effect modeling (with regression and ANOVA terms), statistically compares treatments, and exports results files to aid data interpretation, filling an important lack in statistical analysis tools available to the metabolomics community. The MetabR program file, example data, and user guide are available as an R-Forge project athttp://metabr.r-forge.r-project.org/. This website will also contain future news or updates related to MetabR, including availability through the Comprehensive R Archive Network (CRAN) or Bioconductor.
Project name: MetabR
Project home page: http://metabr.r-forge.r-project.org/
Operating system(s): Windows, Mac, Linux, any system that runs R
Programming language: R
Other requirements: Required R packages are installed automatically. The program was written and tested using R version 2.15 for Windows.
License: GNU General Public License (GPL)
Any restrictions to use by non-academics: No restrictions
The datasets supporting the results of this article are included within the article (and its additional files).
J Gooding’s current address: Sarah W. Stedman Nutrition & Metabolism Center, Duke University School of Medicine, 4321 Medical Park Drive, Suite 200, Durham, NC 27704
Analysis of variance
Graphical user interface
Honest Significant Difference
liquid chromatography—mass spectrometry
Liquid chromatography—tandem mass spectrometry.
JRG and SRC were supported by funding from the National Science Foundation through an Ocean Sciences award (OCE-1061352) to the University of Tennessee at Knoxville. Funding for metabolomic analyses of chicken adipose tissue was provided by a University of Tennessee AgResearch Innovation Grant to BHV and SRC.
The authors thank Brantley Wyatt, previously of the University of Tennessee Graduate School of Genome Science and Technology, for conducting the mouse experiments and generating the mouse adipose tissue samples used in this work, and Drs. Joelle Dupont and Jean Simon of the Institut National de la Recherche Agronomique (INRA) for conducting the chicken experiments and providing the corresponding adipose tissue samples.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.