The bench scientist's guide to statistical analysis of RNA-Seq data
© Yendrek et al.; licensee BioMed Central Ltd. 2012
Received: 28 February 2012
Accepted: 10 September 2012
Published: 14 September 2012
RNA sequencing (RNA-Seq) is emerging as a highly accurate method to quantify transcript abundance. However, analyses of the large data sets obtained by sequencing the entire transcriptome of organisms have generally been performed by bioinformatics specialists. Here we provide a step-by-step guide and outline a strategy using currently available statistical tools that results in a conservative list of differentially expressed genes. We also discuss potential sources of error in RNA-Seq analysis that could alter interpretation of global changes in gene expression.
When comparing statistical tools, the negative binomial distribution-based methods, edgeR and DESeq, respectively identified 11,995 and 11,317 differentially expressed genes from an RNA-seq dataset generated from soybean leaf tissue grown in elevated O3. However, the number of genes in common between these two methods was only 10,535, resulting in 2,242 genes determined to be differentially expressed by only one method. Upon analysis of the non-significant genes, several limitations of these analytic tools were revealed, including evidence for overly stringent parameters for determining statistical significance of differentially expressed genes as well as increased type II error for high abundance transcripts.
Because of the high variability between methods for determining differential expression of RNA-Seq data, we suggest using several bioinformatics tools, as outlined here, to ensure that a conservative list of differentially expressed genes is obtained. We also conclude that despite these analytical limitations, RNA-Seq provides highly accurate transcript abundance quantification that is comparable to qRT-PCR.
KeywordsRNA-Seq Differential Expression Statistical analysis
As a method for characterizing global changes in transcription, RNA-Seq is an attractive option because of the ability to quantify differences in mRNA abundance in response to various treatments and diseases, as well as to detect alternative splice variants and novel transcripts. Compared to microarray techniques, RNA-Seq eliminates the need for prior species-specific sequence information and overcomes the limitation of detecting low abundance transcripts. In addition, early studies have demonstrated that RNA-Seq is very reliable in terms of technical reproducibility. As a result, biologists studying an array of model and non-model organisms are beginning to utilize RNA-Seq analysis with ever growing frequency[3–7]. However, without experience using bioinformatics methods, the large number of choices available to analyze differential expression can be overwhelming for the bench scientist (see Table one in).
Essentially, RNA-Seq consists of five distinct phases, 1) RNA isolation, 2) library preparation, 3) sequencing-by-synthesis, 4) mapping of raw reads to a reference transcriptome or genome and 5) determining significance for differential gene expression (for review see). In an effort to familiarize the bench scientist with the post-sequencing analysis of RNA-Seq data (phase 5), we have developed an analysis strategy based on currently available bioinformatics tools. Here, we compare three statistical tools used to analyze differential gene expression: edgeR, DESeq and Limma[9–11]. Based on their performance, we present an analysis strategy that combines these tools in order to generate an optimized list of genes that are differentially expressed. Finally, we highlight several aspects of RNA-Seq analysis that have the potential to lead to misleading conclusions and discuss options to minimize these pitfalls.
Generating high quality reads is dependent on initial RNA quality
Post sequencing analysis of raw reads
Number of reads
Q.S. (interquartile range)
Utilizing statistical tools that are compatible with RNA-Seq data
The raw reads described in Table 1 were aligned to the soybean reference transcriptome using the mapping tool Novoalign, a short read aligner demonstrated to be highly accurate[13, 14]. When differential expression was analysed subsequently, the total number of genes with significantly altered transcript abundance in plants exposed to elevated ozone was 11,995 for edgeR, 11,317 for DESeq and 9,131 for Limma. Since RNA-Seq generates count data, it is more appropriate to use a discrete probability distribution to analyze differential gene expression. Therefore, edgeR and DESeq, which are based on the negative binomial distribution, are compatible with the data generated by RNA-Seq[9, 10]. In contrast, Limma was adapted to analyze RPKM values using a method previously developed for continuous data from microarray studies (fluorescence values) and is based on the t-distribution. The Limma method was clearly very different from the two negative binomial distribution methods, but even between edgeR and DEseq there were 678 additional genes identified by edgeR as differentially expressed, representing approximately 6% of the significant genes.
Workflow for RNA-Seq data optimization
Comparing the accuracy of RNA-seq data with qRT-PCR
Potential pitfalls and limitations of RNA-Seq analysis
Functional category significance of optimized and marginally significant genes
# of genes
# of genes
# of genes
Major CHO metabolism
Mitochondrial electron transport / ATP synthesis
Statistical limitations are revealed by independent analysis of ozone-responsive genes
Light-harvesting complex II CAB protein
7733.20 ± 783.0
4575.86 ± 429.3
1504.18 ± 149.8
2425.15 ± 206.1
265.49 ± 11.1
220.79 ± 17.8
84.73 ± 2.6
107.98 ± 4.4
70.47 ± 8.6
18.91 ± 9.2
11.31 ± 1.4
34.68 ± 6.8
Protein degradation DER1 like
5.23 ± 0.2
44.52 ± 11.8
2.90 ± 0.4
5.64 ± 1.3
Starch synthase catalytic domain
2.84 ± 0.8
0.12 ± 0.1
WRKY trascription factor
1.69 ± 0.4
70.92 ± 20.80
While the aim of this paper is to familiarize the molecular biologist interested in undertaking an RNA-Seq project with the methods and issues related to post-sequencing analysis, emphasis still needs to be placed on proper handling of RNA samples. Here, we isolated high quality RNA (Additional file1) using a well-established protocol for soybean leaf tissue. In addition, care was taken during the library construction and sequencing-by-synthesis phases, as evidenced by the high quality scores for each sample (Table 1). As a result, the average number of usable reads per sample was >20 million, which is the recommended depth required to quantify differential expression in a species with a referenced genome.
It is also important to utilize a valid experimental design for RNA-Seq projects, which includes the use of biological replicates. Reports demonstrating highly reproducible RNA-Seq results[2, 22] make it tempting to reduce sequencing costs by only using one replicate per treatment group. However, without replication it is impossible to estimate error, without which there is no basis for statistical inference. Therefore, it is recommended that RNA-Seq experiments include at least three biological replicates per treatment group, as was done in the experiment presented here.
Along these lines, it is important to understand the nature of RNA-Seq data and why it is necessary to use a compatible statistical method, such as a negative binomial distribution[9, 10]. For discrete variables such as count data, it is possible to associate all observed values with a non-zero probability. In contrast, there is zero probability that a specific fluorescence value (continuous variable) will be obtained from microarray hybridization. This distinction is important in the context of the varying number of total reads obtained for individual RNA-Seq samples. For example, the probability of mapping 100 reads out of 16.86 million (Table 1; Sample3) for a particular gene is different than mapping 100 reads out of 36.41 million (Table 1; Sample1). To deal with this issue, both edgeR and DESeq normalize the read data based on the total number of reads per sample prior to differential expression analysis.
The main goal of this work was to compare the accuracy of two statistical tools, edgeR and DEseq. At first glance, it appears that both tools perform equally well (Figure 1, Step B). However, when the differentially expressed genes from edgeR and DEseq were intersected (Figure 1, Step C), quite a few genes from each list were eliminated (2,242 total genes). Because of this, we adopted a strategy to identify genes that were determined to be differentially expressed by both edgeR and DESeq. In other words, greater confidence was achieved if a gene was determined significant by each of the statistical tools.
Expression data for low abundance genes
Genes turned on
0.08 ± 0.032
0.30 ± 0.206
0.17 ± 0.054
0.14 ± 0.057
0.06 ± 0.030
Genes turned off
0.20 ± 0.050
0.17 ± 0.084
0.54 ± 0.292
0.57 ± 0.372
0.14 ± 0.013
Another aspect that has the potential to confound RNA-Seq analysis deals with the issue of statistical stringency. In Table 2, we demonstrated that for several functional categories, the marginally significant genes eliminated from the optimized list did, in fact, respond to elevated ozone in a manner similar to the genes ultimately determined to be differentially expressed. Therefore, it may be more appropriate to perform network analysis for individual metabolic or signal transduction pathways using the entire RNA-Seq dataset, not just the genes determined to be differentially expressed. However, this strategy is limited by pathways that have been previously characterized and would fail to uncover new connections, especially unknown signalling relationships.
One final issue revealed by this analysis was the increase in type II error for high abundance genes (Table 3 and Figure 4). Several of the genes determined not to be differentially regulated by one or both of the statistical tools are involved with processes that have been well characterized to be regulated to elevated ozone, including decreased photosynthesis (Glyma05g25810 and Glyma17g37280), increased antioxidant capacity (Glyma11g11460) and increased protein turnover (Glyma20g27950). However, these genes were determined to be differentially expressed based on statistical analysis of RPKM values. This problem undermines the effectiveness of performing RNA-Seq analysis to uncover novel relationships because it will fail to identify all of the high abundance genes that are differentially regulated in response to elevated ozone; genes that are more likely to impact biological processes, especially metabolic functions.
There are many new challenges facing the bench scientist when undertaking an RNA-Seq project, especially regarding the large number of bioinformatics tools that have been developed to analyze the post-sequencing dataset[28–32]. Here, we provide a step-by-step guide for analyzing RNA-seq data. In addition, we identified limitations that exist for widely used methods to determine differential expression of RNA-seq data. Therefore, we suggest that our strategy to merge the common genes identified by multiple tools and examine the eliminated genes is an improvement that better ensures confidence in generating a list of differentially expressed genes. We also demonstrate that the results obtained from a select set of genes using qRT-PCR closely agree with the RNA-Seq data. Because of this high accuracy, we envision RNA-Seq replacing microarrays as the new standard for global transcript quantification.
Soybean plants (Glycine max cv. Be Sweet 292) were grown in environmentally controlled growth chambers for six weeks in either ambient or elevated ozone conditions (150 ppb for 8 h d-1). Tissue was collected from mature leaves and ground to a fine powder in a liquid nitrogen cooled mortar and pestle. Total RNA was isolated following the protocol of Bilgin et al. and DNase treated using the TURBO DNA-free kit (Life Technologies, Grand Island, NY). Each sample (5 μg) was resolved on a 1% agarose gel containing 40 mM MOPS (pH 7.0), 2 mM EDTA (pH 8.0) and 5 mM iodoacetamide. Before loading the gel, each sample was diluted to 10μL with nuclease free water and heated at 70°C for 5 min along with 7.5μL MOPS/EDTA buffer and 5μL formaldehyde (37% wt.).
Library preparation and sequencing-by-synthesis
The DNase-treated RNA (1 μg) was used to prepare individually barcoded RNA-Seq libraries with the TruSeq RNA Sample Prep kit (Illumina, San Diego, CA). Pools of two samples per lane were sequenced on a HiSeq2000 for 100 cycles using version 2 chemistry and analysis pipeline 1.7 according to the manufacturer's protocols (Illumina, San Diego, CA). All raw data has been submitted to the NCBI [GenBank:SRP009826].
Aligning raw reads to the soybean transcriptome
Illumina sequences from each of the samples from three biological replicates of control and treatment (elevated ozone) were cleaned using the FASTX toolkit, with a minimum quality score of 20 and minimum length of 75 nt. Soybean genome (Gmax_109) and gff file (Gmax_109.gff3) were downloaded from phytozome (http://www.phytozome.net/soybean). Soybean transcripts were extracted from the genome sequences based on the.gff file. These soybean transcripts (46,367 transcripts) were considered as reference transcriptome for RNA-Seq analysis.
Mapping of Illumina sequences with Novoalign was done with –H (for hard clipping the reads), –l 65, -rA10 (to allow 10 multiple alignments). With these parameters at least 90% of the each read's length should map to the reference to consider it as a mapped read. After mapping with Novoalign, read counts for each gene were generated using PERL scripts. These reads counts were used for statistical analysis using DESeq and edgeR packages of ‘R’ to determine differential expression at the gene level. Since approximately 92% of the mapped reads aligned to the transcriptome uniquely, multireads were not considered. All biological replicates demonstrated a >0.93 correlation when RPKM values were compared, indicating high reproducibility of replicates. See online user guides for more information about performing alignments with Novoalign (http://www.novocraft.com/wiki/tiki-index.php).
Gene lengths and count data for the three independent control and ozone-treated replicates were used to analyze differential expression using R software (Version 2.13.0). The Limma-RPKM method is based on a two-group Affymetrix dataset design included as part of the Limma package[11, 17]. For the edgeR analysis, the trimmed mean of the M values method (TMM; where M = log2 fold change) was used to calculate the normalization factor and quantile-adjusted conditional maximum likelihood (qCML) method for estimating dispersions was used to calculate expression differences using an exact test with a negative binomial distribution[9, 15, 34]. For the DESeq analysis, differential expression testing was performed using the negative binomial test on variance estimated and size factor normalized data. All p-values presented were adjusted for false discovery rate to control for type I error due to multiple hypothesis testing. The programming code for each of the specific packages can be found by viewing the vignette details in R using the 'openVignette()' command.
Log2 fold change values were loaded into the MapMan expression tool to link gene identifiers with functional annotations using the Gmax_109_peptide mapping file. This tool automatically analyzes functional category significance base on the Wilcoxon rank sum test.
Differential expression of RPKM normalized data was tested by ANOVO and corrected for multiple comparisons following the methods of Benjamini and Hochberg (1995) with a false discovery rate of 0.25 using SAS (Version 9.2, Cary, NC; Table 4).
First-strand cDNA synthesis was performed using 1 μg of DNase treated RNA and was reverse transcribed in a 20 μl reaction with Superscript II (Life Technologies, Grand Island, NY) and oligo(dT) primers according to the manufacturer's instructions. Quantitative PCR was performed on an Applied Biosystems 7900HT Fast Real-Time PCR System (Life Technologies, Grand Island, NY) using Power SYBR Green PCR master mix (Life Technologies, Grand Island, NY) and 400nM of each primer in a 10 μl reaction. Primers were aliquoted onto a 384-well PCR plate using a JANUS automated liquid handling system (Perkin Elmer, Waltham, MA). The following are the primer sequences for each of the target genes: Rubisco (Glyma19g06340), primer A- GCACAATTGGCAAAGGAAGT, primer B- GAGAAGCATCAGTGCAACCA; LHCA5 (Glyma06g04280), primer A- GTGGAGCATCTTTCCAATCC, primer B- TGGATAAGCTCAAGCCCAAG; SBPase (Glyma11g34900), primer A- ATAAGTTGACCGGCATCACC, primer B- GGGTTGTCAGATGTGGCTCT; starch synthase (Glyma13g27480), primer A- GACCCTCTCGATGTTCAAGC, primer B- ATTCTCTGAGGTGGCAATGG; glutaredoxin (Glyma13g30770), primer A- AATCCAATGGCACCTATCCA, primer B- AGGGTTCACTCCCAGACCTT. Target gene expression was normalized to cons14. Each PCR amplification curve was analyzed with LinRegPCR software to calculate the PCR efficiency and threshold value from the baseline-corrected delta-Rn values in the log-linear phase. The normalized expression level for each gene was determined as reported in.
Availability of supporting data
The data set supporting the results of this article is included within the article (and its additional files).
The authors would like to thank Matt Hudson for critical review of the manuscript and Alvaro Hernandez and the High-Throughput Sequencing and Genotyping Unit in the W.M. Keck Center for Comparative and Functional Genomics at the University of Illinois at Urbana-Champaign for carrying out the library preparation and RNA sequencing. This work was funded by the National Soybean Research Laboratory's Soybean Disease Biotechnology Center.
- Wang Z, Gerstein M, Snyder M: RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMedPubMed CentralView Article
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMedPubMed CentralView Article
- Brautigam A, Gowik U: What can next generation sequencing do for you? Next generation sequencing as a valuable tool in plant research. Plant Biology. 2010, 12: 831-841. 10.1111/j.1438-8677.2010.00373.x.PubMedView Article
- Nowrousian M: Next-generation sequencing techniques for eukaryotic microorganisms: sequencing-based solutions to biological problems. Eukaryot Cell. 2010, 9: 1300-131015. 10.1128/EC.00123-10.PubMedPubMed CentralView Article
- Perez-Enciso M, Feretti L: Massive parallel sequencing in animal genetics: wherefroms and wheretos. Anim Genet. 2010, 41: 561-56913. 10.1111/j.1365-2052.2010.02057.x.PubMedView Article
- Croucher NJ, Thomson NR: Studying bacterial transcriptomes using RNA-seq. Curr Opin Microbiol. 2010, 13: 619-624. 10.1016/j.mib.2010.09.009.PubMedPubMed CentralView Article
- Sutherland GT, Janitz M, Kril JJ: Understanding the pathogenesis of Alzheimer's disease: will RNA-Seq realize the promise of transcriptomics?. J Neurochem. 2011, 166: 937-946.View Article
- Garber M, Grabher MG, Guttman M, Trapnell : Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8: 469-477. 10.1038/nmeth.1613.PubMedView Article
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009, 26: 139-140.PubMedPubMed CentralView Article
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106-10.1186/gb-2010-11-10-r106.PubMedPubMed CentralView Article
- Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article 3-
- Schmutz , et al: Genome sequence of the palaeopolyploid soybean. Nature. 2010, 463: 178-183. 10.1038/nature08670.PubMedView Article
- Ruffalo M, LaFramboise T, Koyuturk M: Comparative analysis of algorthms for next-generation sequencing read alignment. Bioinformatics. 2011, 27: 2790-2796. 10.1093/bioinformatics/btr477.PubMedView Article
- Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010, 11: 473-483. 10.1093/bib/bbq015.PubMedPubMed CentralView Article
- Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23: 2881-2887. 10.1093/bioinformatics/btm453.PubMedView Article
- Cloonan , et al: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619. 10.1038/nmeth.1223.PubMedView Article
- Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, Springer, New York, 397-420.View Article
- Ainsworth EA, Yendrek CR, Sitch S, Collins WJ, Emberson LD: The effects of tropospheric ozone on net primary production and implications for climate change. Annu Rev Plant Biol. 2012, 63: 637-661. 10.1146/annurev-arplant-042110-103829.PubMedView Article
- Thimm O, Blaesing O, Gibon Y, Nagel A, Meyer S, Krüger P, Selbig J, Müller LA, Rhee SY, Stitt M: MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 2004, 37: 914-939. 10.1111/j.1365-313X.2004.02016.x.PubMedView Article
- Bilgin DD, DeLucia EH, Clough SJ: A robust plant RNA isolation method suitable for Affymetrix GeneChip analysis and quantitative real-time RT-PCR. Nat Protoc. 2009, 4: 333-340. 10.1038/nprot.2008.249.PubMedView Article
- Li H, Lovci MT, Kwon YS, Rosenfeld MG, Fu XD, Yeo GW: Determination of tag density required for digital transcriptome analysis: Application to an androgen-sensitive prostate cancer model. PNAS. 2008, 105: 20179-20184. 10.1073/pnas.0807121105.PubMedPubMed CentralView Article
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.PubMedView Article
- Fisher RA: The design of experiments. 1951, Edinburgh, Oliver and Boyd Ltd, 6
- Auer P, Doerge RW: Statistical design and analysis of RNA sequencing data. Genetics. 2010, 185: 405-416. 10.1534/genetics.110.114983.PubMedPubMed CentralView Article
- Leakey ADB, Xu F, Gillespie KM, McGrath JM, Ainsworth EA, Ort DR: Genomic basis for stimulated respiration by plants growing under elevated carbon dioxide. PNAS. 2009, 106: 3597-3602. 10.1073/pnas.0810955106.PubMedPubMed CentralView Article
- Conklin PL, Barth C: Ascorbic acid, a familiar small molecule intertwined in the response of plants to ozone, pathogens, and the onset of senescence. Plant Cell Environ. 2004, 27: 959-970. 10.1111/j.1365-3040.2004.01203.x.View Article
- Pell EJ, Schlagnhaufer CD, Arteca RN: Ozone-induced oxidative stress: Mechanisms of action and reaction. Physiol Plant. 1997, 100: 264-273. 10.1111/j.1399-3054.1997.tb04782.x.View Article
- Howe EA, Sinha R, Schlauch D, Quackenbush J: RNA-Seq analysis in MeV. Bioinformatics. 2011, 27: 3209-3210. 10.1093/bioinformatics/btr490.PubMedPubMed CentralView Article
- Cumbie , et al: GENE-Counter: a computational pipeline for the analysis of RNA-Seq data for gene expression differences. PLoS One. 2011, 6: e25279-10.1371/journal.pone.0025279.PubMedPubMed CentralView Article
- Zhao WM, et al: wapRNA: a web-based application for the processing of RNA sequences. Bioinformatics. 2011, 27: 3076-3077. 10.1093/bioinformatics/btr504.PubMedView Article
- Wang L, Si YQ, Dedow LK, Shao Y, Liu P, Brutnell TP: A low-cost library construction protocol and data analysis pipeline for Illumina-based strand-specific multiplex RNA-Seq. PLoS One. 2011, 6: e26426-10.1371/journal.pone.0026426.PubMedPubMed CentralView Article
- Zytnicki M, Quesneville H: S-MART, a software toolbox to aid RNA-seq data analysis. PLoS One. 2011, 6: e25988-10.1371/journal.pone.0025988.PubMedPubMed CentralView Article
- R Development Core Team: R: A language and environment for statistical computing. 2011, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URLhttp://www.R-project.org/,
- Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321-332.PubMedView Article
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995, 57: 289-300.
- Libault M, Thibivilliers S, Bilgin DD, Radwan O, Benitez M, Clough SJ, Stacey G: Identification of four soybean reference genes for gene expression normalization. Plant Genome. 2008, 1: 44-54. 10.3835/plantgenome2008.02.0091.View Article
- Ruijter JM, Ramakers C, Hoogaars WM, Karlen Y, Bakker O, van den Hoff MJ, Moorman AF: Amplification efficiency: linking baseline and bias in the analysis of quantitative PCR data. Nucleic Acids Res. 2009, 37: e45-10.1093/nar/gkp045.PubMedPubMed CentralView Article
- Gillespie KM, Rogers A, Ainsworth EA: Growth at elevated ozone or elevated carbon dioxide concentration alters antioxidant capacity and response to acute oxidative stress in soybean (Glycine max). J Exp Bot. 2011, 62: 2667-2678. 10.1093/jxb/erq435.PubMedView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.