Impact of RNA-seq attributes on false positive rates in differential expression analysis of de novo assembled transcriptomes
© González and Joly; licensee BioMed Central Ltd. 2013
Received: 18 June 2013
Accepted: 20 November 2013
Published: 3 December 2013
High-throughput RNA sequencing studies are becoming increasingly popular and differential expression studies represent an important downstream analysis that often follow de novo transcriptome assembly. If a lot of attention has been given to bioinformatics tools for differential gene expression, little has yet been given to the impact of the sequence data itself used in pipelines.
We tested how using different types of reads from the ones used to assemble a de novo transcriptome (both differing in length and pairing attributes) could potentially affect differential expression (DE) results. To investigate this, we created artificial datasets out of long paired-end RNA-seq datasets initially used to build the assembly. All datasets were compared via DE analyses and because all samples come from the same sequencing run, DE of genes or isoforms can be interpreted as false positives resulting from sequence attributes. If the false positive rate for differential gene expression does not seem to be strongly affected by sequencing strategy (max. of 3.5%), it could reach 12.2% or 28.1% for differential isoform expression depending of the pipeline used. The effect of paired-end vs. single-end strategy was found to have a much greater impact in terms of false positives than sequence length.
In light of false positive rate results, we recommend using paired-end over single-end sequences in differential expression studies, even if the impact is less serious for differential gene expression.
Keywordsde novo transcriptome assembly Differential gene expression Differential isoform expression Non-model organisms False positive rates RNA-seq
Recent advances in massively parallel sequencing technologies have created huge opportunities in the field of transcriptomics [1–4]. High-throughput RNA sequencing, or RNA-seq, together with the development of powerful computational methods, have given researchers the opportunity to study non-model organisms by assembling de novo transcriptomes, i.e. without a reference genome or transcriptome [5–7]. Such computational methods are being increasingly used as species for which a referenced genome or transcriptome exists represent a tiny fraction of all species.
Beyond transcriptome assembly, one of the great advantages of RNA-seq is to allow gene or isoform expression quantification and the evaluation of their differential expression under different conditions. By providing direct quantification of all transcripts of a sample, RNA-seq data has the potential to overcome several limitations of previous approaches that were limited to small scaled studies (quantitative PCR) or that required important genomic knowledge and resources (microarrays) . For instance, RNA-seq data has been rapidly found to outperform microarrays approaches .
Despite these great promises, accurate and reliable differential expression studies are still challenging and a lot of attention has been given to bioinformatics tools [10, 11]. These challenges involve, for example, mapping sequencing reads to a reference transcriptome and statistically testing for differential expression. Yet, surprisingly little attention has been given to the impact of sequence types used as input on differential expression analyses. Practically, differential expression studies requires two decisions in terms of sequencing parameters, i.e. RNA-seq reads length and the possibility of sequencing RNA fragments from one end (single-end reads) or both ends (paired-end reads). Of course, long paired-end reads are the best choice for constructing an assembly as they outperform single-end alignment in terms of both sensitivity and specificity . However, single-end reads might be favored for economic reasons, especially for gene quantification as an increase in sensitivity and specificity might not always be required. For instance, although paired-end reads are generally thought to be imperative for correctly estimating isoform-specific expression, they might not be required for gene expression. As such, it could seem appropriate to assemble a transcriptome using paired-end reads, but to use single-end reads when quantifying gene expression on biological replicates. This common belief has rarely been tested and little is known about the impact of sequencing decisions on differential expression results accuracy.
The objective of this study is to test how sequencing strategies (i.e. sequencing length and pairing options) affects false positive rates in differential expression results at both gene and isoform level. We created different types of sequence datasets from a long paired-end sequencing run and analyzed the datasets with each other to evaluate the impact of the sequence type on false positive rates. Our results show that the choice of sequence used in differential gene expression studies can sometimes have an important impact on the accuracy of the results.
Results and discussion
Raw data and trinity assemblies statistics
Number of sequences
Transcriptome length (bp)
Short-read sequences mapping
Percentage of initial reads mapped back to the de novo transcriptome
Differential expression analyses
Comparing Figures 2 and 3 shows that pipeline strategies do not have much impact on DE at gene level: differences observed in false positive rates are minor. Interestingly, these results suggest that the use of a gapped alignment, which may seem unnecessary as reads are directly mapped to transcripts, does not affect the results. In fact, Bowtie2 showed extremely good results when coupled to eXpress, the latter handling multimapping very well. At the isoform level, on the contrary, the second pipeline clearly outperforms the first one (Figures 2 and 3). This result is not totally unexpected as EBSeq approach was developed mainly for isoform differential expression studies . Because pipeline 2 gives better results at the isoform level compared to pipeline 1, only results from this pipeline will be discussed in the following for concision and clarity purposes. However, the pipeline choice does not affect our conclusions.
In both species, reducing sequence length had little impact on gene expressions (Figure 3, “Length” lines): FPR adds up to 0.3% for PE sequences and about 1% for SE sequences. In the isoform approach, FPR gets approximately six times higher: about 2% for PE sequences and 6% for SE sequences. Sequence length influences FPR as shorter reads have potentially more multimapping events than longer reads as they are less specific. This uncertainty results in different estimated counts for genes or isoforms compared to longer sequences.
Paired-end vs. single end reads
To investigate the effect of reads pairing in differential expression analyses, DE was tested between samples in which counts were estimated with paired-end sequences (100 and 50 bp) to samples in which counts were estimated with single-end sequences (100 and 50 bp). Gene DE analysis of paired versus unpaired sequences resulted in FPR of 1.7% and 1.6% (100 bp) and 3.1% and 2.7% (50 bp), for Salix and Rhytidophyllum respectively (Figure 3, “Type” lines). Whereas shortening sequence lengths had little consequences in gene DE analyses, removing pairing attributes had more impact on FDR. As for sequence length, isoform DE is more affected in terms of FDR: for Salix, FPR reaches 8.9% (100 bp) and 12.2% (50 bp), while it is 8.2% (100 bp) and 11.3% (50 bp) for Rhytidophyllum. Again, shorter hence less specific sequences result in higher FPR.
Lastly, DE results between datasets that varied in both sequence types and lengths produced the expected results given the previous observations on length and pairing attributes. The highest FPR were observed when longest PE sequences (100 bp) were tested against the shortest SE sequences (50 bp). Both Rhytidophyllum and Salix showed a FPR of 2.9% for gene analysis and close to 12% for isoform analysis (Figure 3, “L + T” lines). When shorter PE sequences (50 bp) were tested against longer SE sequences (100 bp), false positive rates were slightly lower both for genes (1.8% and 1.9% for Rhytidophyllum and Salix, respectively) and isoforms (8.4% and 9.4% for Rhytidophyllum and Salix, respectively).
Gene or isoform expression
FDR suggest that the impact of the type of sequence used on DE is greater for isoforms than for genes (Figures 2, 3). Overall, isoform DE analysis on Salix led to 4.6 times more false positives than gene DE analysis. Interestingly, very similar proportions were found for Rhytidophyllum (4.7 ×). These results are not really surprising. Indeed, mapping uncertainty resulting from smaller and SE sequences is expected to be most important between isoforms of a single gene and less rarely among genes. Consequently, it is reassuring that gene DE is less affected by sequence type. Nevertheless, one could argue that 1% of false positive is not completely trivial considering the number of genes involved in such analyses.
Effect of the transcriptome assembly strategy
Limitations of the study
Our objective here was not to thoroughly explore all possible parameters that could affect DE analyses with respect to the sequencing strategy (e.g. sensitivity). Instead, we wanted to investigate whether DE analyses could be affected by the choice in sequencing strategy and broadly quantify this error. We thus acknowledge that there are limitations to the extrapolation of our conclusions to other conditions or organisms. A first limitation is related to the number of DE analyses pipelines investigated. Although more pipelines could have been explored, this was clearly not the aim of the study as aspects of different pipelines have been compared elsewhere [1, 3, 21–24]. Our approach was to use very distinct approaches to validate our results. Because the two pipelines gave similar results, we think the bioinformatics aspects of this study did not affect the main conclusions. Another limitation comes from the fact that only two plants have been studied and it consequently might be difficult to extrapolate our results to other organisms. Yet, because these two species diverged more than 100 mya  and because different tissues were used (buds and flowers), we think our results are probably quite general in plants and that they could even perhaps be extended to other eukaryotes that have similar transcriptome characteristics (e.g., size, isoform numbers, etc.). Finally, we purposely did not include any biological replicate. Such replicates are mandatory when analyzing differential expression as it allows distinguishing treatment effects from individual variance within treatments. Adding biological replicates would probably have resulted in finding fewer DE transcripts and genes in our analyses. But we deliberately chose to ignore any replicate because we want to solely observe the pure consequence of a given sequencing strategy. Hence, our results show the intrinsic error due to the sequence type in a DE experiment.
Rhytidophyllum vernicosum transcript counts for a gene and total number of transcript counts for 100 bp PE and 50 bp SE sequence datasets
100 bp PE counts
50 bp SE counts
Overall, our results suggest that paired-end sequence is relatively crucial for obtaining precise isoform differential expression. We also suggest using paired-end sequences for gene DE, even though this is less critical. Indeed, the mapping uncertainty remains important for gene count estimation with single-end sequences: comparisons of counts obtained with paired-end vs. single-end resulted in almost 2% false positives. If someone is to make economies, the best solution seems to be to sequence 50 bp paired-ends rather than 100 bp paired-ends (for fragments of the same length). This approach seems to result in very small differences in count estimation for both genes and isoforms.
Prior to assembling reads, Trimmomatic  was used to remove bad quality Illumina RNA-seq data and trim poor quality nucleotides at the beginning and the end of each sequence. The following parameters were used in the command line: LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:40. For the assembly, Trinity software  was used to reconstruct the transcriptome de novo using default settings. Gene sequences were obtained using isoform union method consisting on qualifying as “gene”, the union of transcripts identified by Trinity as isoforms of the same gene.
We used two different approaches to map RNA-seq reads to the reference transcriptome: Bowtie  that maps reads to a reference without allowing any gap, and Bowtie2  that allows gaps during mapping. Bowtie was run as a part of Trinity pipeline. The following parameters were used in the command line: alignReads.pl --left R1.fastq --right R2.fastq --target Trinity.fasta --seqType fq --aligner bowtie --max_dist_between_pairs 800 -- -p 16. In Bowtie 2, the following parameters were used: Bowtie2 -X 800 -p 16 -x Bowtie_Index -1 R1.fastq -2 R2.fastq | samtools view -Sb.
We used two different algorithms to compute abundances: an expectation-maximization algorithm (RSEM ) and an online method algorithm (eXpress ). As part of Trinity proposed pipeline (pipeline 1), RSEM was used to calculate isoform and gene abundances. The following parameters were used in the command line: run_RSEM_align_n_estimate.pl --transcripts Trinity.fasta --seqType fq --left R1.fq --right R2.fq. In pipeline 2, eXpress was coupled to bowtie2 aligner to calculate isoforms and genes abundances.
The function of differential expression analysis is to point up isoforms or genes for which abundances changed significantly across experimental conditions. EdgeR , used in pipeline 1, handles the lack of biological replicate by simulating it, although the variance parameter was hard to evaluate. We chose to use 0.01 for this parameter since this is the value proposed for technical replicates. We ran EdgeR as part of the Trinity pipeline with the following command line: run_DE_analysis.pl --matrix counts.matrix --method edgeR –dispersion 0.01. EBSeq , used in pipeline 2, was developed specifically to counter biases in isoform differential expressions. We followed the EBSeq user manual instructions and used 15 iterations for convergence at a FDR of 5%.
The authors to thank Julie Marleau and Hermine Alexandre for laboratory work, Werther Guidi for providing willow plant material, and Adriana Almeida-Rodriguez, Frederic Pitre and Michel Labrecque for support with this project. The authors also acknowledge the help of the Genome Québec Innovation Centre for advices and sequencing. This study was supported by a Genome Canada and Genome Quebec grant (Genorem Project) and a NSERC discovery grant (SJ).
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.PubMedView ArticleGoogle Scholar
- Gahlan P, Singh HR, Shankar R, Sharma N, Kumari A, Chawla V, Ahuja PS, Kumar S: De novo sequencing and characterization of Picrorhiza kurrooa transcriptome at two temperatures showed major transcriptome adjustments. BMC Genomics. 2012, 13: 126-10.1186/1471-2164-13-126.PubMedPubMed CentralView ArticleGoogle Scholar
- Ward JA, Ponnala L, Weber CA: Strategies for transcriptome analysis in nonmodel plants. Am J Bot. 2012, 99: 267-276. 10.3732/ajb.1100334.PubMedView ArticleGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMedPubMed CentralView ArticleGoogle Scholar
- Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng HW: Comparative studies of de novo assembly tools for next-generaten sequencing technologies. Bioinformatics. 2011, 27: 2031-2037. 10.1093/bioinformatics/btr319.PubMedPubMed CentralView ArticleGoogle Scholar
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, et al: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.PubMedView ArticleGoogle Scholar
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMedPubMed CentralView ArticleGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Fonseca NA, Rung J, Brazma A, Marioni JC: Tools for mapping high-throughput sequencing data. Bioinformatics. 2012, 28: 3169-3177. 10.1093/bioinformatics/bts605.PubMedView ArticleGoogle Scholar
- Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNA-seq: a matter of depth. Genome Res. 2011, 21: 2213-2223. 10.1101/gr.124321.111.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010, 11: 473-483. 10.1093/bib/bbq015.PubMedPubMed CentralView ArticleGoogle Scholar
- Chang S, Puryear J, Cairney J: A simple and efficient method for isolating RNA from pine trees. Plant Mol Biol Report. 1993, 11: 113-116. 10.1007/BF02670468.View ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMedPubMed CentralView ArticleGoogle Scholar
- Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinforma. 2011, 12: 323-10.1186/1471-2105-12-323.View ArticleGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7: 562-578. 10.1038/nprot.2012.016.PubMedPubMed CentralView ArticleGoogle Scholar
- Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C: EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013, 29 (8): 1035-1043. 10.1093/bioinformatics/btt087.PubMedPubMed CentralView ArticleGoogle Scholar
- Lindner R, Friedel CC: A comprehensive evaluation of alignment algorithms in the context of RNA-seq. PLoS One. 2012, 7: e52403-10.1371/journal.pone.0052403.PubMedPubMed CentralView ArticleGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8: 469-477. 10.1038/nmeth.1613.PubMedView ArticleGoogle Scholar
- Zhao QY, Wang Y, Kong YM, Luo D, Li X, Hao P: Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinforma. 2011, 12 (14): S2-View ArticleGoogle Scholar
- Oshlack A, Robinson MD, Young MD: From RNA-seq reads to differential expression results. Genome Biol. 2010, 11: 220-10.1186/gb-2010-11-12-220.PubMedPubMed CentralView ArticleGoogle Scholar
- Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinforma. 2013, 14: 91-10.1186/1471-2105-14-91.View ArticleGoogle Scholar
- Soltis DE, Bell CD, Kim S, Soltis PS: Origin and early evolution of angiosperms. Ann N Y Acad Sci. 2008, 1133: 3-25. 10.1196/annals.1438.005.PubMedView ArticleGoogle Scholar
- Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B: RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012, 40: W622-W627. 10.1093/nar/gks540.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.