Gene expression analysis: are we doing it wrong?


In the last few weeks, three new preprints have come out suggesting that like Jack Butler dropping his kids off at school in the movie Mr. Mom, when it comes to differential gene expression analyses, we’re doing it wrong.
Popular methods give excessive false positives
Rocke et al. tested the type I error rate (i.e., concluding a gene is significantly differentially expressed when it is not) by reanalyzing previously published datasets and by simulating new datasets with no real differences between groups, which should therefore yield zero differentially expressed genes.
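To get a feel for this kind of null test, here is a minimal sketch in Python (my own illustration, not the authors' pipeline, which ran edgeR and limma-voom in R): both groups are drawn from one and the same negative binomial distribution and then screened with a simple per-gene test, so every gene that comes out "significant" is a false positive.

```python
# Minimal null simulation: two groups from the SAME negative binomial
# distribution, so any gene called differentially expressed is a false positive.
# (A plain t-test on log counts stands in for a real DE method here.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 10_000, 3
mu = rng.lognormal(mean=5, sigma=1, size=n_genes)   # per-gene mean counts
alpha = 0.2                                         # NB dispersion: var = mu + alpha * mu**2
n_param = 1.0 / alpha                               # numpy's (n, p) parameterization
p_param = n_param / (n_param + mu)

# both "groups" come from the same distribution -> zero truly DE genes
counts = rng.negative_binomial(n_param, p_param[:, None], size=(n_genes, 2 * n_per_group))

logc = np.log2(counts + 1)
_, pvals = stats.ttest_ind(logc[:, :n_per_group], logc[:, n_per_group:], axis=1)

print(f"genes with p < 0.05: {(pvals < 0.05).sum()} "
      f"(~{0.05 * n_genes:.0f} expected by chance)")
```

A well-calibrated method should call roughly 5% of genes at a nominal 0.05 cutoff and essentially none after multiple-testing correction; the point of Rocke et al.'s simulations is that some popular methods report far more than that.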

The number of expected and detected differentially expressed genes in simulated data (10,000 genes) analyzed using edgeR and limma-voom. Table from Rocke et al.


Their results suggest that methods based on the negative binomial distribution (such as edgeR and DESeq, among others) produced alarmingly high numbers of false positives. In contrast, the number of differentially expressed genes found using the limma package, which was originally developed for microarray analyses and fits a standard linear model for each gene, was much closer to the expected values.

The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
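A back-of-envelope calculation (my own illustration, not taken from the preprint) shows why even a small downward bias in the dispersion estimate matters at high counts: the alpha * mu^2 term dominates the count variance when the mean is large, so underestimating alpha shrinks the standard error a Wald-type test divides by and nudges borderline genes past the significance threshold.

```python
# For a negative binomial count, Var = mu + alpha * mu^2, so at high means the
# dispersion alpha controls the variance. Compare a hypothetical true dispersion
# with a slightly underestimated one for a borderline gene.
import math

mu, n = 1000.0, 3                    # a high-count gene, 3 replicates per group
alpha_true, alpha_hat = 0.20, 0.15   # hypothetical true vs. underestimated dispersion

def se_diff(alpha):
    """SE of the difference between two group means of n NB(mu, alpha) counts."""
    return math.sqrt(2 * (mu + alpha * mu ** 2) / n)

diff = 2 * se_diff(alpha_true)       # a difference sitting exactly at z = 2 under the truth
for label, a in [("true dispersion", alpha_true), ("underestimated", alpha_hat)]:
    z = diff / se_diff(a)
    p = math.erfc(z / math.sqrt(2))  # two-sided normal tail probability
    print(f"{label:>15}: z = {z:.2f}, p = {p:.3f}")
# A borderline gene (p ~ 0.046) drops to p ~ 0.021 with the biased dispersion;
# repeated over thousands of genes, small biases like this add up to an excess
# of false positives concentrated where mu and alpha are large.
```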

Normalization procedures hinder detection of true gene expression levels
Transcriptomic data analyses commonly include a step that normalizes gene expression levels, making them comparable across samples. A key assumption of common normalization methods (such as median and quantile normalization for microarrays and RPKM and TMM for RNA-Seq data) is that most genes are not differentially expressed. In their recent preprint, Roca et al. show that this null model of "lack of variation" is unrealistic and prevents us from detecting real variation in expression levels.
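A toy example with made-up numbers shows what goes wrong when that assumption fails: if a large fraction of genes genuinely changes between samples, forcing the sample medians to match shrinks the real changes and makes unchanged genes look differentially expressed.

```python
# Toy illustration of the "most genes are not DE" assumption behind median-type
# normalization (made-up numbers, not from Roca et al.).
import numpy as np

rng = np.random.default_rng(0)
n_genes = 10_000
baseline = rng.lognormal(mean=6, sigma=1, size=n_genes)

sample_a = baseline
sample_b = baseline.copy()
up = rng.random(n_genes) < 0.4                # 40% of genes truly 4-fold up in sample B
sample_b[up] *= 4.0

# median normalization: rescale B so its median matches A's
sample_b_norm = sample_b * (np.median(sample_a) / np.median(sample_b))

lfc = np.log2(sample_b_norm / sample_a)
print("median log2 fold change, truly UP genes:       ",
      round(np.median(lfc[up]), 2), "(truth: 2.0)")
print("median log2 fold change, truly UNCHANGED genes:",
      round(np.median(lfc[~up]), 2), "(truth: 0.0)")
# The true 4-fold induction is shrunk and the unchanged genes now look
# down-regulated: part of the real variation has been normalized away.
```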
In an effort to fix this problem, Roca et al. developed two new methods, median condition-decomposition normalization and standard-vector condition-decomposition normalization. The authors tested their new methods along with two standard approaches (mean and quantile normalization) on a dataset generated from Enchytraeus crypticus, an annelid worm used in standard ecotoxicology tests, exposed to 51 experimental treatments. The results showed that, in contrast to the traditional approaches, the novel normalization methods detected much greater variation between conditions in the dataset.

The image, modified from Figure 1 in Roca et al. 2015, shows the interquartile ranges of expression levels for three replicates of each of the experimental treatments, represented by different colors. Black lines represent the medians. Each plot represents a different normalization method, with the methods developed by Roca et al. on the right.


Failing to account for batch effects
In a recent PNAS paper, Lin et al. reported that gene expression data from humans and mice clustered according to species, not tissue, a finding that contradicted much previous research and evidence that major developmental pathways are conserved across mammals. Gilad and Mizrahi-Man* reanalyzed the mouse-human gene expression data in question and found that the unexpected results could be explained by a flawed experimental design. Using information from the raw RNA-Seq reads, Gilad and Mizrahi-Man inferred which samples belonged to which sequencing runs (see the figure below); when these batch effects were accounted for, the corrected gene expression data tended to cluster by tissue, not by species.

Image taken from Figure 1 of Gilad and Mizrahi-Man 2015
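The general principle is easy to demonstrate with a toy simulation (made-up effect sizes and a crude per-run mean-centering, not the model actually used in the reanalysis): a run-specific technical offset that is confounded with species can swamp a conserved tissue signal, and removing it lets the tissue structure reappear.

```python
# Toy batch-effect demo: each species sequenced in its own run, with a run
# offset larger than the shared tissue signal. Before correction, each sample's
# closest match is its run-mate; after centering each run, it is the same
# tissue in the other species. (Hypothetical numbers, not the real data.)
import numpy as np

rng = np.random.default_rng(3)
n_genes = 2000
tissues = ["brain", "liver", "heart"] * 2
species = ["human"] * 3 + ["mouse"] * 3               # run 1 = human, run 2 = mouse
labels = [f"{s[:2]}-{t}" for t, s in zip(tissues, species)]

tissue_fx = {t: rng.normal(0, 1.5, n_genes) for t in set(tissues)}  # conserved tissue signal
batch_fx = {s: rng.normal(0, 3.0, n_genes) for s in set(species)}   # technical run offset
expr = np.column_stack([tissue_fx[t] + batch_fx[s] + rng.normal(0, 0.5, n_genes)
                        for t, s in zip(tissues, species)])

def nearest(mat):
    """For each sample (column), the label of its most-correlated other sample."""
    r = np.corrcoef(mat.T)
    np.fill_diagonal(r, -np.inf)
    return {labels[i]: labels[j] for i, j in enumerate(r.argmax(axis=1))}

print("before:", nearest(expr))                       # samples pair with their run-mates
for s in set(species):                                # crude correction: center each run
    cols = [i for i, x in enumerate(species) if x == s]
    expr[:, cols] -= expr[:, cols].mean(axis=1, keepdims=True)
print("after: ", nearest(expr))                       # samples now pair by tissue
```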


*Because Gilad and Mizrahi-Man submitted their manuscript to F1000Research, an "open science publishing platform for life scientists, offering immediate publication and transparent refereeing, avoiding editorial bias and ensuring the inclusion of all source data," you can read not only the manuscript but also the peer reviews and comments from other scientists regarding the findings of Lin et al. and Gilad and Mizrahi-Man.
References
David M. Rocke, Luyao Ruan, Yilun Zhang, J. Jared Gossett, Blythe Durbin-Johnson, Sharon Aviran (2015) Excess false positive rates in methods for differential gene expression analysis using RNA-Seq data. bioRxiv DOI: 10.1101/020784
Carlos P. Roca, Susana I. L. Gomes, Mónica J. B. Amorim, Janeck J. Scott-Fordsmand (2015) A novel normalization approach unveils blind spots in gene expression profiling. bioRxiv DOI: 10.1101/021212
Yoav Gilad, Orna Mizrahi-Man (2015) A reanalysis of mouse ENCODE comparative gene expression data. F1000Research DOI: 10.12688/f1000research.6536.1
Lin, S., Lin, Y., Nery, J. R., Urich, M. A., Breschi, A., Davis, C. A., … & Snyder, M. P. (2014) Comparison of the transcriptional landscapes between human and mouse tissues. Proceedings of the National Academy of Sciences 111, 17224-17229. DOI: 10.1073/pnas.1413624111
