The field of evolutionary biology changed drastically with the advent of next-generation sequencing technologies. One thing that has stayed the same, however, is the importance of a well-planned experimental design, which ensures the data we collect have the power to answer our questions of interest. We also must still consider budgetary constraints, since few (if any) of us have unlimited research funding. With Sanger sequencing, we worried about the number of individuals vs. the number of loci to sequence (Felsenstein 2006; Carling and Brumfield 2007), and that question is much the same today: what is the optimal combination of number of samples/replicates/individuals vs. sequencing depth per sample for next-generation sequencing?
From a transcriptomic perspective, Liu et al. (2014) set out to answer the question of samples vs sequences by explicitly testing the trade-off between increasing biological replicates versus increasing sequencing depth in an effort to detect differentially expressed genes in RNAseq data.
Liu et al. extracted RNA from MCF7 human breast cancer cell lines cultured in the presence of 17β-estradiol and under control conditions. They constructed libraries for 7 biological replicates of each treatment using the Illumina TruSeq RNA kit and collected at least 30 million (M) 50-base-pair reads per replicate sample on an Illumina HiSeq2000. Next, the authors randomly subsampled the RNAseq reads to create datasets of 2.5, 5, 10, 15, 20, 25, and 30M reads per library and used two programs, edgeR and DESeq, to test for differential gene expression between the control and experimental treatments. The results for the subsampled datasets were compared to the results obtained by analyzing all 7 biological replicates and all 30M reads for the hormone-regulated treatments. The findings are summarized below.
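The random subsampling step can be sketched in a few lines of Python. This is a toy illustration, not the authors' pipeline: the function name `subsample_reads`, the in-memory list of read IDs, and the scaled-down depths (thousands of reads rather than millions) are all assumptions made for the example.

```python
import random

def subsample_reads(reads, n_reads, seed=0):
    """Draw n_reads reads at random, without replacement, keeping file order."""
    if n_reads > len(reads):
        raise ValueError("cannot subsample more reads than the library contains")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    keep = sorted(rng.sample(range(len(reads)), n_reads))
    return [reads[i] for i in keep]

# Hypothetical toy library: 3000 read IDs standing in for a 30M-read library.
library = [f"read_{i}" for i in range(3000)]

# Depths mirror the 2.5, 5, 10, 15, 20, 25, and 30M series, scaled down 10,000-fold.
depths = [250, 500, 1000, 1500, 2000, 2500, 3000]
subsets = {depth: subsample_reads(library, depth, seed=depth) for depth in depths}

for depth in depths:
    print(depth, len(subsets[depth]))
```

Sampling without replacement means each subset is a true reduced-depth version of the same library, so differences between subsets reflect depth alone rather than new biological material.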
Differential gene expression
The number of differentially expressed (DE) genes increased both with the number of replicates and with the number of reads per sample. However, after 10M reads per sample, the gain in newly discovered DE genes diminished with increasing sequencing depth.
For example, at a sequencing depth of 10M reads with two biological replicates (20M combined reads), the average number of DE genes identified was 2011. With 15M reads and two biological replicates (30M combined reads), the average was 2139, a 6% increase for a 50% increase in reads. If the additional 10M reads were instead spent on another biological replicate (three biological replicates at 10M reads each, again 30M combined reads), the average was 2709 DE genes, a 35% increase.
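The percentages in this comparison are easy to check. The short sketch below uses only the average DE gene counts quoted above; the variable and function names are ours.

```python
def percent_increase(baseline, new):
    """Relative change from baseline to new, expressed in percent."""
    return 100.0 * (new - baseline) / baseline

# Average numbers of DE genes quoted above from Liu et al. (2014).
de_two_reps_10m = 2011    # 2 replicates x 10M reads = 20M reads total
de_two_reps_15m = 2139    # 2 replicates x 15M reads = 30M reads total
de_three_reps_10m = 2709  # 3 replicates x 10M reads = 30M reads total

read_increase = percent_increase(20, 30)
depth_gain = percent_increase(de_two_reps_10m, de_two_reps_15m)
replicate_gain = percent_increase(de_two_reps_10m, de_three_reps_10m)

print(f"extra reads: {read_increase:.0f}%")
print(f"gain from added depth: {depth_gain:.1f}%")
print(f"gain from added replicate: {replicate_gain:.1f}%")
```

Both options cost the same 10M extra reads, which is what makes the comparison a fair one.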
One potential concern is that with a reduction in sequencing depth, you will lose information from lowly expressed genes, since many programs remove genes with fewer than 5 reads from the analysis. However, Liu et al. found that for libraries with at least 10M reads, reducing sequencing depth had only a small effect on the number of genes removed.
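A low-count filter of this kind might look like the sketch below. The details are assumptions: here the 5-read threshold is applied to a gene's total count summed across samples, whereas real programs differ in exactly how they apply such a cutoff, and the gene names and counts are invented for illustration.

```python
def filter_low_counts(counts, min_reads=5):
    """Split genes into kept and removed based on total reads across samples.

    counts maps a gene name to a list of per-sample read counts.
    """
    kept = {gene: c for gene, c in counts.items() if sum(c) >= min_reads}
    removed = sorted(set(counts) - set(kept))
    return kept, removed

# Hypothetical read counts for four genes in four samples.
counts = {
    "geneA": [120, 98, 143, 110],  # highly expressed
    "geneB": [0, 1, 2, 0],         # total of 3 reads, below the threshold
    "geneC": [3, 2, 0, 1],         # total of 6 reads, just above the threshold
    "geneD": [0, 0, 0, 0],         # never observed
}

kept, removed = filter_low_counts(counts)
print("kept:", sorted(kept))
print("removed:", removed)
```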
Log fold change
Liu et al. estimated the accuracy of individual gene log fold change (logFC) under different levels of biological replication and sequencing depth by calculating the logFC coefficient of variation for the top 100 most differentially expressed genes. As with the previous results, 10M reads seems to be the sweet spot: increasing reads above 10M had little effect on the coefficient of variation when biological replication was high. In fact, “high replication levels gave accuracies that are probably not practically achievable by adding sequencing depth at low replication levels.”
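For readers less familiar with the coefficient of variation, it is the standard deviation divided by the mean, so a smaller value means more consistent estimates. The sketch below is a minimal illustration with invented logFC values for a single gene. It is not the authors' exact procedure, and taking the absolute value of the mean (so that down-regulated genes give a positive result) is our assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute value of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)

# Hypothetical logFC estimates for one gene across repeated subsampled analyses.
logfc_low_replication = [1.2, 2.1, 0.9, 1.8, 1.5]
logfc_high_replication = [1.45, 1.55, 1.5, 1.52, 1.48]

cv_low = coefficient_of_variation(logfc_low_replication)
cv_high = coefficient_of_variation(logfc_high_replication)

print(f"CV at low replication:  {cv_low:.3f}")
print(f"CV at high replication: {cv_high:.3f}")
```

Both sets of estimates center on the same logFC, but the tighter spread in the second set gives a much smaller coefficient of variation, which is the sense in which accuracy improves.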
Liu et al. also calculated the coefficient of variation of the log counts per million (logCPM) for groups of genes with low, medium, and high expression levels and plotted it against the sequencing depth and replication level. As might be expected, for highly expressed genes, accuracy of expression level was already high; adding replicates increased accuracy while adding reads had little effect. For genes with low expression, the coefficient of variation was much larger and accuracy improved with the addition of either replicates or reads. For genes in the middle group, adding reads reduced the coefficient of variation slightly and adding replicates reduced it substantially.
These results indicate that biological replicates improve the accuracy in estimating expression level for all genes, regardless of expression level, whereas adding sequencing depth will improve estimation accuracy mostly for low expression genes.
Take home message
Although the results may vary depending on the organism or experimental treatment of interest, and some questions still require deep sequencing (e.g. differential expression of exons, transcript-specific expression), according to the results of Liu et al., adding biological replicates is generally more powerful than adding sequencing depth for transcriptomic studies.
Liu, Y., Zhou, J., & White, K. P. (2014). RNA-seq differential expression studies: more sequence or more replication? Bioinformatics, 30(3), 301-304. DOI: 10.1093/bioinformatics/btt688
Felsenstein, J. (2006). Accuracy of coalescent likelihood estimates: do we need more sites, more sequences, or more loci? Molecular Biology and Evolution, 23(3), 691-700. DOI: 10.1093/molbev/msj079
Carling, M. D., & Brumfield, R. T. (2007). Gene sampling strategies for multi-locus population estimates of genetic diversity (θ). PLoS One, 2(1), e160. DOI: 10.1371/journal.pone.0000160