Many population genomics studies use methods that provide a reduced representation of the genome, for example RADseq or UCEs. Targeting a subset of the genome reduces the cost of sequencing and makes data analyses less computationally intensive. In their recent Molecular Ecology paper, De Wit, Pespeni, and Palumbi (2015) suggest using expressed gene sequences (i.e. transcriptomic data) as a reduced representation method for population genomic studies. The review walks through the strengths (greater accuracy of functional annotation) and weaknesses (potential biases such as allele-specific expression) of this approach.
This review summarizes current methods, identifies ways that using expressed sequence data benefits population genomic inference, and explores how current practitioners evaluate and overcome challenges that are commonly encountered. We focus particularly on the additional power of functional analysis provided by expressed sequence data and how these analyses push beyond allele pattern data available from non-function genomic approaches.
The review contains sections on 1) assembly quality, 2) marker development and genotyping after assembling the transcriptome, and 3) applications of expressed sequence datasets. Based on a summary of the current literature, the authors also provide a “best-practice” pipeline that starts with experimental design and sequencing platform choice and moves to raw data quality control, transcriptome assembly and evaluation, ending with a reference transcriptome ready for SNP discovery. Below I will highlight a few points I found to be interesting or useful from each section and conclude with a few new questions the review raised in my mind.
Poorly assembled transcriptomes can create false SNPs, where nucleotide differences between paralogous sequences are mistaken for polymorphisms, and can omit real SNPs when allelic differences are treated as belonging to two separate genes instead of one. These errors can be reduced using tBLASTn searches, which compare protein sequences from a high-quality database of a closely related species against your translated contigs. Incorrectly collapsed contigs are revealed when multiple orthologous proteins from the reference match a single contig. In the absence of a well-annotated reference, the authors suggest a conservative approach that keeps allelic variants as separate “genes.” This will reduce the number of false positives, albeit at the expense of potentially missing real SNPs.
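As a rough illustration of the screening step described above, the sketch below flags contigs that are matched by more than one reference protein. It assumes the search results have already been reduced to (protein, contig, e-value) records; the function name, the e-value threshold, and the toy data are all hypothetical, and the reference set is assumed to hold one protein per gene (isoforms of the same gene would produce spurious flags).

```python
from collections import defaultdict

def flag_collapsed_contigs(hits, max_evalue=1e-10):
    """Return {contig: [proteins]} for contigs hit by more than one reference protein.

    hits: iterable of (protein_id, contig_id, evalue) tuples (hypothetical format).
    max_evalue: illustrative significance threshold, not a recommended value.
    """
    proteins_by_contig = defaultdict(set)
    for protein_id, contig_id, evalue in hits:
        if evalue <= max_evalue:
            proteins_by_contig[contig_id].add(protein_id)
    # Keep only contigs matched by two or more distinct proteins.
    return {
        contig: sorted(proteins)
        for contig, proteins in proteins_by_contig.items()
        if len(proteins) > 1
    }

# Toy data: contig1 is hit by two proteins; contig3 has only one significant hit.
hits = [
    ("protA", "contig1", 1e-50),
    ("protB", "contig1", 1e-40),
    ("protC", "contig2", 1e-60),
    ("protD", "contig3", 1e-3),
    ("protE", "contig3", 1e-30),
]
suspects = flag_collapsed_contigs(hits)
print(suspects)
```

A flagged contig is only a candidate for closer inspection; deciding whether it really merges paralogs still requires looking at the alignments.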
Normalizing libraries to correct for extremely highly expressed genes reduces the number of sequencing errors. Normalization can be done bioinformatically after sequencing with programs such as khmer. However, the effect of digital normalization on qualitative measurements of assembly (such as the percentage of conserved orthologs identified) has not yet been tested. Normalizing the libraries during library preparation will also reduce the abundance of highly expressed transcripts, thereby increasing the representation of lowly expressed and potentially interesting transcripts. However, normalization at this stage prevents the collection of quantitative information about transcript abundance, making expression analyses impossible.
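To make the idea of digital normalization concrete, here is a minimal sketch of median k-mer filtering: a read is kept only if the median count of its k-mers among the reads kept so far is below a cutoff. This is a toy re-implementation of the general idea, not the khmer API; the function names, the tiny k, and the cutoff are assumptions chosen so the example is easy to follow.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Five copies of a "highly expressed" read and one copy of a rare read.
reads = ["ACGTTGCAAT"] * 5 + ["GGCTAAGCTC"]
kept = digital_normalize(reads, k=4, cutoff=2)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy input the abundant read is capped at two copies while the rare read is retained, which is why read counts after normalization no longer reflect expression level.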
Marker development and genotyping
Population genomics studies often use pooled population-wide data to directly estimate allele frequencies. Ideally, both alleles at a given locus in a heterozygous individual would be equally represented in the data, as would each individual in a pooled sample. However, data are often skewed for technical or biological reasons, such as PCR artifacts or allele-specific expression (when one allele is more highly expressed than the other), respectively. There is currently an active debate about how these factors could mislead genotype estimates. De Wit et al. advocate estimating allele-specific expression before analyzing pooled data and outline several approaches to do so.
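One simple way to screen for allelic imbalance, offered here as an illustration and not necessarily one of the approaches the review outlines, is an exact binomial test of the reference and alternate read counts at a heterozygous site against the 50:50 expectation. The function name and the read counts below are hypothetical, and the test ignores overdispersion and reference mapping bias, both of which a real analysis would need to consider.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for ref/alt read counts at one site.

    Sums the probabilities of all outcomes no more likely than the observed one.
    Intended for modest read depths; very deep sites would need a numerically
    safer implementation.
    """
    n = ref_reads + alt_reads
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[ref_reads]
    # Small tolerance so outcomes tied with the observed one are included.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)

balanced = binomial_two_sided_p(10, 10)  # equal representation of both alleles
skewed = binomial_two_sided_p(18, 2)     # one allele dominates the reads
print(balanced, skewed)
```

A low p-value at a site known to be heterozygous suggests that read counts there are a poor proxy for allele frequency, whatever the underlying cause.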
Population genomic studies often aim to find outlier loci under selection. Selection may act on many loci within regulatory or metabolic networks in such a way that individual loci fail to meet the stringent significance thresholds of outlier tests. However, De Wit et al. point out that by testing whether loci with high Fst are non-randomly clustered into functional categories, it is possible to infer the role of selection despite the absence of individually significant loci (for examples, see Pespeni et al. 2013 and De Wit et al. 2014).
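The logic of testing for non-random clustering can be sketched with a one-sided hypergeometric test: given the number of loci in a functional category and the number of high-Fst loci overall, how surprising is the observed overlap? This is only one possible formulation and is not claimed to be the procedure used in the cited studies; the function name and all counts are made up for illustration, and the test treats loci as independent, which linked SNPs within a gene are not.

```python
from math import comb

def enrichment_p(total_loci, category_loci, high_fst_loci, overlap):
    """P(X >= overlap) when high_fst_loci are drawn at random from total_loci,
    of which category_loci belong to the functional category of interest."""
    max_overlap = min(category_loci, high_fst_loci)
    denom = comb(total_loci, high_fst_loci)
    tail = sum(
        comb(category_loci, x) * comb(total_loci - category_loci, high_fst_loci - x)
        for x in range(overlap, max_overlap + 1)
    )
    return tail / denom

# Hypothetical counts: 1000 loci, 50 in the category, 100 with high Fst,
# 15 of which fall in the category (about 5 would be expected by chance).
p_value = enrichment_p(1000, 50, 100, 15)
print(p_value)
```

Because of the independence assumption, a permutation scheme that shuffles Fst values among genes instead of among SNPs may be a safer choice in practice.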
A traditional use of transcriptomic data is to examine differences in gene expression among populations, which often play a role in population divergence and local adaptation. Combining expression data (determined from read counts) with SNP genotypes (determined from read sequences) allows for the examination of the functional role of SNPs in gene expression.
As a scientist whose past and current research has focused on understanding the mechanisms responsible for allele frequency and gene expression differences among populations, I find it extremely appealing to envision generating a single dataset that can be used to address both sets of questions. In their best practices pipeline, De Wit et al. suggest using sequence data collected across many different developmental phases, tissue types, sexes, and physiological stages for transcriptome assembly. They also suggest collecting 30-100 million reads to balance the coverage needed for gene discovery and downstream gene expression analyses against the accumulation of sequencing errors.
Determining how many and which individuals to sequence, and at what depth of coverage, is a decision common to all NGS projects, and here De Wit et al. outline the gold standard experimental design for projects using transcriptomic data for population genomics. Ideally, all our experiments would meet this gold standard, but resources are often limiting, and this review made me wonder: can a single experimental design provide the data required to test hypotheses about population genomics AND differential gene expression? Or does one question have to take a back seat? How far can we deviate from the best practices outlined here before we lose confidence in our results? Furthermore, for which type of question (gene expression vs. allele frequency differences) would we lose power first?
De Wit et al. clearly demonstrate that determining SNP genotypes and differential gene expression simultaneously from transcriptomic data holds much promise for evolutionary ecology, and I am excited to see how this approach evolves over time.
De Wit, P., Pespeni, M. H., & Palumbi, S. R. (2015). SNP genotyping and population genomics from expressed sequences – current advances and future possibilities. Molecular Ecology. DOI: 10.1111/mec.13165