Environmental association analyses: a practical guide for a practical guide

"Do you have the kind with latent variables? I think it is a yellow bottle?"

Obtaining extensive SNP data for your organism of choice isn’t such a feat these days, but actually matching that breadth of data with appropriate analyses is still a challenge. Fortunately, there has been an avalanche of new methods that make the connections between genetic variation and environment clearer. Unfortunately, that same surge in methodologies makes decision-making tough. What are the drawbacks of different methods? What works for my data? Why didn’t I think of this before I generated all these SNPs?
If only there were some sort of... practical guide.

That’s the goal of a new article by Christian Rellstab and colleagues that was recently accepted at Molecular Ecology. They offer a thorough and thoughtful guide to environmental association analysis (EAA) that makes a good starting point for anyone planning a project or just looking to analyze their data.
Box 1 from Rellstab et al. (2015) shows the relationships between sources of variation (boxes), the cause and effect of evolutionary processes (black arrows), and selection (gray arrows).

The basics
Data – You need some sort of genetic polymorphism data. While multiple varieties are acceptable to use in EAA, the review focuses on SNPs. You also need environmental data that matches the spatial resolution of the genetic data.
The choice of data and its subsequent preparation involves a lot of thought. What resolution am I using? Are these variables correlated? How many loci/samples are appropriate? Rellstab et al. do a nice job of reviewing these issues and providing citations for multiple solutions and approaches.
Sampling design – If you want to ask questions about environmental variation, one of the go-to sampling schemes in biology is finding a gradient to sample along. While this seems like second nature to most, sampling along a gradient also makes replication (within environmental variables and evolutionary lineages) a difficult task. Also, gradients are usually designed around one variable (temperature, salinity, etc.), but the effects of environment on genetic data may be due to co-varying environmental parameters that aren’t accounted for. Other sampling schemes, like sampling more intensely in areas that are categorically different from one another (low vs. high temperature sites) or sampling broadly across the entire range or niche of a taxon, come with their own disadvantages. Essentially, environmental variables and genetic data often co-vary in complex ways that make it difficult to suggest a “best” study design. Instead, keep these disadvantages in mind while maximizing the fit between your genetic/environmental data and the hypothesis you really want to test.
Environmental Association Analysis
To the main point: what are my options for EAA?
For categorical factors – This is pretty straightforward and the most traditional approach: compare allele frequencies between different categories of an environmental variable, in replicate. This may be the best approach if you have a very specific environmental variable, a strong expectation of what sort of statistical spread of that variable is required, and the ability to reliably replicate those conditions. For most, this is rare.
Logistic regressions – This family of analyses quantifies the relationship between an environmental factor and the presence/absence of an allele.
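To make the categorical comparison concrete, here is a minimal sketch (not something from the review) of a chi-square test on allele counts from two environmental categories. The counts and the function name `allele_chi_square` are hypothetical, and the sketch pools a single pair of categories, so it ignores the replication across sites that the approach really calls for.

```python
import math

def allele_chi_square(counts_a, counts_b):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    counts_a, counts_b: (reference_allele_count, alternate_allele_count)
    for two environmental categories. Assumes every row and column total
    is non-zero, otherwise an expected count of zero breaks the division.
    """
    table = [list(counts_a), list(counts_b)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical allele counts at one SNP, pooled within each category.
cold_sites = (70, 30)
warm_sites = (40, 60)
chi2, p_value = allele_chi_square(cold_sites, warm_sites)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

With real data you would run this per SNP, correct for multiple testing, and use a test that respects replicate pairs of sites rather than pooling them.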

Sampling individuals from diverse habitats or along environmental gradients is ideally suited for this type of analysis.

Software applications of logistic regressions for EAA include the spatial analysis method (SAM) and the recent extended version SAMβADA.
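For a feel of what these programs are doing under the hood, here is a rough single-locus sketch: a logistic regression of allele presence/absence on one environmental variable, fit by Newton-Raphson. This is not the SAM or SAMβADA implementation; the function names and the data are made up for illustration, and the fit assumes the data are not perfectly separable.

```python
import math

def fit_logistic(env, present, n_iter=25):
    """Fit P(allele present) = 1 / (1 + exp(-(b0 + b1 * env))).

    env: environmental value per individual (ideally centered).
    present: 1 if the individual carries the allele, else 0.
    Returns the intercept b0 and slope b1 after n_iter Newton steps.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (2x2, symmetric)
        for x, y in zip(env, present):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def predict(b0, b1, x):
    """Predicted probability of carrying the allele at environment x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical data: centered temperature and allele presence for 12 individuals.
temperature = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
has_allele = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(temperature, has_allele)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

A positive slope means the allele becomes more common toward the warm end of the gradient. Note that this sketch does nothing about neutral population structure, which is exactly the weakness the mixed-model approaches below try to address.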
Matrix correlations – These analyses test for correlations between matrices of distance or dissimilarity. This includes full and partial Mantel tests, a subject we’ve written about previously on this blog.
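If you have never looked inside a Mantel test, the simple (non-partial) version is short enough to sketch: correlate the upper triangles of two distance matrices, then permute the rows and columns of one matrix to build a null distribution. Everything here (function names, site values, the 999 permutations) is an illustrative assumption rather than anything prescribed by the review.

```python
import math
import random

def upper_triangle(matrix, order):
    """Flatten the upper triangle of a distance matrix, with rows and
    columns reordered according to `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """One-sided simple Mantel test: observed r and permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    b_flat = upper_triangle(dist_b, identity)
    r_obs = pearson(upper_triangle(dist_a, identity), b_flat)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of dist_a together
        if pearson(upper_triangle(dist_a, perm), b_flat) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical data for six sites: one environmental value and one
# genetic summary value each, turned into pairwise distance matrices.
env_values = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
gen_values = [0.1, 0.25, 0.35, 0.8, 1.0, 1.7]
env_dist = [[abs(a - b) for b in env_values] for a in env_values]
gen_dist = [[abs(a - b) for b in gen_values] for a in gen_values]

r_obs, p_value = mantel_test(env_dist, gen_dist)
print(f"Mantel r = {r_obs:.3f}, p = {p_value:.3f}")
```

Permuting whole rows and columns, instead of shuffling individual distances, keeps the non-independence among pairwise distances intact under the null. It does not fix the other known problems with Mantel tests, so treat this as a teaching aid.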
General linear models – These models treat the environmental response variables as a linear function of genetic variables. If that seems backwards to you, you aren’t alone:

In EAA, however, environment instead of phenotype is used as response variable. Since the environment experienced by an organism is not caused by its genotype, this might seem conceptually counterintuitive. It is assumed, however, that environmental factors that are strongly correlated with heritable traits can replace them in statistical models.

For multivariate versions of these analyses, canonical correlation analysis (CCA) or redundancy analysis (RDA) offer the ability to account for polygenic adaptive traits. In the case of RDA, model testing among variations of environmental parameters, neutral genetic structure, or spatial effects is available.
Mixed effects models – Finally, there is the extensive set of analyses that model allele frequencies (response variable), environmental parameters (fixed factors), and neutral genetic structure (random factor). The advantage here is a standardized and statistically intuitive way to deal with neutral genetic structure. However, each program/technique has a different way of testing for significance and of choosing the type of association (linear, rank, logistic). Some options here include BAYENV2, latent factor mixed models (LFMMs), efficient mixed-model association (EMMA), and TASSEL.
Your best bet: mix and match
When there are so many methods available, there are bound to be contrasting strengths and weaknesses among them. The authors provide suggestions for combinations of EAAs that might help in various scenarios. For example, want to pull apart potentially adaptive loci? How about first performing an outlier test (BAYESCAN, FDIST, ARLEQUIN) and then feeding those outliers into an EAA?
If you have the appropriate sampling scheme and data, leaning on analyses that explicitly account for neutral genetic variation (mixed effects models) provides the most straightforward solution. This is the most in-depth section of the review, and for good reason.
Check yourself
The identification of adaptive loci and the environmental variation that shapes them requires validation. Lots of it. Whether it comes from within a dataset, between analyses, or from other researchers, validation is key for the generalization of these phenomena.
The good news is that there are now, more than ever, analytical resources for finding environmental associations to start with. But if your Holy Grail is finding the gene(s) that determine the adaptation of wild organisms to their environments, you have a whole career’s worth of work in front of you:

Many studies, including most of those described in this review, perform EAA, present a list of interesting loci, compare it to GO databases and stop there, i.e., half way to the goal of identifying those genes that are functionally involved in local adaptation of natural populations. Instead, studies should go further and test their findings using, e.g., independent populations, knock-out mutants, common garden and reciprocal transplant experiments. The effort of such follow-ups should, however, not be underestimated.

The most practical advice? Read the paper and get to work!
Rellstab, C., Gugerli, F., Eckert, A. J., Hancock, A. M., & Holderegger, R. (2015). A practical guide to environmental association analysis in landscape genomics. Molecular Ecology. DOI: 10.1111/mec.13322
