Steelhead in a random forest: identifying the genetic basis of migration

Genome-wide association studies (GWAS) have been quite successful in identifying variants associated with a range of phenotypes (though there is some debate surrounding this statement; for an interesting, if dated, discussion look here). While most of this work was originally conducted on model organisms, the methods have more recently been applied to natural populations with promising results.
Conservation genetics, in particular, is a field that stands to benefit from association studies. For example, imagine being able to select for Tasmanian devils that are resistant to the cancer tearing through the population. In their recent paper, Hess et al. use association testing to identify the genetic basis of migration timing in steelhead trout, a salmonid of conservation concern.
Steelhead are anadromous salmonids: individuals are born in fresh water, spend 1-4 years maturing in the ocean, then migrate back to fresh water to spawn. What is interesting about the system is that there are distinct summer and winter runs back to fresh water (see figure below). Fish that migrate during the summer face easier passage upstream due to higher water flow but risk high mortality while waiting for spring; winter-run fish experience the opposite trade-off. However, individuals within a drainage are more genetically similar to each other than to individuals from other drainages, indicating that the separate runs interbreed.

Migration timing in steelhead. From Hess et al., 2016.

Hess et al. sampled 237 fish from summer, winter, and intermediate runs over three years and genotyped all individuals using RAD-seq. They combined a standard GWAS approach with a much less standard multivariate random forest machine learning algorithm. I was (and still am) only vaguely familiar with random forest approaches, though machine learning is all the rage these days. I’ll try to give a very brief overview.
In short, a random forest uses subsets of the SNP data to build predictive models for the migration trait. By building many different models, it is possible to assess the relative contribution of each SNP to the models’ accuracy. If you build a model from only a portion of the data and then use it to predict the phenotypes of the remaining data, you can estimate the proportion of trait variance explained by the predictive model. It is then possible to sequentially remove the least important SNPs, keeping the set at which the phenotypic variance explained is at its maximum. The SNPs that remain are those most important for the phenotype.
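To make that overview a bit more concrete, here is a minimal sketch of the idea in plain Python. It is not the authors' pipeline: the data are simulated (150 hypothetical fish, 15 SNPs, of which SNPs 0, 1 and 2 are made causal), every function name is my own, importance is measured as the total error reduction credited to each SNP, and SNPs are dropped one at a time. All of those are assumptions chosen to keep the example short.

```python
import random

def best_split(rows, X, y, snps):
    """Best single split of `rows` over candidate SNPs: (gain, snp, threshold) or None."""
    n, total = len(rows), sum(y[i] for i in rows)
    best = None
    for j in snps:
        for thr in (0.5, 1.5):  # genotypes are coded 0, 1, 2
            left = [y[i] for i in rows if X[i][j] < thr]
            n_left, n_right = len(left), n - len(left)
            if n_left < 3 or n_right < 3:  # keep at least 3 samples per leaf
                continue
            s_left = sum(left)
            s_right = total - s_left
            # Reduction in the sum of squared errors achieved by this split
            gain = s_left**2 / n_left + s_right**2 / n_right - total**2 / n
            if best is None or gain > best[0]:
                best = (gain, j, thr)
    return best

def grow(rows, X, y, snps, mtry, depth, importance):
    """Grow a regression tree; leaves are floats, internal nodes are tuples."""
    mean = sum(y[i] for i in rows) / len(rows)
    if depth == 0 or len(rows) < 6:
        return mean
    # Each node only sees a random subset of `mtry` SNPs
    split = best_split(rows, X, y, random.sample(snps, mtry))
    if split is None:
        return mean
    gain, j, thr = split
    importance[j] += gain  # credit the SNP with the error it removed
    left = [i for i in rows if X[i][j] < thr]
    right = [i for i in rows if X[i][j] >= thr]
    return (j, thr,
            grow(left, X, y, snps, mtry, depth - 1, importance),
            grow(right, X, y, snps, mtry, depth - 1, importance))

def predict(tree, x):
    while isinstance(tree, tuple):
        j, thr, low, high = tree
        tree = low if x[j] < thr else high
    return tree

def forest(X, y, snps, n_trees=50, depth=4):
    """Return (out-of-bag variance explained, importance per SNP)."""
    n = len(y)
    mtry = max(1, len(snps) // 3)
    importance = {j: 0.0 for j in snps}
    oob_sum, oob_cnt = [0.0] * n, [0] * n
    for _ in range(n_trees):
        boot = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow(boot, X, y, snps, mtry, depth, importance)
        for i in set(range(n)) - set(boot):  # individuals this tree never saw
            oob_sum[i] += predict(tree, X[i])
            oob_cnt[i] += 1
    seen = [i for i in range(n) if oob_cnt[i]]
    y_bar = sum(y[i] for i in seen) / len(seen)
    sse = sum((y[i] - oob_sum[i] / oob_cnt[i]) ** 2 for i in seen)
    sst = sum((y[i] - y_bar) ** 2 for i in seen)
    return 1 - sse / sst, importance

def backward_eliminate(X, y, n_snps):
    """Drop the least important SNP each round; keep the best-scoring set."""
    snps = list(range(n_snps))
    best_r2, best_set = float("-inf"), list(snps)
    while True:
        r2, importance = forest(X, y, snps)
        if r2 > best_r2:
            best_r2, best_set = r2, list(snps)
        if len(snps) == 1:
            return best_r2, best_set
        snps.remove(min(snps, key=importance.get))

# Simulated data (an assumption for illustration, not the steelhead data)
random.seed(1)
N, P, CAUSAL = 150, 15, (0, 1, 2)
X = [[random.choice((0, 1, 2)) for _ in range(P)] for _ in range(N)]
y = [sum(2.0 * row[j] for j in CAUSAL) + random.gauss(0, 1) for row in X]

best_r2, best_set = backward_eliminate(X, y, P)
print("SNPs kept:", sorted(best_set), "variance explained:", round(best_r2, 2))
```

The "portion of the data" held out for each tree is the set of individuals left out of its bootstrap sample, so the variance explained is always measured on fish the tree never saw. For real data you would reach for an established random forest library rather than a hand-rolled one like this.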
Using these methods, the GWAS identified three SNPs associated with migration timing (explaining 7% of the variation). What is more impressive is that the random forest approach explained 44% of the variation with only 18 SNPs. Using these 18 SNPs, the authors then assessed their ability to distinguish migration phenotypes using Structure and a Mantel test. In short, these top SNPs could generally differentiate between the differently timed runs (see figure below).
Results from Structure and Mantel tests for top 18 SNPs (A, B) and a neutral subset of SNPs (C, D). From Hess et al., 2016.

There are two main things that stand out to me about this study.

  1. The random forest approach seems promising. It captured much more variation than the GWAS alone and really increased the authors’ predictive power.
  2. The conservation implications for this type of study are vast. Using these data, 18 SNPs could help managers differentiate between the runs and prioritize one over the other.

To conclude, this is an interesting study that takes a relatively rare approach to identify SNPs influencing phenotype. I’m looking forward to seeing how other studies begin to utilize this type of machine learning method.
Hess, J.E., Zendt, J.S., Matala, A.R. and Narum, S.R., 2016. Genetic basis of adult migration timing in anadromous steelhead discovered through multivariate association testing. Proc. R. Soc. B, 283(1830), p. 20153064. doi:10.1098/rspb.2015.3064
