The evolution of phylogeography in the next gen era: 20 years in review

Phylogeographers have long known about the limitations of single-locus studies (i.e., the effects of selective sweeps and stochasticity in lineage sorting among loci) and that adding loci improves the accuracy of demographic parameter estimates. As we continue to shift towards collecting multi-locus datasets thanks to high-throughput sequencing, some interesting questions have come up. For example, what is the best ratio of genetic loci to individuals sampled? What is the role of mitochondrial (mtDNA) and chloroplast (cpDNA) loci in the next-gen era? And most broadly, how has the field of phylogeography itself evolved over the last 20 years, a period that includes the advent of high-throughput sequencing?
Garrick et al. (2015) tackled these questions by exploring how phylogeography datasets have changed in the last 20 years. The authors collected empirical papers published in Molecular Ecology from 1992 to 2013 that had the search term “phylogeograp*” in the title, abstract, keywords, or main text, sampling at 3-year intervals. The search resulted in over 1,200 hits. From these papers, the authors recorded the following metrics:

  • total number of independent loci sampled (complete mtDNA or cpDNA genomes were treated as a single haploid locus)
  • total number of alleles sampled (identical alleles contributed to the count; this gave an idea of the number of individuals sampled per study)
  • total length in base pairs of DNA sequences collected
  • total number of SNPs identified
  • number of species surveyed (i.e., were data collected from a single species, or was it a multi-species comparative study?)

The final dataset analyzed by Garrick et al. contained 508 single-species datasets drawn from 370 papers.

Figure and caption from Garrick et al. (2015), Fig. 2. Linear regression of a weighted metric (number of loci × total number of alleles sampled, log-transformed) as a function of time, partitioned by major taxonomic group. (a) vertebrates (N = 272 data sets). (b) invertebrates (N = 153). (c) plants (N = 52). (d) fungi, protists, algae and bacteria combined (i.e. ‘other,’ N = 16).


An increase in the size of phylogeographic datasets was found across most major taxonomic groups (see figure above) in terms of the number of loci and the number of alleles sampled, suggesting researchers are putting more effort into collecting genomic and geographic samples.
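The weighted metric in the figure above (number of loci × total number of alleles sampled, log-transformed) and its regression on publication year can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the base-10 logarithm, the function names, and the toy records are all assumptions made for the example and are not values from Garrick et al.

```python
import math


def weighted_metric(n_loci, n_alleles):
    """Return log10(number of loci x total number of alleles sampled).

    The base of the logarithm is an assumption; the caption only says
    "log-transformed".
    """
    return math.log10(n_loci * n_alleles)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy records (year, loci, alleles); NOT values from the paper.
toy_datasets = [
    (1992, 1, 40),
    (1998, 1, 120),
    (2004, 3, 150),
    (2010, 8, 200),
    (2013, 12, 260),
]

years = [year for year, _, _ in toy_datasets]
metric = [weighted_metric(loci, alleles) for _, loci, alleles in toy_datasets]
slope, intercept = fit_line(years, metric)
print(f"slope = {slope:.3f} log10 units per year")
```

A positive slope on the log scale corresponds to multiplicative growth in sampling effort over time, which is the pattern the figure reports for most taxonomic groups.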
The use of mtDNA and cpDNA loci has declined in the last two decades, but few datasets contained autosomal loci only. As pointed out by Garrick et al., organellar markers are still useful for questions about sex-biased dispersal, directional introgression, and molecular rate estimation, and therefore “are unlikely to become obsolete, but rather will continue to represent an important part of the phylogeography toolbox.”
Using exploratory forecast modeling, Garrick et al. predicted that the number of SNPs per data set is likely to reach ~20,000 by the end of 2016 (95% CI: 16,590–23,133), which represents more than a doubling over the preceding 3-year period (see figure below).
Figure and caption from Garrick et al. (2015). Forward-time projection of the total number of single nucleotide polymorphisms (SNPs) per published phylogeographic data set, through to the end of the year 2016. Forecasts were generated using autoregressive integrated moving average (median values in black, 95% confidence intervals in pale grey), conditioned on survey data spanning 1992–2013, sampled at 3-year intervals. For each year, only the five highest values for the total number of SNPs are shown.
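The forecast above was generated with an autoregressive integrated moving average (ARIMA) model. As a rough sketch of the idea, the Python below uses the simplest member of that family, a random walk with drift (ARIMA(0,1,0) with a constant), applied to log-transformed counts. The model order, the log transform, the function name, and the toy SNP counts are assumptions for illustration; this post does not say which ARIMA specification was fitted, and the interval here ignores uncertainty in the estimated drift.

```python
import math
import statistics


def forecast_random_walk_drift(series, steps=1, z=1.96):
    """Forecast `steps` ahead with a random walk plus drift.

    Returns (point, lower, upper), where the bounds form an approximate
    95% interval when z = 1.96. One step equals one sampling interval
    of the input series.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    drift = statistics.mean(diffs)   # average change per step
    sigma = statistics.stdev(diffs)  # spread of the step-to-step changes
    point = series[-1] + steps * drift
    half_width = z * sigma * math.sqrt(steps)
    return point, point - half_width, point + half_width


# Hypothetical SNP counts for eight survey years at 3-year steps
# (1992, 1995, ..., 2013); NOT data from the paper.
toy_snps = [20, 35, 80, 150, 400, 900, 2500, 8000]
log_series = [math.log10(v) for v in toy_snps]

# One step ahead of 2013 corresponds to 2016.
point, low, high = forecast_random_walk_drift(log_series, steps=1)
print(f"2016 forecast: {10 ** point:,.0f} SNPs "
      f"(approx. 95% interval {10 ** low:,.0f} to {10 ** high:,.0f})")
```

Working on the log scale keeps the forecast positive and treats growth as multiplicative, which suits counts that roughly double or triple between surveys.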


An interesting conclusion from the survey is the authors’ claim about the field of landscape genetics, a topic that my fellow TME contributor Rob Denton wrote about last week (Landscape genetics gets existential). According to Garrick et al.:

…in the era of next-generation sequencing, the perceived distinction between landscape genetics and phylogeography (e.g. Wang 2010) increasingly represents a false dichotomy, as the resulting large DNA sequence data sets should be informative over a broad temporal spectrum. Indeed, the timescales on which inferences can be made are likely to depend more on geographic sampling of individuals than on choices relating to genetic data (Robinson et al. 2014a).

Another interesting finding of Garrick et al.’s analyses is that the number of individuals sampled has increased along with the number of loci being collected. Is this because we are obsessed with the idea that more data are always better? Or because the savings we accumulate as the cost of sequencing goes down are being spent on adding more individuals to the experimental design? I wrote a few weeks ago that adding replicates trumps increasing sequencing depth in testing for differential gene expression, but what is the optimum ratio of loci to individuals sampled now that phylogeographic studies are on pace to collect 20,000 SNPs per dataset? It feels a bit like blasphemy to write this, but perhaps we can afford to scale back the number of individuals we sample per population and instead devote our time and money to collecting from additional geographic locations, or to other projects entirely. Now that Garrick et al. have summarized how far the field has come in the last 20 years, I am excited to see where phylogeography goes next.
Reference
Garrick, R. C., Bonatelli, I. A., Hyseni, C., Morales, A., Pelletier, T. A., Perez, M. F., … & Carstens, B. C. (2015). The evolution of phylogeographic data sets. Molecular Ecology, 24(6), 1164–1171. DOI: 10.1111/mec.13108
 
