Comparing your options for phylogenomic data

MARY COLTER (R) SHOWING BLUEPRINT TO MRS ICKES (WIFE OF SECRETARY OF INTERIOR) CIRCA 1935. NPS.
The choices for current-generation (last generation?) molecular markers are grouped in two primary camps.
First, the “reduced representation” methods: take some DNA, cut it up with specific enzymes, tag those pieces, read the sequences. These methods produce lots and lots of single nucleotide polymorphisms (SNPs) and can be used for just about any taxon your heart desires. The most common acronyms you’ve read are probably RADseq (restriction site associated DNA sequencing) and ddRADseq (double digest restriction site associated DNA sequencing).
Second, the “targeted enrichment” methods: buy some probes that attach to highly conserved areas of DNA and sequence them along with the flanking regions. These methods provide loci that are more likely to be found across divergent taxa, which both expands the scope of phylogenetic questions and reduces missing data. The most common acronyms you’ve read are probably UCE (ultraconserved elements) and exon-cap (exon-capture or anchored hybrid enrichment).
Even though these molecular resources are relatively new, it is incredible how quickly the “word on the street” pigeonholes certain tools for certain questions. For example, maybe you’ve been told that you can’t use RADseq for phylogenetic questions. Maybe you’ve heard that UCEs are only helpful for the deepest of nodes in a phylogeny. I’m not sure how these statements are perpetuated, but a recent pre-print from Rupert Collins and Tomas Hrbek may be a good starting point for those intrepid researchers who are asking themselves what molecular data will take them to the promised land of phylogenetic inference.
The authors downloaded 23 complete primate genomes in order to manufacture reduced representation (RADseq and ddRADseq) and targeted enrichment (UCE and exon-capture) datasets. RADseq and ddRADseq data were collected by simulating where the enzymes used in those protocols would cut pieces of DNA from the whole genomes. In a perfect world, these pieces would be the same as what you would be left with if the entire RAD protocol were conducted at your lab bench. UCE and exon-capture data were collected by converting those whole genomes into separate BLAST databases so the authors could search for the specific probes associated with either approach. In addition to these four methods, Collins and Hrbek obtained two additional datasets for some of the primate species, one based on Sanger-sequenced exons and one based on mitochondrial DNA.
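To make the in silico digest idea concrete, here is a minimal Python sketch. It is not the authors' pipeline: the enzyme pair (EcoRI and MspI), the size window, and the toy sequence are assumptions chosen for illustration, and the sketch pretends each enzyme cuts at the start of its recognition site.

```python
import re

# Recognition sequences for two commonly used enzymes (assumed for this sketch;
# not necessarily the enzymes Collins and Hrbek simulated).
ECORI = "GAATTC"
MSPI = "CCGG"

def cut_positions(seq, site):
    """Return the 0-based start of every occurrence of `site` in `seq`."""
    # The lookahead lets overlapping occurrences be found as well.
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]

def ddrad_fragments(seq, site_a, site_b, min_len, max_len):
    """Return fragments flanked by one site of each enzyme and inside the size window."""
    cuts = sorted(
        [(pos, "A") for pos in cut_positions(seq, site_a)]
        + [(pos, "B") for pos in cut_positions(seq, site_b)]
    )
    fragments = []
    for (start, enz1), (end, enz2) in zip(cuts, cuts[1:]):
        # Keep only neighbouring cut sites made by different enzymes (the "double digest").
        if enz1 != enz2 and min_len <= end - start <= max_len:
            fragments.append(seq[start:end])
    return fragments

toy_genome = "AAGAATTCTTTTCCGGAAAAGAATTCGG"  # made-up sequence, not real data
rad_sites = cut_positions(toy_genome, ECORI)  # single-enzyme (RADseq-like) cut sites
ddrad_loci = ddrad_fragments(toy_genome, ECORI, MSPI, min_len=9, max_len=20)
```

The size window is what makes the ddRADseq-style set so much smaller than the RADseq-style set: every EcoRI site is a potential RAD locus, but only site pairs from different enzymes at the right spacing survive the double digest.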
Four general characteristics were compared across the four (plus two extra) sets of loci:

  1. the number of recovered loci and proportions of missing data
  2. topological uncertainty and statistical support for resolving nodes
  3. consistency in branch length estimates
  4. phylogenetic informativeness

The results aren’t surprising in many ways. Methods that target conserved sites (UCE and exon-capture) produce loci that are more likely to be recovered across taxa. Reduced representation methods produce a greater number of loci.
However, for most of the nodes on the primate tree, all four methods effectively resolved the “real” solution. Everybody wins!


Figure 6 from Collins and Hrbek – Mean divergence time estimates with 95% credible intervals at selected nodes


Within the reduced-representation methods, ddRADseq produces many fewer loci than RADseq, but allows investigators to greatly increase the number of samples using the same coverage. The smaller set of ddRADseq loci wasn’t able to resolve the oldest nodes on the primate tree, but both methods produced similar clade ages and levels of phylogenetic informativeness:

When compared to the results from the sequence capture methods, it is possible that the RADseq protocol generated more data than was actually necessary for resolving the phylogeny over the time scales studied here. However, the RADseq and ddRADseq data also have much higher relative, and in the case of RADseq also absolute information content, and thus are likely a better choice for resolving relationships at the population to species boundaries.

There were fewer differences between the sequence capture methods, as both confidently produced the correct tree. Exon-capture methods produced fewer (~1/4) loci compared to UCE, but those exon-capture loci had lower dropout rates and lower numbers of missing sites:

Comparing the UCE and exon-cap protocols, the latter provided the most complete data matrix, was least affected by phylogenetic divergence between taxa, and also displayed the most reliable, constant rate PI [phylogenetic informativeness] profile for molecular dating. With their greater degree of standardization and lower anonymity of the loci, both protocols also offer a more reliable solution to data sharing.

Okay, lots of small differences (including cost!), but who cares? If the taxa you study have large variations in divergence rate, have speciated rapidly in the past or recently, or present some other evolutionary scenario that makes life difficult for a phylogeneticist, you will care.
Inevitably, unavoidably, inescapably, the right choice is dependent on the question you’re asking. Once you know that, Collins and Hrbek’s in silico study provides a nice starting point to finding your in situ data.
 
Cited
Collins, R. A., & Hrbek, T. (2015). An in silico comparison of reduced-representation and sequence-capture protocols for phylogenomics. bioRxiv, 032565.
 

Posted in methods, next generation sequencing, phylogenetics

It's not you, it's my genes: Sexual fidelity tradeoffs in prairie voles

The adorable, (socially) monogamous prairie vole


Many of you probably already know the monogamous prairie vole as the yin to the promiscuous montane vole’s yang. Prairie voles are socially monogamous, which is an extremely rare trait among mammals. This trait has made the prairie vole the focus of decades of research on the biology (and neurobiology) of monogamy. The plethora of research has identified two neurotransmitters, vasopressin and oxytocin (aka, the “love hormone”), as key players in the formation and maintenance of the pair bond.
But recent work has shown that not all prairie voles are completely monogamous: some offspring are sired by neighboring males (i.e., not their caregiving father). And variation in vasopressin receptor (V1aR) levels in two brain regions, the laterodorsal thalamus (LDThal) and the retrosplenial cortex (RSC), predicts which males will wander and mate with extra-pair females. Interestingly, these two regions are part of the memory circuit, which led researchers to hypothesize that the males with low V1aR in LDThal and RSC have crappy memory, which causes them to repeatedly wander into locations where they previously got their butts kicked by another male. But because they keep wandering into these other territories, they increase their chances of mating with another female!
So a new study published in Science investigated the V1aR gene of prairie voles to see what gene regulatory mechanisms might be responsible for these differences and in what situations selection might favor monogamy versus promiscuity.

Posted in adaptation, evolution, genomics, next generation sequencing, selection

Gracilaria, currywurst and aebleskivers

Another travelogue for a Monday afternoon!
Our first official European stop on the Gracilaria vermiculophylla tour was in Germany and Denmark hosted by a colleague without whom we wouldn’t have been able to embark on this adventure!

Helgoland fading into the North Sea



I first met Florian Weinberger at a German phycological meeting in 2006 on the island of Helgoland (a phycologist’s dream apart from the weather). We’ve since shared many of the same model organisms, from Chondrus to Gracilaria, but only with this current project were we able to finally start a formal collaboration!
I met up with Erik at the airport in Hamburg and we made our way to Kiel where we met up with Dr. Weinberger.

Posted in blogging, conservation, evolution, haploid-diploid, natural history, population genetics

The biggest problem in landscape genetics and how to fix it

Landscape genetics is a field that has expanded rapidly in recent years, but that doesn’t mean it has gone without criticism. Perhaps the largest problem with landscape genetics (LG) studies is one of timing. If you observe genetic differentiation between two populations, is it truly due to the contemporary landscape or a more historical process? This is often described as “time lag”, the number of generations it takes for a population split/coalescence to manifest itself in genetic data. Although the strategies for incorporating genetic data with landscape variables have blossomed with the increase in LG studies, the strategies for separating these historical and contemporary landscape effects have not.
An upcoming review in Molecular Ecology by Clinton Epps and Nusha Keyghobadi lays the time lag issue on the table, reviews current work that proposes solutions, and makes recommendations for future strategies.
The first step is asking the question: What affects an investigator’s ability to detect a pattern in genetic data that is driven by landscape?

  1. The parameter you measure. Measures of inbreeding and heterozygosity change more slowly than something like Fst, but there are alternatives such as conditional genetic distance and the proportion of shared alleles distance.
  2. Not just the parameter (response variable), but also the analytical method that calculates it.
  3. The molecular marker under the microscope. Specifically, how quickly do certain groups of genetic loci change over time? What about loci under selection?
  4. Generation time of the taxon.
  5. Direction of change. Equilibrium is reached more slowly when populations are fragmented than in the opposite case, when populations are reconnected following the removal of a barrier.
  6. Dispersal, pop sizes, structure, dynamics… the list goes on, you get it.
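
To get a feel for how population size and generation time (items 4 and 6) stretch the lag, here is a back-of-the-envelope Python sketch. It uses the textbook expectation for loss of heterozygosity under drift alone in a completely isolated, idealized population, H_t = H_0 (1 - 1/(2Ne))^t. That model and the function names are assumptions for illustration, not something taken from Epps and Keyghobadi.

```python
import math

def heterozygosity_after(h0, ne, generations):
    """Expected heterozygosity after `generations` of drift in an isolated population of size ne."""
    return h0 * (1 - 1 / (2 * ne)) ** generations

def generations_to_lose(fraction, ne):
    """Generations until expected heterozygosity has dropped by `fraction` (e.g. 0.1 for 10%)."""
    return math.log(1 - fraction) / math.log(1 - 1 / (2 * ne))

def years_to_lose(fraction, ne, generation_time_years):
    """Same lag expressed in years for a taxon with the given generation time."""
    return generations_to_lose(fraction, ne) * generation_time_years
```

Under these assumptions a fragment with Ne = 50 needs roughly 10 generations to lose 10% of its heterozygosity, while one with Ne = 500 needs roughly 105. Multiply by a generation time of a few years and a recently built barrier can stay genetically invisible for a long time.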

If these are the causes for genetic lag, what are the solutions?


Landscape genetics decision tree provided by Epps and Keyghobadi (2015)


Epps and Keyghobadi lay out some detailed approaches that have been incorporated by other researchers, including strategies for when historical landscape information is known and when the landscapes of the past are a mystery.
Getting some idea of the historical landscape is the most helpful strategy to control for effects of past landscape on observed genetic patterns. You can look for historical data in traditional sources of ecological knowledge, like fire history maps, archival maps, and vegetation surveys. Alternatively, you can piece together an idea using combinations of past climate data, geological records, and ecological niche models.
Secondly, varying the type of analysis or molecular marker can provide at least a broad idea of differences in time scale between inferences of connectivity. One example would be combining microsatellite data and mtDNA data to assess connectivity at contemporary (BayesAss), historical (Migrate-n), and even more historical time scales (mtDNA divergence). Simulations are suggested as an important tool for creating expectations of time lag for multiple markers and various methods of analysis, adding this review to the “simulations are underused in molecular ecology” folder.
Finally, if you’re lucky, you simply have samples from both the present and the past.
After all of these approaches, the authors provide a unique spin on the time lag problem. That is, considering time lags as the measurement of interest:

We propose that an as-yet little exploited approach could be to take advantage of time lags in genetic structure to establish baselines for connectivity conservation. For instance, where known barriers to species dispersal have recently been constructed, rather than conducting LG analyses to determine whether an effect on genetic structure can be detected, LG analyses that consider and estimate time lags could show where the disconnect between pre- and post-fragmentation connectivity is greatest.

Epps, C. W., & Keyghobadi, N. (2015). Landscape genetics in a changing world: disentangling historical and contemporary influences and inferring change. Molecular Ecology.

Posted in methods, Molecular Ecology, the journal, population genetics

An Oedipus complex in mosses?

Nannandrous … phyllodioicous … gotta love botanical terms and these will most definitely find their way into this week’s list of favorite words! Both refer to the tiny epiphytic nature of males situated on much larger female shoots. There may be many hundreds of the so-called dwarf males per female shoot. This type of sexual system may decrease intersexual competition while increasing the potential for outcrossing and polyandry. Indeed, polyandry enables male-male competition and post-fertilization selection as well as halting the spread of selfish genetic elements.
Mosses are the only known sessile terrestrial organisms with epiphytic dwarf males. Spore dispersal in mosses is leptokurtic. So, male spores of closely related species and those produced by nearby or even the same female shoot may mature on a female. Females may, therefore, not be too choosy.

Homalothecium lutescens © nhgardenssolutions.wordpress.com



Few studies have addressed patterns of inbreeding versus outcrossing in haploid-diploid organisms, so it’s unclear to what extent a dwarf male sexual system leads to one or the other as compared to species without dwarf males.

Posted in bioinformatics, evolution, genomics, haploid-diploid, plants, population genetics

The next, next generation: long reads facilitate assembly & annotation in large genome species

Delicious wheat bread. This photograph is not paleo approved. Photo credit Mireya Merritt



The typical procedure for constructing a draft genome or transcriptome using current second-generation, high-throughput sequencing platforms involves generating short reads about 150 base pairs long, assembling those short reads into larger contigs, putting the contigs in the correct order to create chromosome sequences, and finally annotating protein-coding genes and other elements (for example, introns, transposons, etc.). The assembly of contigs can be complicated by a number of factors, particularly if the genome of the species of interest is very large (perhaps due to past genome duplication events), if there are many highly repetitive regions, and/or if there are many highly similar members of multigene families. Ideally, generating full-length reads (as opposed to short reads) would help improve assembly of problematic genomic regions, but generating very long sequences is labor intensive.

Posted in genomics, next generation sequencing, plants, Uncategorized

Long distance dispersal of modern humans outside of Africa

Long distance dispersal (LDD) has long been known to be an artifact of human migrations out of Africa. However, the effects of LDD on modern human diversity, and models of LDD in human colonization, are yet to be characterized. Using an ABC (Approximate Bayesian Computation) framework, Alves et al. (2015) estimate probabilities of four plausible scenarios of migration of anatomically modern humans: (1) simple range expansion out of East Africa, with a “stepping-stone” model of migration between adjacent populations, (2) model (1) with range contraction due to the Last Glacial Maximum (LGM) in Europe and Asia, (3) model (1) with LDD events to previously unoccupied demes, and (4) a full model comprising all three events. Their simulations utilized diversity data from 50 microsatellite loci, bootstrapped from 87 filtered loci from Tishkoff et al. (2009) and Pemberton et al. (2009). Their analyses clearly rejected models without LDD, with strongest support for model 4, also confirmed through accuracy and goodness of fit estimations. Interestingly, their findings reveal greater support for LDD into previously occupied demes than for LDD into previously unoccupied demes in Eurasia. Estimates of demographic parameters (ancestral and current population sizes, migration rates, growth rates) under the model with LDD were in agreement with previous estimates.
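
For readers new to ABC model choice, the basic logic can be sketched with a simple rejection sampler in Python. Everything here is a toy: the two "models", their priors, the single summary statistic, and the tolerance are hypothetical stand-ins, and Alves et al. relied on far richer simulations and may well have used a more refined ABC variant than plain rejection.

```python
import random

def abc_model_choice(simulators, observed, n_sims=20000, tolerance=0.05, seed=1):
    """Estimate model posterior probabilities by ABC rejection sampling.

    simulators: dict mapping a model name to a function(rng) -> summary statistic.
    observed:   the summary statistic computed from the real data.
    """
    rng = random.Random(seed)
    names = list(simulators)
    accepted = {name: 0 for name in names}
    for _ in range(n_sims):
        name = rng.choice(names)              # uniform prior over models
        simulated = simulators[name](rng)     # draw parameters and simulate a summary
        if abs(simulated - observed) <= tolerance:
            accepted[name] += 1               # keep simulations that resemble the data
    total = sum(accepted.values())
    if total == 0:
        raise ValueError("no simulations accepted; loosen the tolerance")
    # A model's share of the accepted simulations approximates its posterior probability.
    return {name: count / total for name, count in accepted.items()}

# Hypothetical toy models: each draws a migration rate from a prior and returns
# a noisy "diversity" summary. They are NOT the demographic models of Alves et al.
def stepping_stone(rng):
    m = rng.uniform(0.0, 0.5)
    return rng.gauss(0.4 + 0.2 * m, 0.05)

def with_ldd(rng):
    m = rng.uniform(0.0, 0.5)
    return rng.gauss(0.6 + 0.2 * m, 0.05)

support = abc_model_choice(
    {"stepping_stone": stepping_stone, "with_ldd": with_ldd}, observed=0.65
)
```

The returned dictionary is the toy analogue of the model probabilities reported in the paper: whichever model produces simulations that most often land near the observed summary gets the most support.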


Posterior probability density distributions under the four models of human expansion out of Africa simulated using 1000 bootstrap datasets. The full model (4) indicates the highest density. Image courtesy: Figure 2 of Alves et al. (2015) http://dx.doi.org/10.1093/molbev/msv332


Outstanding questions that Alves et al. (2015) discuss on the basis of these findings include the effects of archaic human introgression into modern humans outside of Africa on demographic expansion, ascertainment bias when using SNP and LD data for similar studies of demography, and Neolithic population growth in tandem with LDD events to offer insights into currently observed genetic diversity.

…LDD events from the core to the front might have quickly restored diversity and reshuffled the genetic diversity of populations in Eurasia. These LDD events might also explain why the gene pool of many human populations shows signals currently interpreted as admixture events between isolated populations (Moorjani, et al. 2013; Patterson, et al. 2012; Pickrell, et al. 2014) that could just represent normal patterns having been built since the exit of modern humans from Africa.

Reference:
Alves, Isabel, et al. “Long distance dispersal shaped patterns of human genetic diversity in Eurasia.” Molecular Biology and Evolution (2015): msv332. DOI: http://dx.doi.org/10.1093/molbev/msv332
Tishkoff, Sarah A., et al. “The genetic structure and history of Africans and African Americans.” Science 324.5930 (2009): 1035-1044. DOI: http://dx.doi.org/10.1126/science.1172257
Pemberton, Trevor J., et al. “Sequence determinants of human microsatellite variability.” BMC genomics 10.1 (2009): 612. DOI: http://dx.doi.org/10.1186/1471-2164-10-612

Posted in evolution, genomics, natural history, population genetics

Pre-adapted algal ancestors colonized land

The colonization of land by plants 450 Mya marked a major transition on Earth and was one of the critical events that led to the emergence of extant terrestrial ecosystems.
Chief among the challenges the terrestrial environment presented for these early algal colonizers was acquiring nutrients, but the exact mechanisms that enabled these challenges to be overcome are not well understood.
Until now …
In a new paper in PNAS, Delaux et al. (2015) propose

the algal ancestor of land plants was preadapted for interaction with beneficial fungi [that improved a plant’s ability to capture nutrients] and employed these gene networks to colonize land successfully.


Posted in adaptation, bioinformatics, Coevolution, evolution, genomics, haploid-diploid, next generation sequencing, phylogenetics, plants, transcriptomics

Live fast, diapause young: The African turquoise killifish genome



Your newly sequenced genome isn’t going to get into Nature, Science, or Cell just because it “hasn’t been done before”. You need to have a hook. And speaking of hooks, there are two new fish genome papers out in Cell! (and you’re welcome for that punny transition)

Posted in genomics, natural history, next generation sequencing, selection, Uncategorized

Best laid plans of algae and academics oft go astray

When you’re stuck in and feel some procrastination is in order … write another travelogue post!
I’ve wanted to spin some yarns about field mishaps. There’s no way we could sample over 45 sites without something going wrong.
For our Northeast sampling leg, I’ve been pondering whether to just talk about the field or someone’s research. But, since this leg was a comedy of errors, I thought it would be a light hearted way to go into the weekend.
The plan was for myself and a student to head north and spend several days scouting sites from New York City north to Great Bay in New Hampshire.
In 2014, I had scouted sites in Maine and didn’t find any Gracilaria vermiculophylla. So, we had a few known populations from work out of Carol Thornber’s lab, but nothing definite planned apart from Adam’s Point in Great Bay. A Gracilaria road trip was in store.
Then came along a little thing called demonic intrusion … a wicked, amorphous thing that never hesitates to materialize at the least opportune time.
Less than a week before we left, it became clear we needed everyone in the lab as both Erik and I were out in the field. I was bereft of a field buddy and sampling mudflats I’d never been to. Possibly a bad combination …
Even when you have a field buddy, it doesn’t guarantee you’ll not get stuck and be on your own to free yourself. In the end, all the Northeastern sites were sandy beaches with rocks and maybe a bit of mud, but totally workable by your lonesome. Oh hindsight …
But, in the panic before I left, I must have jinxed myself and will, therefore, digress for a brief bit of self-deprecating humor.

Posted in blogging, community, evolution, haploid-diploid, population genetics