Opening Pandora’s box: PSMC and population structure

Essentially, all models are wrong, but some are useful. — George Box

Population size change vs population structure. Source: Orozco-terWengel, Heredity 2016

The publication of Li and Durbin’s 2011 paper, “Inference of human population history from individual whole-genome sequences”, was a milestone in the inference of demography.

By allowing the estimation of population dynamics from a single diploid genome, Li and Durbin’s pairwise sequentially Markovian coalescent (PSMC) model is perfectly suited for the genomic era of “less is more”, i.e. sequencing whole genomes of a few individuals rather than sequencing a few loci in many individuals.

“The distribution of the time since the most recent common ancestor (TMRCA) between two alleles in an individual provides information about the history of change in population size over time.” (Li and Durbin 2011)

PSMC uses the coalescent approach to estimate changes in population size. Each diploid genome is a collection of hundreds of thousands of independent loci. PSMC estimates the TMRCA of the two alleles at each locus and uses these estimates to build a TMRCA distribution across the genome. Because the rate of coalescent events is inversely proportional to the effective population size (Ne), this distribution reveals periods of Ne change. For example, when many loci coalesce at the same time, it is a sign of small Ne at that time.
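To make the link between Ne and coalescence times concrete, here is a minimal Python sketch. It is not PSMC itself: it assumes a constant-size diploid population and fully independent loci, and draws the pairwise TMRCA at each locus from an exponential distribution with rate 1/(2Ne) per generation. The function name and the Ne values are illustrative assumptions.

```python
import random


def simulate_tmrca(ne, n_loci, seed=0):
    """Draw pairwise coalescence times (in generations) for independent loci.

    Assumption: constant diploid population of size `ne`, so two lineages
    coalesce at rate 1 / (2 * ne) per generation.
    """
    rng = random.Random(seed)
    rate = 1.0 / (2.0 * ne)
    return [rng.expovariate(rate) for _ in range(n_loci)]


# Two toy populations that differ tenfold in effective size.
small = simulate_tmrca(ne=1_000, n_loci=20_000)
large = simulate_tmrca(ne=10_000, n_loci=20_000)

mean_small = sum(small) / len(small)
mean_large = sum(large) / len(large)

# The mean TMRCA is close to 2 * Ne generations in each case.
print(f"Ne = 1,000:  mean TMRCA ~ {mean_small:.0f} generations")
print(f"Ne = 10,000: mean TMRCA ~ {mean_large:.0f} generations")
```

In the smaller population the loci coalesce, on average, ten times more recently. That clustering of coalescence events is the kind of signal PSMC reads as a period of low Ne.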

This approach is becoming extremely popular in whole-genome studies and is of particular interest in ancient DNA and conservation genomics. Among others, it has been applied to study the demographic history of the giant panda (Zhao et al. 2012), the passenger pigeon (Hung et al. 2014) and the woolly mammoth (Palkopoulou et al. 2015).

However, PSMC has several considerable limitations that should be kept in mind.

  • It doesn’t recover sudden changes in Ne
  • Nor does it recover recent changes, e.g. younger than 10,000 years BP in humans (Li and Durbin 2011).
  • Simulations suggest that it also performs worse in the case of very ancient changes in Ne (Mazet et al. 2015).
  • Using an incorrect mutation rate or generation time can bias the interpretation.
  • The change in Ne in a PSMC plot can actually be caused by population structure.
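The mutation rate and generation time caveat in the list above can be illustrated with a little arithmetic. The sketch below uses the standard relationships θ = 4·Ne·μ and, for time measured in units of 2Ne generations, years = t × 2Ne × g. It is not PSMC's own scaling code (which also involves a bin size), and every number in it is an illustrative assumption.

```python
def ne_from_theta(theta, mu):
    """Effective size implied by theta = 4 * Ne * mu (mu per site per generation)."""
    return theta / (4.0 * mu)


def years_from_coalescent_time(t, ne, gen_time):
    """Convert time in units of 2 * Ne generations into years."""
    return t * 2.0 * ne * gen_time


theta = 0.001   # illustrative per-site diversity
t = 0.5         # illustrative time point, in units of 2 * Ne generations

# Same data, two different assumed mutation rates (generation time of 25 years).
ne_slow = ne_from_theta(theta, mu=1.25e-8)
ne_fast = ne_from_theta(theta, mu=2.5e-8)

print(f"mu = 1.25e-8: Ne ~ {ne_slow:.0f}, "
      f"t ~ {years_from_coalescent_time(t, ne_slow, 25):.0f} years")
print(f"mu = 2.5e-8:  Ne ~ {ne_fast:.0f}, "
      f"t ~ {years_from_coalescent_time(t, ne_fast, 25):.0f} years")
```

Doubling the assumed mutation rate halves both the inferred Ne and the inferred age of the event, so the shape of the curve is preserved while both axes shift.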

The Fourth Reviewer: What problem is open peer review trying to solve?

(Flickr: Alan Levine)

Tim Vines is an evolutionary ecologist who found his calling in the process of peer review. He was Managing Editor of Molecular Ecology from 2008 to 2015, launched The Molecular Ecologist in 2010, and is the founder and Managing Editor of Axios Review. Here, Tim is The Fourth Reviewer, taking on your questions about peer review and publishing. Got a question for the Fourth Reviewer? Send us an e-mail!

Q. I’ve seen you take numerous pot shots at open peer review. What gives?

I have to admit that I’m not a believer in open peer review. It’s always seemed like the potential downsides of the solution outweigh the problems it purportedly addresses. To be clear, I’m talking about peer review where the reviewers know the identities of the authors, and the authors are given the identities of the reviewers. Signed review, where the names of the reviewers are made public alongside the accepted article, is an entirely different beast and won’t be discussed here.

There are three critical issues that I see with open peer review:

1) The potential for revenge against critical reviewers. This quote in a piece about open peer review by Jon Tennant particularly struck me:

Early career researchers are perhaps the most conservative in this arena as they may feel afraid that by signing overly critical reviews (i.e., those which investigate the research more thoroughly), they will become targets for retaliatory backlashes from senior figures. The traditional double blind process therefore offers, in theory, a sort of protection for those in junior positions. … In a perfect world, we would expect that strong, honest and constructive feedback would be well received by senior researchers, but there is an apprehension that this would not be the case. However, retaliations to referees in such a negative manner are serious cases of academic misconduct, and likely to be dealt with as such.

Unfortunately, scientists are human beings. There will always be power asymmetries between them, and everyone will at times have the opportunity and desire to take revenge on others they feel have given them unfair criticism. Since this revenge would take the form of rating an enemy’s top-quality manuscript or grant proposal as ‘not great’, there is just no way that it could be detected, let alone result in disciplinary action.

Docker: making our bioinformatics easier and more reproducible

This is a guest post by Alicia Mastretta-Yanes, a CONACYT Research Fellow assigned to CONABIO, Mexico. Her research uses molecular ecology and genomic tools to examine the effect of changes on species distributions due to historical climate fluctuations as well as the effect of human management and domestication. You can find more about her research on her website: http://mastrettayanes-lab.org. She tweets about reproducible research, genomics and cycling in Mexico City as @AliciaMstt.

I decided to write this entry while reading the Results of the Molecular Ecologist’s Survey on High-Throughput Sequencing, because it stated that 89% (n=260) of molecular ecologists working with High-Throughput Sequencing are performing the bioinformatic analyses themselves. I could not think of a better place to share a tool that I think anyone performing bioinformatic analyses should know: Docker. I will explain what Docker is in a moment, but first let me state why I think we all should turn our eyes to it.

I suspect most of that 89% have little previous training in computer science. At least I hadn’t when I jumped into using ddRAD for my PhD; my PhD friends were in a similar situation, and so are my students. There are tons of papers out there presenting cool biological results out of genomic data, so we must be a clever lot capable of learning how to perform bioinformatic analyses. If you learned, or are learning, bioinformatics then you likely know that the first challenge is not understanding how to use a command line program, but actually installing the damn thing and all its (never-ending) dependencies (maybe you have access to an HPCC and the cluster admin does that for you, but still, you end up having to install some stuff on your personal or lab computer). If so, you likely also know that installing something can mess up something else. You may have left a Linux computer out of service for a couple of panicking days, or you may have had to perform a fresh install of your Mac’s OS (or you want to, but that would mean figuring out again the installation of that precious software that cost you so much to get running). As if this were not enough, just yesterday they released a new version of that software you already have installed, and you would like to upgrade, if only you were not afraid of sharks:

https://xkcd.com/349/

Success. (xkcd)

The solution to all this comes in the shape of a nice blue whale called Docker:

(Docker logo)

What’s all the buzz about? Bees got microbiomes too!

So I know we are all blabbing about the human microbiome. Who isn’t fascinated by the impressive roles tiny microbes have in our lives!? Trying to unravel what exactly our microbial communities do for us and how they relate to our health is a pretty interesting challenge that we will (…I’m optimistic) figure out. However, we aren’t the only ones on the face of this beautiful planet that matter, as I’m pretty sure everyone knows.

Figure 1. Kwong & Moran (2016)

Our microbial symbionts do a bunch of important things, and it’s become clear that their part in our lives (and other organisms’ lives) is not only essential, but also not well understood and quite variable among different species. It definitely doesn’t help that we have to rely on new and fancy sequencing techniques just to begin the process of characterizing the bugs that have such a big influence on our lives.

Figure 2. Kwong & Moran (2016)

There are plenty of actual bugs (not bacteria) that play a big part in keeping our planet going. In particular, the honey bee (Apis mellifera) is an ideal model for studying host-specific microbial communities. These bees exist in large colonies mainly made up of female workers and reproductive queens. Interestingly, as summarized in a review published this week in Nature Reviews Microbiology by Kwong and Moran, the worker bees have a gut microbiota made up of just 9 species clusters. Crazy enough, we can culture all of the main microbiota that live in bee guts, while it is pretty tough to culture the majority of mammalian gut microbes.

It is often the case that in order to unravel more complex systems, it’s best to start with simpler versions. Honey bees transmit their gut microbiomes, which are generally characterized by communities that have adapted specifically to their hosts and that grow best under oxygen concentrations lower than those found in the air (just like our microbiomes!). In contrast to humans, however, bee guts harbor about 9 species clusters, and manipulation of these microbial communities is pretty darned easy. Granted, these groups are defined at the 97% operational taxonomic unit (OTU) level (using 16S rRNA gene sequences), and studies have found that this is maybe not always the best method to define species. Regardless, even at this level, it seems that there is more of a diverse party going on in our guts than in bees’.

As is often the case, recent sequencing tools have made the analysis of these model organisms better and more affordable. Checking out a simpler system than our own might allow for more concrete conclusions to be drawn regarding which microbiota are associated with specific hosts and ecotypes, allowing us to paint a beautiful picture of the adaptations that microbial communities have developed to thrive in a unique and distinct niche. This could be an important step in understanding the link between phenotype and genotype, ultimately elucidating the nucleotide changes responsible for specific physiological abilities.

This recent review is interesting for a variety of reasons. In particular, honey bees are essential to all of us all over the globe. They are key in pollinating essential crops, and maybe the recent decrease in bee populations, which has been in the news lately, will allow us to figure out how to improve bee health as well as drive interest in clarifying symbiotic relationships that are important for all of us to understand. What a beeautiful system to study (sorry, had to do it).

Reference:

Kwong, Waldan K., and Nancy A. Moran. “Gut microbial communities of social bees.” Nature Reviews Microbiology (2016).

RADseq and missing data: some considerations

Unlike Sanger sequencing, where loci are directly targeted for each individual and sequencing errors are relatively rare, massively multilocus datasets from next generation sequencing platforms are characterized by large amounts of missing data. This is particularly true for restriction digest based (RADseq) approaches, where data are lost at every stage from the lab bench to the computer (Figure 1). During RADseq library preparation, for instance, mutations at cutsites may directly generate null alleles, or newly-mutated cutsites within loci may reduce fragment size and thus cause allelic dropout when these fragments are lost during size selection. During sequencing, the random allocation of a finite number of reads across numerous loci and individuals results in discrepancies in coverage. And during data processing and assembly, decisions about the total number of variant sites allowed (sequence identity) and minimum number of reads required to properly genotype each individual for a locus (coverage threshold) further prune your matrix.

Figure 1 from Huang and Knowles (2016), highlighting the origin of missing data during steps from library preparation to sequence assembly.

Which means if you have a pile of RADseq data — as most of us do, these days — it’s necessarily going to be patchy. But what are the effects of missing data on inferences, and how should they be handled to best reduce their biases? Though major questions remain, the following four studies offer some insight.

Allelic dropout results in overestimates of genetic variation: An obvious starting point to understanding the implications of missing RADseq data is exploring how allelic dropout (ADO) from mutations at cutsites affects population genetic inferences. In a 2012 paper in Molecular Ecology, Mathieu Gautier and colleagues asked just that, mathematically deriving the influence of ADO on allele frequencies and using a RADseq dataset simulated under a coalescent model to explore its influence on expected heterozygosity and FST. Their major takeaways: the frequency of ADO depends on mutation rate and effective population size, and ADO results in overestimates of genetic variation both within and between populations. What to do about it? Gautier et al. suggest that a “practical solution might consist in detecting and removing the RAD loci characterized by high ADO frequency (say with fr =0.5).”
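The filtering step Gautier et al. suggest might look something like the Python sketch below. One loud assumption: the ADO frequency of a locus is not observed directly, so this toy example uses the fraction of individuals with no genotype call as a rough proxy, even though missing calls also arise from low coverage and assembly settings. The function names, the data layout (a dict of loci with `None` for missing calls) and the toy genotypes are all hypothetical, and this is not the authors' method.

```python
def missing_fraction(genotypes):
    """Fraction of individuals with no genotype call (None) at one locus.

    Assumes a non-empty list of calls.
    """
    return sum(g is None for g in genotypes) / len(genotypes)


def filter_loci(matrix, max_missing=0.5):
    """Keep only loci whose missing fraction is below `max_missing`."""
    return {
        locus: calls
        for locus, calls in matrix.items()
        if missing_fraction(calls) < max_missing
    }


# Toy genotype matrix: locus name -> one call per individual.
toy = {
    "locus_A": ["0/1", "0/0", "1/1", "0/1"],   # no missing calls
    "locus_B": ["0/0", None, "0/0", "0/0"],    # 1 of 4 missing
    "locus_C": [None, None, "1/1", None],      # 3 of 4 missing
}

kept = filter_loci(toy, max_missing=0.5)
print(sorted(kept))
```

Here `locus_C` is dropped and the other two loci are retained. The 0.5 cutoff simply echoes the value quoted above; in practice the threshold is a judgment call that trades the number of loci against bias.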

On Integrative Species Delimitation…

Accurate delimitation of species is a fundamental first step that underlies much of what we do in biology. But this can prove challenging in many situations. Why? Let me count the ways. Incomplete lineage sorting, hybridization, morphological conservatism, and niche conservatism, to name a few. Of course, access to complete sampling from all OTUs across their geographic ranges is very often an issue as well.

Furthermore, consider the fact that, for well-studied faunas and floras, we desire to illuminate species boundaries in the hardest-to-delimit clades. That is, most of the clearest species boundaries have been identified already. Thus, when applying delimitation methods to modern empirical data, we are judging them based on their performance on the most recalcitrant of datasets.

Given that ambiguity can exist for species boundaries in multiple types of data, a holistic approach to species delimitation makes good sense. Among the plethora of delimitation methods described over the past 7 years or so, several accommodate data from gene- or species trees, as well as phenotypes and even geography (e.g., see here and here). You can see Melissa DeBiasse’s review of one of these methods (iBPP; Solís‐Lemus et al. 2015) from last year.

An interesting philosophical aspect of holistic delimitation methods is their formal integration of phenotypic data with our modern, coalescent-based framework for analyzing multilocus data. Phenotypic data have a long (and ongoing) history in systematics. However, they have occasionally been eschewed in phylogeny reconstruction, based in part on the difficulty of modeling evolution of some morphological characters.

No recent delimitation methods seek to model morphological evolution in a way analogous to the way we model molecular evolution. Instead, they rely (variously) on modeling trait variances within and among species. In iBPP specifically, these variances are assumed a priori to have resulted from a Brownian motion model of evolution, which may or may not be accurate for the trait(s) under consideration. Indeed, many morphological traits fail to conform to strict Brownian expectations, which is a reminder of the problems associated with modeling trait evolution. Still, the authors of iBPP suggest their method is somewhat robust to violations of a Brownian model.

Have holistic methods such as these become a gold standard for species delimitation? The jury is still out. Answering that question will require further assessment of the methods across a greater diversity of clades and in more complex speciation scenarios. It will also depend on the extent to which current approaches to modeling trait evolution actually capture the dynamics of that process.

Solís‐Lemus, C., Knowles, L. L., & Ané, C. (2015). Bayesian species delimitation combining multiple genes and traits in a unified framework. Evolution 69:492-507.

Signatures of the reproductive lottery

In marine populations, effective population sizes are usually several orders of magnitude lower than the census size. This difference is thought to be driven by

high fecundity, variation in reproductive success and pronounced early mortality, resulting in genetic drift across generations.

In other words, the adults that actually reproduce are only a fraction of the total population. Low ratios of effective to census population size are one of the key predictions of the ‘sweepstakes reproductive success’ (SRS) hypothesis. Yet, in the marine environment, the different methods and predictions used to test this hypothesis have resulted in conflicting outcomes.

One way to resolve discrepancies in testing SRS is to use temporal sampling. Riquet and colleagues from the Station Biologique de Roscoff used the marine invasive gastropod Crepidula fornicata as a model to test SRS in a new paper in Heredity.

A stack of Crepidula © Sergej Olenin

They followed the annual recruitment of Crepidula for nine consecutive years in the Bay of Morlaix in Brittany, France. Genetic diversity varied, in part due to fluctuations in recruitment intensity, but also due to nonrandom differences in reproductive success across the years.

There were strong departures from Hardy-Weinberg equilibrium (HWE) that were not attributed to null alleles, but rather to a temporal Wahlund effect.

A temporal Wahlund effect can arise from the juxtaposition of several groups with different allele frequencies, that is, offspring from different families.
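A small numerical sketch shows why pooling groups with different allele frequencies produces a heterozygote deficit. It assumes a single biallelic locus, equally sized groups and Hardy-Weinberg proportions within each group; the allele frequencies are made up for illustration and are not taken from the Crepidula data.

```python
def expected_het(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def wahlund(freqs):
    """Compare heterozygosity within groups to that expected for the pooled sample.

    Assumes equally sized groups, each in Hardy-Weinberg proportions.
    Returns (h_within, h_pooled, deficit).
    """
    p_bar = sum(freqs) / len(freqs)
    # Heterozygotes actually present in the pooled sample.
    h_within = sum(expected_het(p) for p in freqs) / len(freqs)
    # Heterozygotes expected if the pooled sample were one random-mating unit.
    h_pooled = expected_het(p_bar)
    return h_within, h_pooled, h_pooled - h_within


# Two hypothetical family groups with contrasting allele frequencies.
h_within, h_pooled, deficit = wahlund([0.2, 0.8])
print(f"observed heterozygosity: {h_within:.2f}")
print(f"expected if panmictic:   {h_pooled:.2f}")
print(f"deficit:                 {deficit:.2f}")
```

The deficit equals twice the variance in allele frequency among groups, so the more the groups differ, the stronger the apparent departure from HWE in the pooled sample.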

Temporal genetic variation and a reduced effective population size are both signatures of a reproductive lottery, but the genetic drift is weak in C. fornicata relative to other marine species. This could be due to particular life history attributes of this invasive gastropod which may play an important role in buffering genetic drift.

References

F Riquet, S Le Cam, E Fonteneau, F Viard. Moderate genetic drift is driven by extreme recruitment events in the invasive mollusk Crepidula fornicata. Heredity doi: 10.1038/hdy.2016.24
