The Tao of open science for ecology

I think we can all agree that science needs to be transparent, shared, and reproducible. Recently, however, the discussion about “open science” has been conducted mostly in online forums and less so in publications (hopefully Open Access ones!). This is why Hampton et al. decided to publish their idea of the path to open science in ecology for all to see – even those who are less active on social media.

Posted in science publishing, Uncategorized

Current archival practices limit our ability to reuse genetic data

Archiving genetic data is important for a lot of reasons, like ensuring reproducibility and transparency of results. Being able to access previously published data is also important, given that the same data set can often help answer a diversity of relevant questions in the field of evolutionary biology. In the current issue of Molecular Ecology, Pope et al. analyzed 419 data sets from 289 articles published in the journal over the last 5 years, recording the extent to which the data sets could be recreated given the geographic and temporal information provided by the authors. For example, for sequences collected across a geographic range, could Pope et al. determine which sequences were collected in which areas? If only unique sequences were uploaded to GenBank by the original authors, was the information needed to figure out the number of individuals from a given location that had a particular sequence also provided (i.e. sample sizes and haplotype/allele frequencies)? Did the authors report the timeframe in which they collected the samples?

Pope et al. found that since the 2011 implementation of the Joint Data Archiving Policy (JDAP), which requires that data supporting publications be made publicly available, the archiving of genetic data increased from 49% (pre-2011) to 98% (2011-today). To me, uploading genetic data to a curated database like GenBank or the European Nucleotide Archive feels as much a part of the process as writing the paper. Unfortunately, Pope et al. were unable to recreate 31% of the archived data sets they downloaded based on the information provided in the paper or with the sequence data themselves. Over a third of articles provided geographic information as text only, without geographic coordinates, and 18% of those described sampling only at the broader regional scale. About 40% of the articles provided no temporal information and 20% reported only a range of years.

While great progress has been made towards the public availability of genetic data, the lack of emphasis on provision of associated information, such as geographic location and time of sampling, may impede our ability to fully reproduce such studies or use their genetic data in new ways.

Pope et al. recommended that in order to make genetic data truly accessible and useful for future analyses, at a minimum, individual genotypes should be recoverable and linked to geographic and temporal information. The authors also suggested including a readme file with the archived data that provides relevant information, like the naming/coding system used to identify sequences generated in the study.
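To make that recommendation concrete, here is a minimal sketch of a completeness check one could run on a sample table before archiving. The field names (`sample_id`, `latitude`, `longitude`, `collection_date`) are hypothetical choices for illustration; they are not a standard proposed by Pope et al. or required by any repository.

```python
# Hypothetical minimum metadata for each archived genotype or sequence.
REQUIRED_FIELDS = ("sample_id", "latitude", "longitude", "collection_date")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one sample record."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat None and blank strings as missing; 0.0 is a valid coordinate.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


if __name__ == "__main__":
    samples = [
        {"sample_id": "A1", "latitude": -27.5, "longitude": 153.0,
         "collection_date": "2012-03"},
        {"sample_id": "A2", "latitude": "", "collection_date": None},
    ]
    for sample in samples:
        print(sample["sample_id"], missing_fields(sample))
```

A check like this only confirms that the fields are filled in; whether the coordinates and dates are accurate still depends on the authors.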

To fully realize the future potential of this data legacy, there should now be a greater push to link spatio-temporal metadata to genetic data and to develop standards and repositories that facilitate data deposition, curation and searchability.


Pope, L. C., Liggins, L., Keyse, J., Carvalho, S. B., & Riginos, C. (2015). Not the time or the place: the missing spatio‐temporal link in publicly available genetic data. Molecular Ecology, 24, 3802–3809. DOI: 10.1111/mec.13254

Posted in Uncategorized

Who came first – the Paleo- or Native American?

In yet another infamous Science vs Nature race, two studies published this Tuesday toss more cans of worms at the ongoing debate about the founding of the Americas – with disparate findings. Uh oh.

Representatives of six Native American tribes bury the remains of Anzick-1. Image courtesy:

Skoglund et al. Nature (2015) Genetic evidence for two founding populations of the Americas

In further evidence for what’s come to be known as the Paleoamerican model, Skoglund et al. (2015) analyzed the genomic ancestry of 63 individuals from 21 Native American populations with little evidence of European or African ancestry, at 600,000 SNPs, by computing f4 statistics. They reject the null hypothesis that Native Americans descend from one single homogeneous population after divergence from other distinct populations across the world. Native Americans also cluster with Amazonian, Mesoamerican, Australasian, and other Pacific island populations. Further analyses also indicate the possibilities of (a) Amazonians descending from an ancestor of Andamanese and other Australasian populations or, perhaps more plausibly, (b) ancestral admixture of Amazonians and ancestors of Native Americans, termed population “Y”. While questions remain about how population “Y” migrated into South America, this study warrants genomic analyses of more ancient remains to fill in the blanks.
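For readers who have not met f4 statistics before, here is a minimal sketch of the point estimate, using the standard definition: f4(A, B; C, D) is the average over SNPs of (pA - pB) × (pC - pD), where p is the allele frequency in each population. The allele frequencies below are made up for illustration, and this is not the pipeline Skoglund et al. used; a real analysis would also need standard errors (typically from a block jackknife) to test whether f4 differs from zero.

```python
def f4(p_a, p_b, p_c, p_d):
    """Point estimate of f4(A, B; C, D) from per-SNP allele frequencies.

    Each argument is a sequence of allele frequencies (one per SNP, same
    SNP order in all four populations). If (A, B) and (C, D) form clades
    with no gene flow between them, the expected value is zero.
    """
    n = len(p_a)
    if n == 0 or not (n == len(p_b) == len(p_c) == len(p_d)):
        raise ValueError("need four non-empty frequency lists of equal length")
    total = sum((a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d))
    return total / n


if __name__ == "__main__":
    # Hypothetical allele frequencies at three SNPs in four populations.
    pop_a = [0.10, 0.50, 0.80]
    pop_b = [0.20, 0.40, 0.70]
    pop_c = [0.30, 0.60, 0.90]
    pop_d = [0.35, 0.55, 0.85]
    print(f4(pop_a, pop_b, pop_c, pop_d))
```

A value far from zero, relative to its standard error, suggests that the simple tree grouping (A, B) apart from (C, D) does not fit, which is the kind of signal used to reject a single homogeneous founding population.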

Raghavan et al. Science (2015) Genomic evidence for the Pleistocene and recent population history of Native Americans

Raghavan et al. (2015) analyze whole genome sequences of 31 present-day individuals from the Americas, Siberia, and Oceania (with a sampling strategy similar to that of Skoglund et al. (2015)), 23 ancient genomes from the Americas, and SNP genotypes from 79 individuals from the Americas and Siberia. Admixture analyses place all Native Americans in one cluster (at K=4), indicating common ancestry of all Native Americans. At K=15, however, some Native American individuals show shared ancestry with Anzick-1 (from the Clovis site), with others clustering with Siberians, a pattern further supported by admixture graph analyses. Estimates of divergence times between Native Americans, Siberians, and Han Chinese indicated a consistent split time of ~23,000 ybp for both Native American groups. Analyses of SNP chip data, however, reveal a story similar to that reported by Skoglund et al. (2015), indicative of an ancestral admixture event which resulted in Oceanic ancestry in some Native American populations, although purportedly a more recent one, after the peopling of the Americas. The ancient genomes also revealed no evidence of admixture of Oceanic populations into ancient American peoples, further indicating no support for the Paleoamerican model.


Skoglund et al. “Genetic evidence for two founding populations of the Americas.” Nature (2015) DOI:

Raghavan et al. “Genomic evidence for the Pleistocene and recent population history of Native Americans.” Science (2015) DOI:

Posted in genomics, next generation sequencing, Paleogenomics, population genetics

Dozens of talks from the Evolution 2015 meetings are on YouTube

If, like me, you didn’t make it to the 2015 Evolution meetings — maybe the logistics of a trip to Brazil were beyond your financial and/or temporal means — you can make up for it with the big cache of videos posted to the conference’s YouTube channel. This is the second year the joint annual meeting of the American Society of Naturalists, the Society of Systematic Biologists, and the Society for the Study of Evolution has taken video of research presentations (with the permission of the presenters), and it’s good to see the practice continuing.

There are many, many talks to peruse, but here’s just one that looks like it’ll be of interest to Molecular Ecologist readers: Diego F. Alvarado-Serrano proposing a new, spatially-oriented version of the site-frequency spectrum, that may help understand historical changes in species’ ranges.

Posted in community, conferences, phylogeography, population genetics

Dispersal and the rainbow trout takeover


I’m going to keep rolling on the dispersal theme from last week and share a new paper by Ryan Kovach and colleagues that demonstrates the balance between dispersal and selection. Specifically, the authors show that this balance dictates the hybridization between a native and invasive trout species.

The authors used data spanning 24 years from two populations of cutthroat trout to detect changes in rainbow trout ancestry and quantify associated phenotypic variation. In this case, the danger for cutthroat trout populations is very real: too much hybridization with rainbow trout can lead to a hybrid soup of genomes in which native genomes disappear (Allendorf and Leary 1988).

Figure 1 from Kovach et al. (2015) showing the relationship between rainbow trout (RBT) admixture and length (a) or early migration (b)

Detected genetic introgression from rainbow trout increased dramatically from 1984 to 2003 (from 0% to 87% in one adult population!). And if you are a hybrid trout, the more rainbow trout genes you can get, the better. As the proportion of rainbow trout alleles goes up, body size goes up and time until migration goes down: two factors strongly associated with fitness.

However, the proportion of rainbow trout alleles entering the cutthroat populations was much greater than the proportion of alleles leaving, indicating selection against hybrids. And the selection coefficients against these hybrids were strong to boot, up to 0.88!

This left Kovach et al. with a simple explanation: dispersal by rainbow/cutthroat hybrids plays a huge role in the increase of hybrids over the past 24 years.

Thus, our study shows that combining data on fitness and dispersal is necessary to fully understand the mechanisms driving invasive hybridization and other eco-evolutionary dynamics [59]; the paucity of such data in wild animal populations makes this a novel step forward in our empirical understanding of how invasive introgression can spread in natural populations.



Allendorf, F. W., & Leary, R. F. (1988). Conservation and distribution of genetic variation in a polytypic species, the cutthroat trout. Conservation Biology, 170-184.

Kovach, R. P., Muhlfeld, C. C., Boyer, M. C., Lowe, W. H., Allendorf, F. W., & Luikart, G. (2015). Dispersal and selection mediate hybridization between a native and invasive species. Proceedings of the Royal Society of London B: Biological Sciences, 282(1799), 20142454.

[59] in the quote above: Lowe, W. H., & McPeek, M. A. (2014). Is dispersal neutral? Trends in Ecology & Evolution, 29(8), 444-450.

Posted in adaptation

What to do with all those pesky mtDNA reads in your NGS experiment

Have you ever noticed how many reads from your high throughput sequencing project map to the tiny fraction of your genome that is the mitochondrial genome (mtDNA)? Pretty much any NGS experiment (e.g., RNA-seq, DNA-seq, capture-based sequencing) leaves you with ultra-deep coverage of mtDNA. But what do you do with those reads? The most common option is to ignore reads mapping to mtDNA. A much less common option is to turn them into a Science paper. But what if you want to do something with those reads and not publish it in Science?

Posted in bioinformatics, genomics, howto, mutation, software, Uncategorized

IMa2p – Parallel Isolation with Migration Analyses

I figured that it was time to write an update on my post from a year ago on Bayesian MCMC in inferring ancestral demography. Recently, my postdoctoral advisor, Jody Hey, and I released a version of the popular IMa2 program called “IMa2p”, which extends all the functionalities of IMa2 (and more!) so that your divergence genomics analyses run faster than they could before. Here is a quick blurb from our recent paper, where we describe the algorithms and the speedups in computation that IMa2p has to offer.

Speed-ups in computational time using IMa2p, using datasets of varying sizes. Image from Fig. 1 of Sethuraman and Hey (2015).

IMa2 (Hey and Nielsen 2007, and other programs in the IM suite) is a Bayesian MCMC-based method that estimates ancestral demography (population mutation rates, divergence times, and migration rates) under an ‘Isolation with Migration’ (IM) model (Nielsen and Wakeley 2001). If you’ve used IMa2 (or any other Bayesian MCMC sampler) before, you will have noticed that increasing the size of the data (the number of genotyped loci, number of individuals, size of loci, or number of populations, and correspondingly the number of parameters) increases the computational time super-exponentially (also see Hey 2010). Runs on larger data sets are also computationally intensive and increasingly slow to converge (see my earlier post on what this means). IMa2p is a parallelized (OpenMPI-C++) version of IMa2, which distributes the MCMC step (also called the ‘M’ mode in IMa2 parlance) across multiple cores, and collates sampled genealogies across processors while estimating posterior density distributions and performing likelihood ratio tests (also called the ‘L’ mode).

In our paper, we report (a) increasingly linear improvement in computational speed as the number of loci analyzed increases, (b) greater departure from linearity when the variance in computational time among loci is high (e.g., when using large priors on migration rates), and (c) consistent estimates of posterior density distributions across varying numbers of processors/cores.
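As a rough intuition for why uneven computational time pulls speedup away from linear, here is a toy model. It is not IMa2p’s scheduling or its benchmark code: it simply assumes independent units of work with made-up run times, hands them to workers round-robin, and measures speedup as the serial time divided by the time taken by the slowest worker.

```python
def speedup(unit_times, n_workers):
    """Speedup of a toy parallel run over a serial run.

    unit_times: hypothetical run time of each independent unit of work.
    n_workers: number of processors; units are assigned round-robin.
    """
    if n_workers < 1 or not unit_times:
        raise ValueError("need at least one worker and one unit of work")
    loads = [0.0] * n_workers
    for i, t in enumerate(unit_times):
        loads[i % n_workers] += t
    # The parallel run finishes only when the most loaded worker finishes.
    return sum(unit_times) / max(loads)


if __name__ == "__main__":
    even = [1, 1, 1, 1]    # equal run times: speedup equals the worker count
    uneven = [5, 1, 1, 1]  # one slow unit dominates the parallel run
    print(speedup(even, 4), speedup(uneven, 4))
```

With equal run times, four workers give a fourfold speedup; when one unit takes five times longer than the rest, the same four workers give well under twofold, because everyone waits for the slowest one.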

You can download IMa2p and instructions on installation and running it on my Git page here.

Good luck, and do write to me if you have any questions or want to report bugs!

References: Sethuraman, Arun, and Jody Hey. “IMa2p–parallel MCMC and inference of ancient demography under the Isolation with migration (IM) model.” Molecular ecology resources (2015). DOI:

Nielsen, Rasmus, and John Wakeley. “Distinguishing migration from isolation: a Markov chain Monte Carlo approach.” Genetics 158.2 (2001): 885-896.

Hey, Jody. “Isolation with migration models for more than two populations.” Molecular biology and evolution 27.4 (2010): 905-920. DOI:

Posted in bioinformatics, genomics, howto, software, theory

Dispersal by land or by sea

Here, we compare and contrast the traits and selective forces influencing the evolution of dispersal in marine and terrestrial systems. From this comparison, a unifying question emerges: when is dispersal for dispersal and when is dispersal a by-product of selection on traits with other functions?

Dispersal sometimes seems like one of the “big things” that gets lost in the present trajectory of molecular ecology. We know a lot about how dispersal varies between species, populations, and individuals, but it sure is a tricky set of parameters to include in most modern popgen analyses.

One of the (many) reasons dispersal and gene flow are so difficult to generalize is that the why and how can vary so greatly between organisms. Is that jellyfish “dispersing” or just floating around? Is that frog dispersing because of the density of conspecifics or some other reason?


“Wait, uhh, am I dispersing here or what?” (Image by Sanjay Acharya)

Burgess et al. recently published a review that tackles some of these issues and points out that there are big differences between marine and terrestrial dispersal (surprise!) that mostly get left out of theory. However, a bigger goal of the review is asking scientists to think harder about dispersal as a direct adaptation or as a by-product of some other process, and they outline a multivariate model for getting started.

Figure 1 (B) of Burgess et al. (2015)

I’ll leave it up to you, and how adaptationist you are, to decide when a by-product is actually an adaptation. I’m not here to get all spandrel-y. For now, it’s hard to argue against understanding the complexity that underlies dispersal, whether by land or by sea.

A trait-based approach, focused on selection on traits that influence dispersal, will not only improve our understanding of when dispersal is a direct adaptation versus a by-product, but can also advance the integration of theory and data. Theories of dispersal evolution would benefit from considering the evolutionary causes of movement in general as well as additional agents of selection on the multiple traits that influence dispersal specifically.


Burgess, S. C., Baskett, M. L., Grosberg, R. K., Morgan, S. G., & Strathmann, R. R. (2015). When is dispersal for dispersal? Unifying marine and terrestrial perspectives. Biological Reviews. DOI: 10.1111/brv.12198


Posted in population genetics

Raising the NIH pay-line to 20%

I bet that title got your attention.

In the good ol’ days our funding record made the United States look like the land of milk and honey. As Bruce Alberts and colleagues wrote in PNAS earlier this year:

“The United States has traditionally been viewed as the land of opportunity for young scientists, offering the most talented of them the chance to test their own ideas, raise radically new questions, and forge original paths to the answers. This feature of our system has drawn many of our most able young people to scientific careers, while simultaneously attracting outstanding young people to the United States from around the world.”

Well, those days are no more. Now young investigators are 6 times less likely to win an NIH grant than they were 30 years ago:

Percentage of NIH R01 Principal Investigators aged 36 and younger and aged 66 and older, 1980–2010 (from:

So what can be done about it? As a junior researcher, I think about this issue a lot – mostly because I’m selfish and want to know when and where my next academic “meal” is coming from. So it pleases me to see that others (read: those with more clout) are organizing workshops to try and right the ship.

Posted in career, funding, NIH, politics, United States

Genomics: the “four-headed beast” of Big Data

Big Data in the cloud. Photo from

When I bought my first laptop in 2005, it came with a free 64MB flash drive*, which I thought was pretty awesome. Given the rate at which genomic data generation has increased in the past decade, the storage capacity of that flash drive is laughable today. In their new PLOS Biology paper, Stephens et al. talk about genomics as a Big Data science, compare it to other Big Data domains (Astronomy, YouTube, and Twitter, specifically), and project where genomics is headed in the next decade in terms of data acquisition, storage, distribution, and analysis.

Posted in Uncategorized