Docker: making our bioinformatics easier and more reproducible

This is a guest post by Alicia Mastretta-Yanes, a CONACYT Research Fellow assigned to CONABIO, Mexico. Her research uses molecular ecology and genomic tools to examine how historical climate fluctuations, human management, and domestication have changed species distributions. You can find more about her research on her website: http://mastrettayanes-lab.org. She tweets about reproducible research, genomics and cycling in Mexico City as @AliciaMstt.
I decided to write this entry while reading the Results of the Molecular Ecologist’s Survey on High-Throughput Sequencing, because it stated that 89% (n=260) of molecular ecologists working with High-Throughput Sequencing are performing the bioinformatic analyses themselves. I could not think of a better place to share a tool that I think anyone performing bioinformatic analyses should know: Docker. I will explain what Docker is in a moment, but first let me state why I think we all should turn our eyes to it.
I suspect most of that 89% have little previous training in computer science. At least I hadn’t when I jumped into using ddRAD for my PhD; my PhD friends were in a similar situation, and so are my students. There are tons of papers out there presenting cool biological results out of genomic data, so we must be a clever lot capable of learning how to perform bioinformatic analyses. If you learned, or are learning, bioinformatics then you likely know that the first challenge is not understanding how to use a command line program, but actually installing the damn thing and all its (never-ending) dependencies (maybe you have access to an HPCC and the cluster admin does that for you, but still, you end up having to install some stuff on your personal or lab computer). If so, you likely also know that installing something can mess up something else. You may have left a Linux computer out of service for a couple of panicking days, or you may have had to perform a fresh install of your Mac’s OS (or you want to, but that would mean figuring out again the installation of that precious software it cost you so much to get running). As if this were not enough, just yesterday they released a new version of that software you already have installed, and you would like to upgrade, if only you were not afraid of sharks:

https://xkcd.com/349/

Success. (xkcd)


The solution to all this comes in the shape of a nice blue whale called Docker:
Docker logo

Posted in bioinformatics, software

What's all the buzz about? Bees got microbiomes too!

So I know we are all blabbing about the human microbiome; who isn’t fascinated by the impressive roles tiny microbes have in our lives!? Trying to unravel what exactly our microbial communities do for us, and how they relate to our health, is a pretty interesting challenge that we will (…I’m optimistic) figure out. However, we aren’t the only ones on the face of this beautiful planet that matter, as I’m pretty sure everyone knows.
Figure 1. Kwong & Moran (2016)
Our microbial symbionts do a bunch of important things, and it’s become clear that their part in our lives (and other organisms’ lives) is not only essential, but also not well understood and quite variable among different species. It definitely doesn’t help that we have to rely on new and fancy sequencing techniques just to begin the process of characterizing the bugs that have such a big influence on our lives.
 
Figure 2. Kwong & Moran (2016)
There are plenty of actual bugs (not bacteria) that play a big part in keeping our planet going. In particular, the honey bee (Apis mellifera) is an ideal model for studying host-specific microbial communities. These bees exist in large colonies mainly made up of female workers and reproductive queens. Interestingly, as summarized in a recent review in Nature Reviews Microbiology by Kwong and Moran, the worker bees have a gut microbiota made up of just 9 species clusters. Crazy enough, we can culture all of the main microbiota that live in bee guts, while it is pretty tough to culture the majority of mammalian gut microbes.
It is often the case that in order to unravel more complex systems, it’s best to start with simpler versions. Honey bees transmit gut microbiomes, which are generally characterized by communities that have adapted specifically to their hosts and that grow best under oxygen concentrations lower than those found in the air (just like our microbiomes!). In contrast to humans, however, bee guts harbor only about 9 species clusters, and manipulation of these microbial communities is pretty darned easy. Granted, these groups are defined at the 97% operational taxonomic unit (OTU) level (using 16S rRNA gene sequences), and studies have found that this is maybe not always the best method to define species... Regardless, even at this level, it seems that there is more of a diverse party going on in our guts than in bees’.
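For anyone wondering what a 97% cutoff means in practice, here is a minimal Python sketch. It assumes the two sequences are already aligned to the same length, and the 100-bp fragment and the helper names (percent_identity, same_otu, mutate) are made up for illustration. Real OTU-picking tools do the alignment and clustering for you and work quite differently under the hood.

```python
def percent_identity(seq_a, seq_b):
    """Percent of matching positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def same_otu(seq_a, seq_b, threshold=97.0):
    """True if the two sequences meet the identity threshold (97% by default)."""
    return percent_identity(seq_a, seq_b) >= threshold


def mutate(seq, positions):
    """Return a copy of seq with the base at each listed position changed."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


# Hypothetical 100-bp aligned 16S fragment (not real data)
reference = "ACGT" * 25
close = mutate(reference, [10, 50])                # 2 mismatches
distant = mutate(reference, [5, 20, 40, 60, 80])   # 5 mismatches

print(percent_identity(reference, close), same_otu(reference, close))
print(percent_identity(reference, distant), same_otu(reference, distant))
```

With these toy sequences, two mismatches out of 100 keep the pair in the same OTU (98% identity), while five mismatches push them into different ones (95%).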
As is often the case, recent sequencing tools have made the analysis of these model organisms better and more affordable. Checking out a simpler system than our own might allow for more concrete conclusions to be drawn regarding which microbiota are associated with specific hosts and ecotypes, allowing us to paint a beautiful picture of the adaptations that microbial communities have developed to thrive in a unique and distinct niche. This could be an important step in understanding the link between phenotype and genotype, ultimately elucidating the nucleotide changes responsible for specific physiological abilities.
This recent review is interesting for a variety of reasons. In particular, honey bees are essential to all of us all over the globe. They are key in pollinating essential crops, and maybe the recent decrease in bee populations, which has been in the news lately, will spur us to figure out how to improve bee health as well as drive interest in clarifying symbiotic relationships that are important for all of us to understand. What a beeautiful system to study (sorry, had to do it).
Reference:
Kwong, Waldan K., and Nancy A. Moran. “Gut microbial communities of social bees.” Nature Reviews Microbiology (2016).

Posted in Coevolution, community ecology, evolution, genomics, metagenomics, microbiology

RADseq and missing data: some considerations

Figure 1 from Huang and Knowles (2016), highlighting the origin of missing data during steps from library preparation to sequence assembly.

Unlike Sanger sequencing, where loci are directly targeted for each individual and sequencing errors are relatively rare, massively multilocus datasets from next generation sequencing platforms are characterized by large amounts of missing data. This is particularly true for restriction digest based (RADseq) approaches, where data are lost at every stage from the lab bench to the computer (Figure 1).

During RADseq library preparation, for instance, mutations at cutsites may directly generate null alleles, or newly-mutated cutsites within loci may reduce fragment size and thus cause allelic dropout when these fragments are lost during size selection. During sequencing, the random allocation of a finite number of reads across numerous loci and individuals results in discrepancies in coverage. And during data processing and assembly, decisions about the total number of variant sites allowed (sequence identity) and minimum number of reads required to properly genotype each individual for a locus (coverage threshold) further prune your matrix.
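To see how the coverage threshold alone prunes a matrix, here is a small Python sketch. The read depths are invented, and the function names and the min_individuals cutoff are assumptions for illustration rather than the settings of any particular assembly pipeline.

```python
def missing_fraction(depths, min_depth):
    """Fraction of individual-by-locus cells with fewer than min_depth reads."""
    cells = [d for row in depths for d in row]
    return sum(d < min_depth for d in cells) / len(cells)


def loci_retained(depths, min_depth, min_individuals):
    """Indices of loci with depth >= min_depth in at least min_individuals."""
    n_loci = len(depths[0])
    kept = []
    for j in range(n_loci):
        genotyped = sum(row[j] >= min_depth for row in depths)
        if genotyped >= min_individuals:
            kept.append(j)
    return kept


# Hypothetical read depths: 4 individuals (rows) x 5 loci (columns)
depths = [
    [12, 0, 7, 3, 20],
    [9, 5, 0, 2, 15],
    [0, 6, 8, 1, 18],
    [11, 4, 9, 0, 22],
]

for threshold in (1, 5, 10):
    print(
        threshold,
        missing_fraction(depths, threshold),
        loci_retained(depths, threshold, min_individuals=3),
    )
```

In this toy matrix, raising the threshold from 1 to 10 reads takes the share of missing cells from 20% to 70%, and the number of loci genotyped in at least three individuals drops from five to one.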

Which means if you have a pile of RADseq data — as most of us do, these days — it’s necessarily going to be patchy. But what are the effects of missing data on inferences, and how should they be handled to best reduce their biases? Though major questions remain, the following four studies offer some insight.

Posted in bioinformatics, genomics, methods, Molecular Ecology, the journal, next generation sequencing, phylogenetics, population genetics, theory

On Integrative Species Delimitation…

Accurate delimitation of species is a fundamental first step that underlies much of what we do in biology. But this can prove challenging in many situations. Why? Let me count the ways. Incomplete lineage sorting, hybridization, morphological conservatism, and niche conservatism, to name a few. Of course, access to complete sampling from all OTUs across their geographic ranges is very often an issue as well.
Furthermore, consider the fact that, for well-studied faunas and floras, we desire to illuminate species boundaries in the hardest-to-delimit clades. That is, most of the clearest species boundaries have been identified already. Thus, when applying delimitation methods to modern empirical data, we are judging them based on their performance on the most recalcitrant of datasets.
Given that ambiguity can exist for species boundaries in multiple types of data, a holistic approach to species delimitation makes good sense. Among the plethora of delimitation methods described over the past 7 years or so, several accommodate data from gene- or species trees, as well as phenotypes and even geography (e.g., see here and here). You can see Melissa DeBiasse’s review of one of these methods (iBPP; Solís‐Lemus et al. 2015) from last year.
An interesting philosophical aspect of holistic delimitation methods is their formal integration of phenotypic data with our modern, coalescent-based framework for analyzing multilocus data. Phenotypic data have a long (and ongoing) history in systematics. However, they have occasionally been eschewed in phylogeny reconstruction, based in part on the difficulty of modeling evolution of some morphological characters.
No recent delimitation methods seek to model morphological evolution in a way analogous to the way we model molecular evolution. Instead, they rely (variously) on modeling trait variances within and among species. In iBPP specifically, these variances are assumed a priori to have resulted from a Brownian motion model of evolution, which may or may not be accurate for the trait(s) under consideration. Indeed, many morphological traits fail to conform to strict Brownian expectations, which is a reminder of the problems associated with modeling trait evolution. Still, the authors of iBPP suggest their method is somewhat robust to violations of a Brownian model.
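As a reminder of what the Brownian assumption implies, the sketch below simulates a trait taking small random normal steps along independent lineages and checks that the variance among lineage tips grows roughly as the rate times elapsed time. This is a generic textbook simulation in Python, not the iBPP implementation, and the rate value, step count and function names are arbitrary choices for illustration.

```python
import random
import statistics


def simulate_bm_trait(sigma2, total_time, n_steps, rng):
    """Final trait value of one lineage evolving by Brownian motion from 0."""
    dt = total_time / n_steps
    step_sd = (sigma2 * dt) ** 0.5  # each step is Normal(0, sigma2 * dt)
    trait = 0.0
    for _ in range(n_steps):
        trait += rng.gauss(0.0, step_sd)
    return trait


def tip_variance(sigma2, total_time, n_lineages=5000, n_steps=20, seed=1):
    """Sample variance of final trait values across independent lineages."""
    rng = random.Random(seed)
    tips = [
        simulate_bm_trait(sigma2, total_time, n_steps, rng)
        for _ in range(n_lineages)
    ]
    return statistics.variance(tips)


sigma2 = 0.5  # hypothetical evolutionary rate
for t in (1.0, 4.0):
    # Under Brownian motion the expected variance is sigma2 * t
    print(t, round(tip_variance(sigma2, t), 3), sigma2 * t)
```

A trait under stabilizing selection or with bounded values would not show this steady, linear growth in variance, which is one way the Brownian expectation can fail.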
Have holistic methods such as these become a gold standard for species delimitation? The jury is still out. Answering that question will require further assessment of the methods across a greater diversity of clades and in more complex speciation scenarios. It will also depend on the extent to which current approaches to modeling trait evolution actually capture the dynamics of that process.
 
Solís‐Lemus, C., Knowles, L. L., & Ané, C. (2015). Bayesian species delimitation combining multiple genes and traits in a unified framework. Evolution 69:492-507.

Posted in evolution, methods, phylogeography, population genetics, software, species delimitation

Signatures of the reproductive lottery

In marine populations, effective population sizes are usually several orders of magnitude lower than the census size. This difference is thought to be driven by

high fecundity, variation in reproductive success and pronounced early mortality, resulting in genetic drift across generations.

In other words, the adults that actually reproduce are only a fraction of the total population. Low ratios of effective to census population size are one of the key predictions of the ‘sweepstakes reproductive success’ (SRS) hypothesis. Yet, in the marine environment, the different methods and predictions used to test this hypothesis have resulted in conflicting outcomes.
One way to resolve discrepancies in testing SRS is to use temporal sampling. Riquet and colleagues from the Station Biologique de Roscoff used the marine invasive gastropod Crepidula fornicata as a model to test SRS in a new paper in Heredity.

A stack of Crepidula © Sergej Olenin


They followed the annual recruitment of Crepidula for nine consecutive years in the Bay of Morlaix in Brittany, France. Genetic diversity varied, partly because of fluctuations in recruitment intensity, but also because of nonrandom differences in reproductive success across years.
There were strong departures from Hardy-Weinberg equilibrium (HWE) that were attributed not to null alleles, but rather to a temporal Wahlund effect.

A temporal Wahlund effect can arise from the juxtaposition of several groups with different allele frequencies, that is, offspring from different families.
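The arithmetic behind a Wahlund effect is short enough to sketch in Python. The example assumes a single biallelic locus, groups that are each in Hardy-Weinberg equilibrium, and made-up allele frequencies; the function name and the equal default weighting are illustrative choices, not anything taken from the paper.

```python
def wahlund(freqs, weights=None):
    """Observed vs expected heterozygosity when groups in HWE are pooled.

    freqs: allele frequency p of one allele in each group.
    weights: proportion of the pooled sample from each group (equal if None).
    Returns (h_obs, h_exp, f), where f = 1 - h_obs / h_exp is the deficit.
    """
    n = len(freqs)
    if weights is None:
        weights = [1.0 / n] * n
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    # Each group is in HWE, so its heterozygosity is 2p(1 - p)
    h_obs = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    # A naive test treats the pool as one population with frequency p_bar
    h_exp = 2 * p_bar * (1 - p_bar)
    f = 0.0 if h_exp == 0 else 1 - h_obs / h_exp
    return h_obs, h_exp, f


# Hypothetical cohorts from two families with different allele frequencies
h_obs, h_exp, f = wahlund([0.2, 0.8])
print(round(h_obs, 3), round(h_exp, 3), round(f, 3))

# Identical frequencies give no deficit
print(wahlund([0.5, 0.5]))
```

The gap between expected and observed heterozygosity equals twice the variance in allele frequency among groups, so the more the families differ, the stronger the apparent departure from HWE.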

Temporal genetic variation and a reduced effective population size are both signatures of a reproductive lottery, but the genetic drift is weak in C. fornicata relative to other marine species. This could be due to particular life history attributes of this invasive gastropod which may play an important role in buffering genetic drift.
References
F Riquet, S Le Cam, E Fonteneau, F Viard. Moderate genetic drift is driven by extreme recruitment events in the invasive mollusk Crepidula fornicata. Heredity doi: 10.1038/hdy.2016.24

Posted in evolution, natural history, population genetics, selection

What does the island fox say?

Small populations are characterized by strong genetic drift and reduced efficacy of selection, which result in fixation of both advantageous and deleterious alleles, accumulation of homozygosity, and often a reduction in population fitness. With mammal populations plummeting across biota, understanding the genomic basis of this dearth in diversity is key to developing informed conservation programs. With this goal in mind, Robinson et al. (2016) sought to estimate levels of genomic diversity in isolated populations of the endangered island fox, Urocyon littoralis, on six of California’s Channel Islands.


An island fox pup. Image courtesy: NPS (https://www.nps.gov/media/photo/gallery.htm?id=EF47A6A7-155D-4519-3E2D59C337F5F96F)


Using newly sequenced genomes from seven representative foxes from the islands and a mainland gray fox from Southern California, they determined that the San Nicolas individuals are nearly identical, with an extreme reduction in genome-wide heterozygosity (3-84 fold) compared to the mainland gray fox. Other island populations exhibited similar reductions in heterozygosity. However, San Miguel, the island with the smallest census population, harbored foxes with greater diversity than the San Nicolas population. To model this, the authors used approximate Bayesian computation (ABC) to simulate data under three scenarios: a population with no demographic changes (sensu mainland foxes), one with an older bottleneck (similar to the San Miguel foxes), and one with a very recent extreme bottleneck (similar to San Nicolas). Estimated effective population sizes under these models were consistent with true estimates, indicating support for the effects of small population sizes (in San Miguel) and recent bottlenecks (in San Nicolas).
Additional characterization of the types of mutations in regions of accumulated homozygosity on the islands revealed (a) a general increase in loss-of-function alleles/genotypes, and (b) enhancement of olfactory receptor genes in ancestrally heterozygosity-rich regions, primarily owing to demographic effects (as confirmed by simulations) and not to balancing selection. This study calls into question the long-held view that declining populations are on an evolutionary spiral down to extinction owing to increased genetic load.

The absence of obvious negative effects on population persistence from genetic deterioration may in part reflect a more benign island environment, given the lack of competitors and predators that exist on the mainland…Notably, our results contradict the notion that long-term small effective population size and inbreeding on the islands have enhanced purging and decreased their genetic load.

Reference:
Robinson, Jacqueline A., et al. “Genomic Flatlining in the Endangered Island Fox.” Current Biology (2016). DOI: 10.1016/j.cub.2016.02.062

Posted in adaptation, evolution, genomics, mutation, natural history, population genetics, selection

A new (quantitative!) method for comparative phylogeography


“I reckon the Rio Juruá has something to do with this widespread phylogeographic pattern!”* From Avise (2000) *not a direct quote


Comparative phylogeographic studies usually involve a) documenting a phylogeographic pattern and b) recognizing that the same pattern is congruent in multiple species.
But what if species histories are only sort of congruent? Perhaps they share one major splitting event but not later events. Or maybe the phylogenies are topologically congruent but on very different timescales. It would be great to measure the degree of phylogeographic discordance among species.
Hickerson et al. (2010), in their review of “Phylogeography’s past, present, and future” said:

“A key challenge for comparative phylogeography is the need for developing analytical tools that can be used to evaluate spatial and temporal congruence or incongruence in phylogeographic patterns across multiple species.”

Satler and Carstens (2016) have answered this call. They present the Phylogeographic Concordance Factor (PCF), a new metric for quantifying the phylogeographic concordance (or discordance) of several codistributed species.

Posted in Coevolution, phylogeography, plants, software

Disentangling the wolf-coyote admixture through an ancestry-based approach


Coyote (Canis latrans). Source: Wikimedia Commons/Christopher Bruno, http://www.sxc.hu


Large carnivores like bears and wolves still pose a puzzle for systematics and population genetics. The more data we get, the more complex their evolutionary history seems to be.

Posted in conservation, evolution, genomics, population genetics

Analysis of the human microbiome reveals you are (at least related to) what you eat, in a manner of speaking


Understanding microbial symbioses, and more specifically how the human microbiome affects our health, is currently a hot topic in the land of microbiology and metagenomics. The most recent special edition of Science focuses on reviews and articles centered on understanding the fundamental relationships between us and our most closely associated microbes.

Ever think that the group of people who think milk chocolate >> dark chocolate was a little special? Well, it turns out they might be different for more than just that obvious reason. Some studies just out today have revealed that variation in the human microbiome can also be linked to differences in other food preferences.

Falony et al., 2016 / Figure 5. Drug interactions in the FGFP

Still, while we have a long way to go before we understand the significance of the variance in microbial communities, a couple of articles just released in Science are some of the most extensive studies published to date on the human microbiome. The overarching goal of these impressive analyses was to try to understand what’s up with the microbes living in the large intestines of healthy individuals.

One of the studies, by Falony et al. (2016), included a survey of 3,948 northern Europeans, and presents a wealth of data demonstrating that we have a long way to go before we unravel the secrets that our microbiota want to tell us…or at least what their genomes have to say. All of this data might someday lead us to figuring out how we can enhance our own health by switching stuff up with our gut microbes, or how different drugs might affect different people (depending on microbial community composition).

Zhernakova et al., 2016 / Figure 2. Interindividual variation of microbial composition and function profile

There are so many variables to account for (different genetic backgrounds, ages, diets, not to mention when samples are taken post meal time…) that there’s a long way to go before we have a completely exhaustive dataset (is that even possible to attain??), which might be essential to figuring out health-linked stuff related to our microbiome. The other study, by Zhernakova et al. (2016), looked at 1,135 Dutch individuals and also demonstrated that there’s a lot we don’t know, since only 19% of the variation in the microbiome could be explained.

All of this data has led us to a bit of a chicken vs. egg situation: how much of our microbiome is influenced by genetics, and how much by our diet? One thing is clear: affordable next-generation sequencing and the relatively recent understanding of just how essential our microbiome is to our health will ensure that we’ll be studying our own bugs for many years to come.

References

Alexandra Zhernakova, Alexander Kurilshikov, Marc Jan Bonder, Ettje F. Tigchelaar, Melanie Schirmer, Tommi Vatanen, et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 352: 6285 (2016): 565-569. DOI: 10.1126/science.aad3369

Gwen Falony, Marie Joossens, Sara Vieira-Silva, Jun Wang, Youssef Darzi, Karoline Faust, et al. Population-level analysis of gut microbiome variation. Science. 352: 6285 (2016): 560-564. DOI: 10.1126/science.aad3503

Posted in genomics, medicine, microbiology, next generation sequencing, population genetics

The slow, and sometimes incomplete, journey to diploidy

Whether you are reading this as a plant, an animal, or fungus, it is likely that some ancestor of yours doubled up on genomes. However, it is likely that these extra genomes disappeared over evolutionary time. What gives? Where are those extra genomes that I should have rightfully inherited?


Diploidization, the mysterious process that reins extra genomes back to a diploid state, has been a vexing complexity for those who are trying to piece together the evolutionary history of ancient polyploids. For example, diploidization appears to happen at different rates at different chromosomes/loci in fish and maize.
So there are two outstanding questions. First, what taxa have undergone paleopolyploidy events? Second, how did they get back to diploidy?
 
The gradual descent into diploidy
One of the more recent whole-genome duplication events occurred in Salmonid fishes ~80 million years ago, making it a group of interest for understanding some long term evolutionary consequences of diploidization while still having enough genomic resolution to actually detect those consequences. In a recent issue of Nature, Lien et al. characterize the Atlantic Salmon genome in an attempt to document the ongoing process of diploidization in this species.
Indeed, Atlantic Salmon are still returning to diploidy:

Without exception, duplicated regions exhibiting rearrangements at telomeres in the form of inversions, translocations or larger deletions all displayed a sequence similarity of ∼87%. This clear correspondence between the degree of intra-block sequence similarity and blocks predicted to still participate in tetrasomic inheritance (or recently have done so) suggests that up to 25% of the salmon genome experienced delayed rediploidization after the initial large chromosome rearrangements, and that as much as 10% of the genome may still retain residual tetrasomy


From Figure 3c of Lien et al. (2016), displaying a hypothetical model of post genome duplication (Ss4R) rediploidization in Atlantic Salmon.


During this process of diploidization, duplicated genes that are nonfunctional are often lost. Those functional duplicates that stick around can be the result of neofunctionalization, where one duplicate acquires a new function compared to the other, or subfunctionalization, where each duplicate retains only one part of the function from their ancestral gene. Lien et al. suggest more instances of neofunctionalization in Atlantic Salmon compared to subfunctionalization.

The predominance of cases where only one copy has changed its regulation compared to the assumed ancestral state indicates that regulatory subfunctionalization has not been a dominant duplicate retention mechanism post [genome duplication event], unless it was followed by subsequent neofunctionalization, which has been suggested as a common process.

 
When diploidization gets odd
Where the Atlantic Salmon may be steadily becoming diploid while retaining genes with new functions, another recent publication highlights a taxon in which diploidization got…odd.
The heartleaf bittercress (Cardamine cordifolia) is a widespread and ecologically successful flowering plant in western North America that happens to be triploid. This scenario is unusual because other triploid relatives are sterile. What makes C. cordifolia so special?


“Chromosome painting” is a technique to visualize happy little chromosomes using in situ hybridization


Mandáková et al. used chromosome painting to investigate the paradoxical genome number in C. cordifolia, and it turns out that the chromosome counts of C. cordifolia were not what they seemed. Due to four separate chromosome translocations, the ancestral tetraploidy of C. cordifolia has been reduced to (pseudo)triploidy in this species:

…the pseudotriploid genome of C. cordifolia originated through diploidization of a primary tetraploid ancestral genome. Hence, C. cordifolia, while being a functionally diploid species, arose from a tetraploid genome. The extant genome of C. cordifolia originated from its tetraploid progenitor through descending dysploidy, whereby the origin of four translocation (“fusion”) chromosomes reduced the original number of linkage groups from 16 to 12.

The authors justifiably conclude that chromosome counts can be misleading when interpreting the evolutionary histories of polyploid species, especially when “diploidization” doesn’t result in a diploid at all.
 
Cited
Lien, S., Koop, B. F., Sandve, S. R., Miller, J. R., Kent, M. P., Nome, T., … & Grammes, F. (2016). The Atlantic salmon genome provides insights into rediploidization. Nature. doi:10.1038/nature17164
Mandáková, T., Gloss, A. D., Whiteman, N. K., & Lysak, M. A. (2016). How diploidization turned a tetraploid into a pseudotriploid. American Journal of Botany. doi:10.3732/ajb.1500452

Posted in evolution, genomics, quantitative genetics, speciation