Geographical Heat Maps in R

I go crazy for fancy data visualizations in R, and a figure in a recent publication had me wondering whether there is an easy way to plot density distributions (or, as in their case, a distribution of f4 statistics, which are often used to estimate genomic admixture) as heat maps overlaid on geographical maps. As it turns out, it’s a breeze to make these plots in R using the ggmap and ggplot2 packages.

Denisovan ancestry in modern human populations, measured as an f4 statistic distribution across the world. Image courtesy: Figure 2 from Qin and Stoneking (2015) http://dx.doi.org/10.1093/molbev/msv141

So I simulated some geographical data and obtained posterior density estimates for effective population sizes that I had previously used as an example in another tutorial. Importantly, you will have to edit (1) the density data, here represented by “hpars”, and (2) the positions table. Here I have simulated some data around coordinates for Philadelphia, PA, which you should replace with the real GPS coordinates for your data of interest from (1).

library(ggmap)   # Load libraries
library(ggplot2)
# Read in the density data
hpars <- read.table("https://sites.google.com/site/arunsethuraman1/teaching/hpars.dat?revision=1")
# Simulate some geographical coordinates; switch out for your data that has real GPS coords
n <- 10000  # same number of draws for lon and lat, so every point is an independent pair
positions <- data.frame(lon = rnorm(n, mean = -75.1803458, sd = 0.05),
                        lat = rnorm(n, mean = 39.98352197, sd = 0.05))
# Get the map from Google Maps (recent ggmap versions need an API key, set via register_google())
map <- get_map(location = c(lon = -75.1803458, lat = 39.98352197),
               zoom = 11, maptype = "roadmap", color = "bw")
# Plot the density contours and the filled density polygons on top of the map
ggmap(map, extent = "device") +
  geom_density2d(data = positions, aes(x = lon, y = lat), size = 0.3) +
  stat_density2d(data = positions,
                 aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), size = 0.01,
                 bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = "none")

And voila! Do give it a try with your data, and leave your comments below.
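If you would like to inspect the density surface itself before putting it on a map, here is a minimal sketch. It assumes simulated Philadelphia coordinates like the ones above and calls MASS::kde2d(), the estimator that stat_density2d() relies on. The grid size of 100 and the names dens, peak_lon and peak_lat are illustrative choices, not part of the original recipe.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

set.seed(1)
n <- 10000
# Simulated coordinates around Philadelphia, PA (replace with real GPS data)
positions <- data.frame(lon = rnorm(n, mean = -75.1803458, sd = 0.05),
                        lat = rnorm(n, mean = 39.98352197, sd = 0.05))

# Estimate the 2D kernel density on a 100 x 100 grid;
# dens$z[i, j] is the density at longitude dens$x[i] and latitude dens$y[j]
dens <- kde2d(positions$lon, positions$lat, n = 100)

# Locate the grid cell with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_lon <- dens$x[peak[["row"]]]
peak_lat <- dens$y[peak[["col"]]]
cat("Density peak near lon", round(peak_lon, 3), "lat", round(peak_lat, 3), "\n")
```

The contour levels drawn on the map are slices through this same surface, so a peak far from where you expect it is a hint that the coordinates were read in the wrong order or units.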

Geographical heatmap produced using an effective population size density distribution, overlaid on a geographical map of Philadelphia.
Posted in bioinformatics, howto, R, software

3 writing mistakes I make

Last week my university hosted Dr. Joshua Schimel, microbiologist and author of Writing Science, who led a half-day writing workshop for graduate students. To be honest, I didn’t expect my writing to improve after a 4-hour workshop, but I learned a lot of great tips and tricks. Talking with the other graduate students, I realized many of us were making the same 3 common style mistakes:
1) Nominalization
Verb nominalization is turning a verb into a noun. Take the following examples:

“We conducted an analysis using the morphological data…”

versus

“We analyzed the morphological data…”

Continue reading

Posted in career

A different perspective on genetic architecture

As an ecological geneticist, I’m constantly reminded how much we don’t understand about the genetic nature of adaptive variation. Sure, we have lots of examples of genes/pathways/regions that seem to be responsible for adaptation, but we don’t really know if these are representative of typical evolutionary processes. Are large or small effect mutations important? Does selection act on gene regulation or protein coding regions? Moreover, our expectations have changed through time with the development of theoretical models and improved empirical data collection. In his recent manuscript in Evolution, Remington offers a solution to this problem by putting forth a model of allelic evolution that takes a molecular perspective. Before digging into Remington’s idea, a little background is necessary.
The infinitesimal model argues that quantitative traits are controlled by an infinite number of loci of small effect. This model was extended to Fisher’s geometric model, which provides an evolutionary justification for why the infinitesimal model works. Basically, the geometric model argues that only alleles of small effect have any real chance of being beneficial.
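To see why the geometric model favors small effects, here is a minimal simulation sketch in R. It assumes a phenotype sitting at distance z from the optimum in a 10-trait space, and mutations of fixed size r pointing in a uniformly random direction. The function name p_beneficial and all parameter values are illustrative assumptions, not taken from Remington’s manuscript. The last column is Fisher’s classic approximation, 1 - Φ(r√n / 2z).

```r
set.seed(1)

# Fraction of random mutations of size r that move the phenotype closer
# to the optimum (placed at the origin) in an n_traits-dimensional space
p_beneficial <- function(r, n_traits = 10, z = 1, n_mut = 20000) {
  start <- c(z, rep(0, n_traits - 1))            # phenotype at distance z from the optimum
  dirs <- matrix(rnorm(n_mut * n_traits), ncol = n_traits)
  dirs <- dirs / sqrt(rowSums(dirs^2))           # one random unit vector per row
  mutants <- sweep(dirs * r, 2, start, "+")      # mutant phenotypes
  mean(sqrt(rowSums(mutants^2)) < z)             # closer to the optimum than before?
}

n_traits <- 10
sizes <- c(0.01, 0.1, 0.5, 1, 2)
sim <- sapply(sizes, p_beneficial, n_traits = n_traits)
fisher <- 1 - pnorm(sizes * sqrt(n_traits) / 2)  # Fisher's approximation with z = 1

print(data.frame(size = sizes,
                 simulated = round(sim, 3),
                 fisher_approx = round(fisher, 3)))
```

Very small mutations are beneficial close to half of the time, and the chance drops steadily as the effect size grows; a mutation larger than twice the distance to the optimum can never be beneficial in this setup.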
Continue reading

Posted in association genetics, evolution, genomics, mutation, quantitative genetics, theory

The 2016 Workshop on Genomics summary

I recently had the pleasure of spending two and a half weeks in the beautiful medieval town of Český Krumlov, Czech Republic. The occasion was the popular Workshop on Genomics, where I was one of the TAs making sure everything ran smoothly.
I thought I’d take the opportunity to summarize some of the key events that took place during the workshop, since I’ve gathered that there’s some interest in reading about the topics that were discussed. For your convenience, I’ve also linked to the PDF lecture slides of each speaker, so you can take a look at their presentations, and to some of the exercises that we ran.

View over Český Krumlov and its castle


Continue reading

Posted in bioinformatics, genomics, next generation sequencing

Help us build an independent future for The Molecular Ecologist

Support an independent future for The Molecular Ecologist by donating to our campaign.


Back at the beginning of the year, I laid out a plan for community support of The Molecular Ecologist — today, after two busy months, I’m excited to announce that we’re finally launching the first stage of that plan, a crowdfunding campaign through Indiegogo.
Since 2010, The Molecular Ecologist has been an online forum for readers of and contributors to the journal Molecular Ecology, rounding up “news and commentary for ecology, evolution, and everything in between.” Now we’d like to build on that history to secure independent support, reach a broader audience, and provide resources for early-career scientists studying evolution, ecology, and genetics. What are we asking for, and what do we want to do with it? I’m glad you asked.
Continue reading

Posted in community, funding, housekeeping

Quick and dirty tree building in R

A phylogeny of ten mammal genera, estimated with maximum likelihood methods implemented in R, with nodes showing bootstrap support values from 100 replicates.

One of the major obstacles to turning your sequence data into phylogenetic trees is choosing (and learning) a tree-building program. Compounding this problem is the fact that most researchers will want to perform numerous, complementary analyses, each of which may require finding, downloading, compiling, learning, and running different programs that require different data formats and produce incompatible output files.

This can be, to put it mildly, a headache. Sometimes the headache is unavoidable, and the nature of our datasets is a limiting factor. For example, modern high-throughput sequencing methods produce data sets that are so huge and computationally intensive to analyze that there’s little choice but to use programs optimized for handling these data (e.g. ExaML).

But what if you have more modest goals, and are interested in inferring single gene trees or estimating phylogenies using small concatenated alignments? The programming language R provides numerous packages to run basic phylogenetic analyses in a streamlined, consistent pipeline, and can be a good choice for getting a feel for your data as it allows for all the advantages of its host platform.
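As a taste of how little code a first-pass tree can take, here is a minimal sketch that uses only base R. It is not the maximum likelihood workflow behind the figure above: it builds a UPGMA tree with hclust() from simple p-distances, and the four short sequences are invented purely for illustration.

```r
# Toy aligned sequences (made up for this example, not real data)
seqs <- c(human = "ACGTACGTAC",
          chimp = "ACGTACGTAT",
          mouse = "ACGAACTTAT",
          rat   = "ACGAACTTGT")

# Split each sequence into single characters: one row per taxon
chars <- do.call(rbind, strsplit(seqs, ""))

# Pairwise p-distance: the proportion of aligned sites that differ
n_taxa <- nrow(chars)
d <- matrix(0, n_taxa, n_taxa,
            dimnames = list(rownames(chars), rownames(chars)))
for (i in seq_len(n_taxa)) {
  for (j in seq_len(n_taxa)) {
    d[i, j] <- mean(chars[i, ] != chars[j, ])
  }
}

# UPGMA is average-linkage hierarchical clustering on the distance matrix
tree <- hclust(as.dist(d), method = "average")

# Cut the tree into two groups to see which taxa cluster together
groups <- cutree(tree, k = 2)
print(groups)
```

Calling plot(tree) draws the dendrogram. For real alignments you would want a proper substitution model and support values, which is where dedicated phylogenetics packages come in.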

Continue reading
Posted in howto, methods, phylogenetics, R, software

The Neanderthal admixture plot thickens…

Previous studies of archaic admixture from Altai Neanderthals and Denisovans into modern humans outside of Africa have put forth several lines of evidence for gene flow from Neanderthals into common ancestors of Eurasian populations, from Denisovans into ancestors of modern Oceanic and Asian populations, as well as from an unknown ancestral population into the Denisovan lineage. However, evidence of gene flow from modern humans into our extinct near relatives has so far been elusive, and this is what Kuhlwilm et al. (2016) set out to find. By estimating regions of high and low divergence across 100 kb windows from Altai Neanderthal, Denisovan, and more than 500 African genomes, they recover signals of archaic gene flow from an unknown ancestor into Denisovan genomes, and of gene flow from modern humans into Altai Neanderthals. They then quantify this purported gene flow with demographic analyses in G-PhoCS (Gronau et al. 2011), using five different population trees with Denisovans, Altai Neanderthals and two modern human populations.

"Down by the Rail Yard" by David Adams Image courtesy: Flickr Commons: https://flic.kr/p/996uLp


These analyses recover previous estimates of gene flow from (1) Altai Neanderthals into modern humans out of Africa, and (2) an unknown archaic hominin into the common ancestor of Denisovans, and importantly they also find (3) gene flow from modern humans into the ancestors of the Altai Neanderthals. This gene flow appears to stem from a separate lineage that split from the common ancestor of all modern humans in Africa, or from an ancient African lineage, a finding that they also confirm using simulations to recapitulate observed levels of divergence. Kuhlwilm et al. (2016) also estimate the age of shared haplotypes using ARGWeaver (Rasmussen et al. 2014). They find longer (and thus younger) African haplotypes in the Altai Neanderthal than in Denisovans, coalescing around 100,000-230,000 years before present, which indicates that these haplotypes were present long before Neanderthals interbred with humans outside of Africa.
Using newly designed probes for chromosome 21 and two newly sequenced Neanderthal genomes (from Spain and Croatia), the authors also find that the Altai Neanderthal shares more derived alleles with Africans than the Spanish and Croatian Neanderthals do. Separate analyses of gene flow using these new sequences find no support for gene flow from modern humans into the Spanish and Croatian Neanderthals, which points to gene flow into the ancestors of the Altai Neanderthal rather than into the common ancestor of all Neanderthals. Demographic analyses also indicate population growth in the Croatian Neanderthal, but smaller population sizes compared to modern humans. In the authors’ words:

Our integrated demographic analysis of multiple archaic and present-day human genomes suggests a scenario of long-term decline in the populations of Neanderthals and Denisovans, with the consistently small Altai Neanderthal population perhaps reflecting a long period of isolation in the Altai Mountains. In addition, we provide evidence for modern human introgression into the ancestors of this population of Neanderthals, and no such evidence in the European Neanderthals.

Reference:
Kuhlwilm, Martin, et al. “Ancient gene flow from early modern humans into Eastern Neanderthals.” Nature (2016). DOI: 10.1038/nature16544
Press coverage:
https://goo.gl/X1ESOu

Posted in bioinformatics, evolution, genomics, mutation, natural history, next generation sequencing, Paleogenomics, population genetics, speciation

How urbanization might affect the five-second rule

image from the wikipedia article, ‘the five second rule’


At this point, we know that microbes are everywhere and form complex communities in places ranging from oceanic hydrothermal vents to lakes, soils, and, yes, of course, all over you. It has also become apparent that our human microbiome plays a role in health, but there’s still plenty to learn about how our relationship with the microbial communities that live with, on, and around us really affects us.
Unraveling the distribution of microbes in the environment, and figuring out whether Baas Becking really had it right when he said “everything is everywhere, but the environment selects”, is a fascinating challenge that has been investigated in part using metagenomic approaches. Recently, there has been growing interest in the diverse microbial communities associated with man-made structures, so it turns out we have another type of environment to contemplate when discussing microbial biogeography.
Continue reading

Posted in community ecology, microbiology

Finding hidden structure in uneven data

If you are a population geneticist, your work might include sampling a bunch of individuals and figuring out who is related to whom. Seems simple, right? Before you can ask questions about differences or similarities between groups, you have to understand what actually constitutes a group in the first place.
A methodological stalwart of “how many groups do I have?” analyses is the program STRUCTURE, which has been cranking out these ubiquitous plots for more than fifteen years (and to the tune of >10,000 citations). As you can imagine, a program that has been so widely applied as STRUCTURE has been examined, questioned, and improved many times over.
For example, the basic application has been made faster and faster. Additionally, both simulations and empirical investigations have shown the caveats for these analyses (like avoiding close relatives, large temporal variation in sampling, isolation by distance scenarios, etc). Overall though, STRUCTURE is still being used all the time and most people seem fine with it.
Given the history of studying the effectiveness of STRUCTURE, I was surprised to see this new paper by Sebastien Puechmaille in Molecular Ecology Resources, titled “The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: sub-sampling and new estimators alleviate the problem”.
Now I’m no scholar on the history of population-clustering techniques, but are you telling me that no one ever asked if having uneven sample sizes makes a difference? Say you sample an insect species from three genetically distinct populations, but have uneven sample sizes (N = 30, 12, 12). According to Puechmaille’s simulations, you are more likely to see the populations with lower sample sizes lumped together (even if they really are very different), and the population with the large sample size may be unnecessarily split into multiple groups.
The solution to this problem ends up being (somewhat) equally simple: keep sample sizes relatively even, subsample large groups, and use a variety of estimators to help you pick the “most supported” number of groups. In fact, Puechmaille offers up a suite of new estimators (MedMeaK, MaxMeaK, MedMedK, MaxMedK) that you can add to your arsenal.
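The subsampling step is easy to sketch in R. This is a minimal example under stated assumptions: the sample table, the population labels A, B and C, and the choice to cut every population down to the size of the smallest one are all illustrative, and Puechmaille’s own subsampling scheme may differ in its details.

```r
set.seed(42)

# Hypothetical sample sheet matching the uneven design above (N = 30, 12, 12)
samples <- data.frame(id  = sprintf("ind%02d", 1:54),
                      pop = rep(c("A", "B", "C"), times = c(30, 12, 12)))

# Target size: the smallest population sample
target <- min(table(samples$pop))

# Draw `target` row indices at random, without replacement, within each population
rows_by_pop <- split(seq_len(nrow(samples)), samples$pop)
keep <- unlist(lapply(rows_by_pop, function(rows) sample(rows, target)))

# Evenly sampled data set to pass on to the clustering analysis
even <- samples[sort(keep), ]
print(table(even$pop))
```

Because the draw is random, it is worth repeating it with a few different seeds and checking that the inferred number of groups does not depend on which individuals from the large population happened to be kept.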
With that out of the way, we can get to the really important questions, like why does population structure matter at all?
 
Puechmaille, S. J. (2016). The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: sub‐sampling and new estimators alleviate the problem. Molecular Ecology Resources. DOI: 10.1111/1755-0998.12512

Posted in software

Petrous bone is the new black

I was just reading an article about skeletal reconstruction of another fascinating extinct species when my supervisor came to my office. I asked: “How about we sequence this creature’s genome?” He replied by asking where the animal had lived. As I answered “Africa”, I already knew that it would end the discussion. With “Then get a petrous bone” he left the office.
Where to drill, that is the question
With improved sequencing technologies, the field of ancient DNA (aDNA) is advancing rapidly. Things that we had once considered impossible have already happened. For now, the border of what is technically possible with aDNA has been set by sequencing the genome of a 700,000-year-old horse.
Recently, some aDNA research has focused on answering a basic question: ‘Which bones, or which part of a bone, should we sample to maximize the yield of endogenous DNA?’
Continue reading

Posted in genomics, methods, Paleogenomics