Species distribution models in R

Our example dataset, from Godsoe et al. (2010).

Our example dataset, from Godsoe et al. (2010).

Update, 20 August 2013: Many readers have requested a copy of the Joshua tree data set used as an example in this post, and I’ve finally secured permission from the coauthors of the original study to post it to Dryad. You can download it right here. Happy coding!

One of our most consistently popular posts of the past few months has been Kim Gilbert’s introduction to using geographic data to make maps in R. But once you’ve made a nice map of your collection sites, you’ve only just started to tap the possibilities of spatial data in R. With a suite of packages anchored by dismo, you can use R and open-sourced climate data to determine the environmental conditions your favorite species requires—by building a species distribution model.

Species distribution models (SDMs) are handy any time you want to extrapolate where a species might be based on where you know it actually is. Maybe you’re trying to figure out where would be fruitful to do more sampling; maybe you want to know where your favorite critters probably lived back during the last ice age; maybe you want to know what regions will be suitable for your favorite critters after another century of global climate change. The basic idea is,

  1. Take a list of locations where you know you can find a species and identify the climate (or other environmental conditions) at those locations;
  2. Build a statistical model (using one or more of several available methods) that differentiates the climate (or other conditions) at the points where your species is found from other points where your species isn’t found; and
  3. Apply the model to climate (or other conditions) from some other time or place to estimate a probability that your species would be happy there.

This post provides a little demonstration of what you can do given a reasonably good set of collection sites and a no-longer-cutting-edge laptop. But caveat lector: actually building SDMs for publication-grade analysis requires a lot more work that what I’m presenting here. If you like the possibilities, you should start by reading the “vingnette” documents “Species distribution modeling in R” [pdf] (by dismo developers Roger Hijmans and Jane Elith) and “Boosted regression trees for ecological modeling” [PDF] (by Elith and John Leathwick). Those provide lots of detail and further reading lists, including a relatively recent review by Hijmans and Elith. They’re also the starting point for the demonstrations below.

First, you’ll want some distribution data. For this demonstration, I’ll use coordinates from a 2009 SDM study of Joshua trees by Godsoe et al. (fullest possible disclosure: I’m part of et al.). These are more than 5,000 locations where Joshua trees have been spotted—which is relatively easy, since they’re the tallest things that grow in most of the Mojave Desert—compiled from public databases, then ground-truthed and supplemented by the coauthors. All of those points are plotted in the map up at the beginning of the post; here’s the code that will set that up:

# packages for mapping, and the data for, e.g., state borders
require(maptools)
require(maps)
require(mapdata)
require(dismo)# dismo has the SDM analyses we"ll need
# load the table of latitude and longitude coordinates
locs = read.csv(file="JoTrPresence02202008.txt", header=T, sep="\t")
# then plot these points to check them ...
data(stateMapEnv) # load the database with the U.S. state borders
# notice we're limiting the extent of the map to focus on the Mojave Desert region
plot(c(-119, -113), c(33.5, 38), mar=par("mar"), xlab="longitude", ylab="latitude", xaxt="n", yaxt="n", type="n", main="Joshua tree presence data")
rect(par("usr")[1],par("usr")[3],par("usr")[2],par("usr")[4], col="lightblue")
map("state", xlim=c(-119, -113), ylim=c(33.5, 38), fill=T, col="cornsilk", add=T)
# add some nice state labels ...
text(x=-117.5, y=35.5, "California", col="cornsilk3", cex=3)
text(x=-116, y=37.5, "Nevada", col="cornsilk3", cex=3)
text(x=-113, y=34.5, "Arizona", col="cornsilk3", cex=3)
text(x=-113, y=37.75, "Utah", col="cornsilk3", cex=3)
# plot the points
points(locs$longitude, locs$latitude, col="darkolivegreen4", pch=20, cex=0.5)
# add some axes
axis(1,las=1)
axis(2,las=1)
# and add the box around the map
box()

The first thing you might notice is that they’re not very evenly distributed. Some of that is because Joshua trees aren’t very evenly distributed, but some of this is sampling effort isn’t very evenly distributed. To compensate for that, we can divide the map up into a grid and randomly draw a single point from each grid square. I’m using grid squares of 0.05 degrees of latitude x 0.05 degrees of longitude, which are equivalent to somewhat less than 5 kilometers square at these latitudes. (This is rather crude at this scale; over much larger areas, an accurate version of this calculation gets complicated, because the Earth, incoveniently, is not flat.)

# create sequences of latitude and longitude values to define the grid
longrid = seq(-119,-113,0.05)
latgrid = seq(33.5,38,0.05)
# identify points within each grid cell, draw one at random
subs = c()
for(i in 1:(length(longrid)-1)){
  for(j in 1:(length(latgrid)-1)){
    gridsq = subset(locs, latitude > latgrid[j] & latitude < latgrid[j+1] & longitude > longrid[i] & longitude < longrid[i+1])     if(dim(gridsq)[1]>0){
      subs = rbind(subs, gridsq[sample(1:dim(gridsq)[1],1 ), ])
    }
  }
}
dim(subs) # confirm that you have a smaller dataset than you started with

And, when you plot the new, subsampled dataset on top of the original, you should see a more even distribution.

Subsampled points in lighter green, superimposed on the original points in darker green.

Subsampled points in lighter green, superimposed on the original points in darker green.

Now you have your location points. But you also need some points that define regions where Joshua trees aren’t found. This is a bit tricky, philosophically speaking. The best you can do is hike the Mojave Desert to identify points where a Joshua tree isn’t growing—but how do you know it wouldn’t grow at any given point, given the chance? Conversely, you could sample points where you know Joshua trees will never grow—like, say, the Canadian Rockies—but that wouldn’t make a very informative model. One option is to define a “background” region to sample at random, which (we hope) captures environmental conditions that Joshua trees could disperse to, but hasn’t.

To do this, we’ll define circular regions, each with a radius of 50km, centered on each point in our presence list, then draw random points that must fall within those circles. Effectively, this draws random points that must be within 50 of a point where Joshua trees have been spotted. These will be in regions that have reasonably similar environments to the points where we’ve identified Joshua trees.

# define circles with a radius of 50 km around the subsampled points
x = circles(subs[,c("longitude","latitude")], d=50000, lonlat=T)
# draw random points that must fall within the circles in object x
bg = spsample(x@polygons, 1000, type='random', iter=1000)

If we plot the circles (dark areas) and the random “pseudo-absence” points (orange circles) in bg on our map, here’s what they look like:

Our "pseudo-absences," drawn within 50 km of at least one Joshua tree.

Our “pseudo-absences,” drawn within 50 km of at least one Joshua tree.

We have presence points and absence points. Now we need to know what the climate is like at those points. For this, we’re ready for climate data. Conveniently, the raster package provides functions to download the Bioclim open-source climate dataset, 19 variables derived from the WORLDCLIM database of temperature and rainfall measurements taken worldwide from 1950 to 2000. Bioclim is available at several different spatial resolutions; this demostration is for the 2.5 arc-minute resolution dataset, which contains worldwide data in a single file. At the highest resolution, you can only download it in 30-arc-second-square “tiles.” Even at this lower resolution, the file is hundreds of megabytes; we’re going to need a fair bit of memory.

require(raster) # package to handle raster-formatted spatial data
BClim = getData("worldclim", var="bio", res=2.5, path="data/")

Note also the path option in the getData() function; this lets us specify a location to place the downloaded data, and a place that R will check before it downloads again, in a future session. If the file is unwieldy, you might want to cut it down by defining the geographic “extent” you’re interested in, and using crop() to, er, crop the data file. The extent is defined as (1) the westernmost longitude, (2) the easternmost longitude, (3) the southernmost latitude, and (4) the northermost latitude:

YbrevRange = extent(-119.25,-112.75,33.25,38.25) # define the extent

Then crop the Bioclim data, and save it somewhere to use again:

BClim = crop(BClim, YbrevRange)
writeRaster(BClim, filename="data/YbrevBC_2.5.grd", overwrite=T)

To re-load that cropped dataset, you’d use

BClim = brick("data/YbrevBC_2.5.grd")

Once we’ve got a managable dataset stored in the variable BClim, we can plot it quite easily:

# this format plots the first (of 19) variables stored in BClim; change the 1 to 2-19 for the others
plot(BClim, 1, cex=0.5, legend=T, mar=par("mar"), xaxt="n", yaxt="n", main="Annual mean temperature (ºC x 10)")
map("state", xlim=c(-119, -113), ylim=c(33.5, 38), fill=F, col="cornsilk", add=T)
# state names
text(x=-117.5, y=35.5, "California", col=rgb(1,1,1,0.4), cex=3)
text(x=-116, y=37.5, "Nevada", col=rgb(1,1,1,0.4), cex=3)
text(x=-113, y=34.5, "Arizona", col=rgb(1,1,1,0.4), cex=3)
text(x=-113, y=37.75, "Utah", col=rgb(1,1,1,0.4), cex=3)
# plot the presence points
points(pts, pch=20, cex=0.5, col="darkgreen")
# and the pseudo-absence points
points(bg, cex=0.5, col="darkorange3")
# add axes
axis(1,las=1)
axis(2,las=1)
# restore the box around the map
box()

The first Bioclim variable is annual mean temperature, in degrees Centigrade times 10. Which gives us something like this:

Annual mean temperature across the Mojave Desert.

Annual mean temperature across the Mojave Desert.

To actually extract the values for each of the 19 Bioclim variables, we just use the extract() function:

# pulling bioclim values
Ybrev_bc = extract(BClim, subs[,c("longitude","latitude")]) # for the subsampled presence points
bg_bc = extract(BClim, bg) # for the pseudo-absence points

You’ll probably want to build some useful dataframes, with two columns of coordinates followed by the 19 bioclim variables. First, for the presence points:

Ybrev_bc = data.frame(lon=subs$longitude, lat=subs$latitude, Ybrev_bc)

And then for the pseudo-absences:

bgpoints = bg@coords
colnames(bgpoints) = c("lon","lat")
bg_bc = data.frame(cbind(bgpoints,bg_bc))
length(which(is.na(bg_bc$bio1))) # double-check for missing data
bg_bc = bg_bc[!is.na(bg_bc$bio1), ] # and pull out the missing lines

To estimate the effectiveness of a predictive model, it’s common practice to set aside part of the dataset to validate the model. We’ll estimate the model from the total dataset minus the set-aside, then check the model’s ability to correctly identify the points in the set-aside as presences or absences. If we then re-divide the data, then re-estimate the model, then re-validate, we can begin to get a sense for how effective a given method is at estimating the SDM. For this purpose, the function kfold() creates a vector of values you can use to break your data into k groups of equal size, each of which can be used as the set-aside for testing over several iterations of this validation cycle.

group_p = kfold(Ybrev_bc, 5) # vector of group assignments splitting the Ybrev_bc into 5 groups
gropu_a = kfold(bg_bc, 5) # ditto for bg_bc

Now we have the data we can (finally!) use to build a species distribution model. One common method for this is the algorithm implemented in Maxent, which uses a “maximum entropy” method to identify differences between conditions at presence points and conditions at absence points. The dismo package includes functions that call and manipulate Maxent, but first you need to install the javascript maxent.jar in an appropriate location. When you call Maxent from within R for the first time, if this file isn’t in place, R will complain that it’s not there, and tell you where it should be. On my MacBook, the folder is /Library/Frameworks/R.framework/Versions/3.0/Resources/ library/dismo/java/.

Then, we’re ready to go. Since we defined 5 groups, let’s pick a number between 1 at 5 to use for our validation set-aside:

test = 3

Then we use test and the kfold groupings to divide the presence and absence points:

train_p = Ybrev_bc[group_p!=test, c("lon","lat")]
train_a = bg_bc[group_a!=test, c("lon","lat")]
test_p = Ybrev_bc[group_p==test, c("lon","lat")]
test_a = bg_bc[group_a==test, c("lon","lat")]

Now, estimate a maxent model using the “training” points and the Bioclim data. This may take a few moments to run:

me = maxent(BClim, p=train_p, a=train_a)

To validate the model, we use the appropriately named function.

e = evaluate(test_p, test_a, me, BClim)
e

Among the stats you’ll see when you call the variable e is the AUC, or area under curve, statistic. This is a measure of how good the model me is at identifying a given geographic point as either a presence or an absence; when AUC = 0.5, the model is no better than a coin flip. For the data and I’ve been using up to now, the AUC is about 0.842, which isn’t terrible. If I were serious about developing a publishable SDM, I’d now repeat the model estimation procedure with different values for test, to use different portions of the total data for validation; the range of AUC scores I got would tell me, in general, how well the default Maxent procedure works, and I could tweak the procedure I use for selecting a set of pseudo-absence points, or the range of Bioclim variables I use to try and improve the average AUC.

In the meantime, to visualize the predictions from our model, we can estimate its predicted probability of Joshua tree presence for every cell in our Bioclim data set, then plot those probabilities:

pred_me = predict(me, BClim) # generate the predictions
# make a nice plot
plot(pred_me, 1, cex=0.5, legend=T, mar=par("mar"), xaxt="n", yaxt="n", main="Predicted presence of Joshua trees")
map("state", xlim=c(-119, -113), ylim=c(33.5, 38), fill=F, col="cornsilk", add=T)
# state names
text(x=-117.5, y=35.5, "California", col=rgb(1,1,1,0.6), cex=3)
text(x=-116, y=37.5, "Nevada", col=rgb(1,1,1,0.6), cex=3)
text(x=-113, y=34.5, "Arizona", col=rgb(1,1,1,0.6), cex=3)
text(x=-113, y=37.75, "Utah", col=rgb(1,1,1,0.6), cex=3)
# presence points
points(pts, pch=20, cex=0.5, col="darkgreen")
# pseud-absence points
points(bg, cex=0.5, col="darkorange3")
# add axes
axis(1,las=1)
axis(2,las=1)
# restore the box around the map
box()

And here’s what that code produces:

Probability of Joshua trees, given our Maxent model and the Bioclim data.

Probability of Joshua trees, given our Maxent model and the Bioclim data.

With our presence and pseudo-absence points plotted over it, you can see there are some places where the model is good at differentiating between them, and others where it’s not so good.

That puts R through most of its SDM-estimation paces. Again, this is a starting point at best! Before you get too far with this, you’ll want to look into alternative estimation methods, such as boosted regression trees (Godsoe et al. used both Maxent and BRTs), and think very carefully about how you select presence and absence points, what environmental data you use, and how you handle them all.

Good luck!

References

Barve, N., Barve, V., Jiménez-Valverde, A., Lira-Noriega, A., Maher, S.P., Peterson, a. T., et al. 2011. The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological Modelling 222: 1810–1819. doi: 10.1016/j.ecolmodel.2011.02.011.

Elith, J. & Leathwick, J.R. 2009. Species distribution models: Ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics 40: 677–697. doi: 10.1146/annurev.ecolsys.110308.120159

Godsoe, W. 2010. I can’t define the niche but I know it when I see it: a formal link between statistical theory and the ecological niche. Oikos 119: 53–60. doi: 10.1111/j.1600-0706.2009.17630.x.

Godsoe, W.K.W., Smith, C.I., Strand, E., Yoder, J.B., Pellmyr, O. & Esque, T.C. 2009. Divergence in an obligate mutualism is not explained by divergent climatic factors. New Phytologist 183: 589–599. doi: 10.1111/j.1469-8137.2009.02942.x

RedditDiggMendeleyPocketShare and Enjoy

About Jeremy Yoder

Jeremy Yoder is a postdoctoral associate in the Department of Plant Biology at the University of Minnesota. He also blogs at Denim and Tweed and Nothing in Biology Makes Sense!, and tweets under the handle @jbyoder.
This entry was posted in howto, R, software and tagged , . Bookmark the permalink.
  • http://www.facebook.com/tim.vines.71 Tim Vines

    There’s a thought provoking/nerdy discussion of whether MAXENT is really just a GLM here: http://methodsblog.wordpress.com/2013/02/20/some-big-news-about-maxent/

  • Paul

    Is the JoTrPresence txt file available (or some subset of it) for download? I couldn’t find it and would help me work through the example.

    Thanks,

    • http://www.jeremybyoder.com/ jeremyyoder

      Sorry—it isn’t posted online. Let me see what I can do about that.

    • http://www.denimandtweed.com/ Jeremy Yoder

      It’s not available online, I’m afraid—but let me see what I can do about that.

      • Marcelo

        Can you to available the dataset? Because I’m a beginner in R, and will help me a lot!
        Thanks and congratulations

        • http://www.denimandtweed.com/ Jeremy Yoder

          As noted in the update at the beginning of the post, the dataset has been posted to Dryad.

    • Jimmy

      Agreed; a file of even a subset or random points would make it possible to follow along and easier to learn. Thanks for posting this!

  • Pingback: Will climate change be more relentless than evolution? | The Molecular Ecologist()

  • Pingback: By popular request … | The Molecular Ecologist()

  • Pingback: Spatial and movement analyses in R: links | Marco Plebani()

  • Pingback: Crowd-sourcing natural history | The Molecular Ecologist()

  • Ali Tahir

    Hi,
    First of thank you for writing this tutorial, it was really nice to follow through all the steps. Actually I want to Cytotype/Genotype distribution modelling for A Brassicaceae species I am work on. I have few questions, if you could help me:

    # The main methods that are currently used to generate pseudo-absence points are:

    1) randomly generated pseudo-absence locations from background data

    2) pseudo-absence locations generated within a delimited geographical distance from recorded presence points;

    3) pseudo-absence locations selected in areas that are environmentally dissimilar from presence points.

    In your worked example you choose the 2 method to generate pseudo-absences, but if I want to go by third method could suggest me how can I tweak the code, I tried few things but could not make it work.

    # to produce pseudo-absence points that are spatially and ecologically balanced. #
    ##################################################################
    x = circles(subs[,c("longitude","latitude")], d=50000, lonlat=T)
    bg = spsample(x@polygons, 1000, type=’random’, iter=1000)
    points(bg,col=”orange”,pch=20,cex=0.2)
    ###################################################################

    I first extracted climatic variables, defined the extent and cropped variables and later stacked them to put in vector “BClim”, then I tried this command to produce pseudo absences within set range limits for all the points where climate is dissimilar to the presence points:

    ext=extent(-10,80,20,61)
    bg=randomPoints(BClim, n=1000,ext=ext,extf=1.00)

    I get the absences, and could plot them on map with presences but get an error when
    I tries to extract climatic variables and make data frame……….. Could you please suggest some thing to get around this issue. Thank you.

    Best regards,

    Ali

    • http://www.denimandtweed.com/ Jeremy Yoder

      Well, of course I don’t know exactly what you want to do with your species distribution model, but I would really recommend against the kind of background sampling you’re describing. For most applications, the whole point of a SDM is to identify the differences between where your species of interest occurs and where it doesn’t. So generating “absence” points by pre-selecting environments that look to you to be different from your presence data would makes the resulting SDM pretty uninformative.

      That said, it’s also not clear from the code you’ve posted why you’d have an issue with extracting climate variables for the absence points you chose. (What code did you use for the actual extraction? What error did you get?) The example code in my post should work for any combination of (x, y) coordinates and raster data, provided that the coordinates given fall within the extent of the raster.

  • Pingback: Scanning the genome for local adaptation | The Molecular Ecologist()

  • Pingback: This boring-looking grass can occupy an extra 10,000 square miles, thanks to a helpful fungus « Nothing in Biology Makes Sense!()