Species distribution models in R

Our example dataset, from Godsoe et al. (2010).

Our example dataset, from Godsoe et al. (2010).

Update, 20 August 2013: Many readers have requested a copy of the Joshua tree data set used as an example in this post, and I’ve finally secured permission from the coauthors of the original study to post it to Dryad. You can download it right here. Happy coding!

One of our most consistently popular posts of the past few months has been Kim Gilbert’s introduction to using geographic data to make maps in R. But once you’ve made a nice map of your collection sites, you’ve only just started to tap the possibilities of spatial data in R. With a suite of packages anchored by dismo, you can use R and open-sourced climate data to determine the environmental conditions your favorite species requires—by building a species distribution model.

Species distribution models (SDMs) are handy any time you want to extrapolate where a species might be based on where you know it actually is. Maybe you’re trying to figure out where would be fruitful to do more sampling; maybe you want to know where your favorite critters probably lived back during the last ice age; maybe you want to know what regions will be suitable for your favorite critters after another century of global climate change. The basic idea is,

  1. Take a list of locations where you know you can find a species and identify the climate (or other environmental conditions) at those locations;
  2. Build a statistical model (using one or more of several available methods) that differentiates the climate (or other conditions) at the points where your species is found from other points where your species isn’t found; and
  3. Apply the model to climate (or other conditions) from some other time or place to estimate a probability that your species would be happy there.

This post provides a little demonstration of what you can do given a reasonably good set of collection sites and a no-longer-cutting-edge laptop. But caveat lector: actually building SDMs for publication-grade analysis requires a lot more work that what I’m presenting here. If you like the possibilities, you should start by reading the “vingnette” documents “Species distribution modeling in R” [pdf] (by dismo developers Roger Hijmans and Jane Elith) and “Boosted regression trees for ecological modeling” [PDF] (by Elith and John Leathwick). Those provide lots of detail and further reading lists, including a relatively recent review by Hijmans and Elith. They’re also the starting point for the demonstrations below.

First, you’ll want some distribution data. For this demonstration, I’ll use coordinates from a 2009 SDM study of Joshua trees by Godsoe et al. (fullest possible disclosure: I’m part of et al.). These are more than 5,000 locations where Joshua trees have been spotted—which is relatively easy, since they’re the tallest things that grow in most of the Mojave Desert—compiled from public databases, then ground-truthed and supplemented by the coauthors. All of those points are plotted in the map up at the beginning of the post; here’s the code that will set that up:

# packages for mapping, and the data for, e.g., state borders
require(dismo)# dismo has the SDM analyses we"ll need
# load the table of latitude and longitude coordinates
locs = read.csv(file="JoTrPresence02202008.txt", header=T, sep="\t")
# then plot these points to check them ...
data(stateMapEnv) # load the database with the U.S. state borders
# notice we're limiting the extent of the map to focus on the Mojave Desert region
plot(c(-119, -113), c(33.5, 38), mar=par("mar"), xlab="longitude", ylab="latitude", xaxt="n", yaxt="n", type="n", main="Joshua tree presence data")
rect(par("usr")[1],par("usr")[3],par("usr")[2],par("usr")[4], col="lightblue")
map("state", xlim=c(-119, -113), ylim=c(33.5, 38), fill=T, col="cornsilk", add=T)
# add some nice state labels ...
text(x=-117.5, y=35.5, "California", col="cornsilk3", cex=3)
text(x=-116, y=37.5, "Nevada", col="cornsilk3", cex=3)
text(x=-113, y=34.5, "Arizona", col="cornsilk3", cex=3)
text(x=-113, y=37.75, "Utah", col="cornsilk3", cex=3)
# plot the points
points(locs$longitude, locs$latitude, col="darkolivegreen4", pch=20, cex=0.5)
# add some axes
# and add the box around the map

The first thing you might notice is that they’re not very evenly distributed. Some of that is because Joshua trees aren’t very evenly distributed, but some of this is sampling effort isn’t very evenly distributed. To compensate for that, we can divide the map up into a grid and randomly draw a single point from each grid square. I’m using grid squares of 0.05 degrees of latitude x 0.05 degrees of longitude, which are equivalent to somewhat less than 5 kilometers square at these latitudes. (This is rather crude at this scale; over much larger areas, an accurate version of this calculation gets complicated, because the Earth, incoveniently, is not flat.)

# create sequences of latitude and longitude values to define the grid
longrid = seq(-119,-113,0.05)
latgrid = seq(33.5,38,0.05)
# identify points within each grid cell, draw one at random
subs = c()
for(i in 1:(length(longrid)-1)){
  for(j in 1:(length(latgrid)-1)){
    gridsq = subset(locs, latitude > latgrid[j] & latitude < latgrid[j+1] & longitude > longrid[i] & longitude < longrid[i+1])     if(dim(gridsq)[1]>0){
      subs = rbind(subs, gridsq[sample(1:dim(gridsq)[1],1 ), ])
dim(subs) # confirm that you have a smaller dataset than you started with

And, when you plot the new, subsampled dataset on top of the original, you should see a more even distribution.

Subsampled points in lighter green, superimposed on the original points in darker green.

Subsampled points in lighter green, superimposed on the original points in darker green.

Now you have your location points. But you also need some points that define regions where Joshua trees aren’t found. This is a bit tricky, philosophically speaking. The best you can do is hike the Mojave Desert to identify points where a Joshua tree isn’t growing—but how do you know it wouldn’t grow at any given point, given the chance? Conversely, you could sample points where you know Joshua trees will never grow—like, say, the Canadian Rockies—but that wouldn’t make a very informative model. One option is to define a “background” region to sample at random, which (we hope) captures environmental conditions that Joshua trees could disperse to, but hasn’t.

To do this, we’ll define circular regions, each with a radius of 50km, centered on each point in our presence list, then draw random points that must fall within those circles. Effectively, this draws random points that must be within 50 of a point where Joshua trees have been spotted. These will be in regions that have reasonably similar environments to the points where we’ve identified Joshua trees.

# define circles with a radius of 50 km around the subsampled points
x = circles(subs[,c("longitude","latitude")], d=50000, lonlat=T)
# draw random points that must fall within the circles in object x
bg = spsample(x@polygons, 1000, type='random', iter=1000)

If we plot the circles (dark areas) and the random “pseudo-absence” points (orange circles) in bg on our map, here’s what they look like:

Our "pseudo-absences," drawn within 50 km of at least one Joshua tree.

Our “pseudo-absences,” drawn within 50 km of at least one Joshua tree.

We have presence points and absence points. Now we need to know what the climate is like at those points. For this, we’re ready for climate data. Conveniently, the raster package provides functions to download the Bioclim open-source climate dataset, 19 variables derived from the WORLDCLIM database of temperature and rainfall measurements taken worldwide from 1950 to 2000. Bioclim is available at several different spatial resolutions; this demostration is for the 2.5 arc-minute resolution dataset, which contains worldwide data in a single file. At the highest resolution, you can only download it in 30-arc-second-square “tiles.” Even at this lower resolution, the file is hundreds of megabytes; we’re going to need a fair bit of memory.

require(raster) # package to handle raster-formatted spatial data
BClim = getData("worldclim", var="bio", res=2.5, path="data/")

Note also the path option in the getData() function; this lets us specify a location to place the downloaded data, and a place that R will check before it downloads again, in a future session. If the file is unwieldy, you might want to cut it down by defining the geographic “extent” you’re interested in, and using crop() to, er, crop the data file. The extent is defined as (1) the westernmost longitude, (2) the easternmost longitude, (3) the southernmost latitude, and (4) the northermost latitude:

YbrevRange = extent(-119.25,-112.75,33.25,38.25) # define the extent

Then crop the Bioclim data, and save it somewhere to use again:

BClim = crop(BClim, YbrevRange)
writeRaster(BClim, filename="data/YbrevBC_2.5.grd", overwrite=T)

To re-load that cropped dataset, you’d use

BClim = brick("data/YbrevBC_2.5.grd")

Once we’ve got a managable dataset stored in the variable BClim, we can plot it quite easily:

# this format plots the first (of 19) variables stored in BClim; change the 1 to 2-19 for the others
plot(BClim, 1, cex=0.5, legend=T, mar=par("mar"), xaxt="n", yaxt="n", main="Annual mean temperature (ºC x 10)")
map("state", xlim=c(-119, -113), ylim=c(33.5, 38), fill=F, col="cornsilk", add=T)
# state names
text(x=-117.5, y=35.5, "California", col=rgb(1,1,1,0.4), cex=3)
text(x=-116, y=37.5, "Nevada", col=rgb(1,1,1,0.4), cex=3)
text(x=-113, y=34.5, "Arizona", col=rgb(1,1,1,0.4), cex=3)
text(x=-113, y=37.75, "Utah", col=rgb(1,1,1,0.4), cex=3)
# plot the presence points
points(pts, pch=20, cex=0.5, col="darkgreen")
# and the pseudo-absence points
points(bg, cex=0.5, col="darkorange3")
# add axes
# restore the box around the map

The first Bioclim variable is annual mean temperature, in degrees Centigrade times 10. Which gives us something like this:

Annual mean temperature across the Mojave Desert.

Annual mean temperature across the Mojave Desert.

To actually extract the values for each of the 19 Bioclim variables, we just use the extract() function:

# pulling bioclim values
Ybrev_bc = extract(BClim, subs[,c("longitude","latitude")]) # for the subsampled presence points
bg_bc = extract(BClim, bg) # for the pseudo-absence points

You’ll probably want to build some useful dataframes, with two columns of coordinates followed by the 19 bioclim variables. First, for the presence points:

Ybrev_bc = data.frame(lon=subs$longitude, lat=subs$latitude, Ybrev_bc)

And then for the pseudo-absences:

bgpoints = bg@coords
colnames(bgpoints) = c("lon","lat")
bg_bc = data.frame(cbind(bgpoints,bg_bc))
length(which(is.na(bg_bc$bio1))) # double-check for missing data
bg_bc = bg_bc[!is.na(bg_bc$bio1), ] # and pull out the missing lines

To estimate the effectiveness of a predictive model, it’s common practice to set aside part of the dataset to validate the model. We’ll estimate the model from the total dataset minus the set-aside, then check the model’s ability to correctly identify the points in the set-aside as presences or absences. If we then re-divide the data, then re-estimate the model, then re-validate, we can begin to get a sense for how effective a given method is at estimating the SDM. For this purpose, the function kfold() creates a vector of values you can use to break your data into k groups of equal size, each of which can be used as the set-aside for testing over several iterations of this validation cycle.

group_p = kfold(Ybrev_bc, 5) # vector of group assignments splitting the Ybrev_bc into 5 groups
gropu_a = kfold(bg_bc, 5) # ditto for bg_bc

Now we have the data we can (finally!) use to build a species distribution model. One common method for this is the algorithm implemented in Maxent, which uses a “maximum entropy” method to identify differences between conditions at presence points and conditions at absence points. The dismo package includes functions that call and manipulate Maxent, but first you need to install the javascript maxent.jar in an appropriate location. When you call Maxent from within R for the first time, if this file isn’t in place, R will complain that it’s not there, and tell you where it should be. On my MacBook, the folder is /Library/Frameworks/R.framework/Versions/3.0/Resources/ library/dismo/java/.

Then, we’re ready to go. Since we defined 5 groups, let’s pick a number between 1 at 5 to use for our validation set-aside:

test = 3

Then we use test and the kfold groupings to divide the presence and absence points:

train_p = Ybrev_bc[group_p!=test, c("lon","lat")]
train_a = bg_bc[group_a!=test, c("lon","lat")]
test_p = Ybrev_bc[group_p==test, c("lon","lat")]
test_a = bg_bc[group_a==test, c("lon","lat")]

Now, estimate a maxent model using the “training” points and the Bioclim data. This may take a few moments to run:

me = maxent(BClim, p=train_p, a=train_a)

To validate the model, we use the appropriately named function.

e = evaluate(test_p, test_a, me, BClim)

Among the stats you’ll see when you call the variable e is the AUC, or area under curve, statistic. This is a measure of how good the model me is at identifying a given geographic point as either a presence or an absence; when AUC = 0.5, the model is no better than a coin flip. For the data and I’ve been using up to now, the AUC is about 0.842, which isn’t terrible. If I were serious about developing a publishable SDM, I’d now repeat the model estimation procedure with different values for test, to use different portions of the total data for validation; the range of AUC scores I got would tell me, in general, how well the default Maxent procedure works, and I could tweak the procedure I use for selecting a set of pseudo-absence points, or the range of Bioclim variables I use to try and improve the average AUC.

In the meantime, to visualize the predictions from our model, we can estimate its predicted probability of Joshua tree presence for every cell in our Bioclim data set, then plot those probabilities:

pred_me = predict(me, BClim) # generate the predictions
# make a nice plot
plot(pred_me, 1, cex=0.5, legend=T, mar=par("mar"), xaxt="n", yaxt="n", main="Predicted presence of Joshua trees")
map("state", xlim=c(-119, -113), ylim=c(33.5, 38), fill=F, col="cornsilk", add=T)
# state names
text(x=-117.5, y=35.5, "California", col=rgb(1,1,1,0.6), cex=3)
text(x=-116, y=37.5, "Nevada", col=rgb(1,1,1,0.6), cex=3)
text(x=-113, y=34.5, "Arizona", col=rgb(1,1,1,0.6), cex=3)
text(x=-113, y=37.75, "Utah", col=rgb(1,1,1,0.6), cex=3)
# presence points
points(pts, pch=20, cex=0.5, col="darkgreen")
# pseud-absence points
points(bg, cex=0.5, col="darkorange3")
# add axes
# restore the box around the map

And here’s what that code produces:

Probability of Joshua trees, given our Maxent model and the Bioclim data.

Probability of Joshua trees, given our Maxent model and the Bioclim data.

With our presence and pseudo-absence points plotted over it, you can see there are some places where the model is good at differentiating between them, and others where it’s not so good.

That puts R through most of its SDM-estimation paces. Again, this is a starting point at best! Before you get too far with this, you’ll want to look into alternative estimation methods, such as boosted regression trees (Godsoe et al. used both Maxent and BRTs), and think very carefully about how you select presence and absence points, what environmental data you use, and how you handle them all.

Good luck!


Barve, N., Barve, V., Jiménez-Valverde, A., Lira-Noriega, A., Maher, S.P., Peterson, a. T., et al. 2011. The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological Modelling 222: 1810–1819. doi: 10.1016/j.ecolmodel.2011.02.011.

Elith, J. & Leathwick, J.R. 2009. Species distribution models: Ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics 40: 677–697. doi: 10.1146/annurev.ecolsys.110308.120159

Godsoe, W. 2010. I can’t define the niche but I know it when I see it: a formal link between statistical theory and the ecological niche. Oikos 119: 53–60. doi: 10.1111/j.1600-0706.2009.17630.x.

Godsoe, W.K.W., Smith, C.I., Strand, E., Yoder, J.B., Pellmyr, O. & Esque, T.C. 2009. Divergence in an obligate mutualism is not explained by divergent climatic factors. New Phytologist 183: 589–599. doi: 10.1111/j.1469-8137.2009.02942.x


About Jeremy Yoder

Jeremy Yoder is an Assistant Professor of Biology at California State University, Northridge. He also blogs at Denim and Tweed, and tweets under the handle @jbyoder.

This entry was posted in howto, R, software and tagged , . Bookmark the permalink.
  • There’s a thought provoking/nerdy discussion of whether MAXENT is really just a GLM here: http://methodsblog.wordpress.com/2013/02/20/some-big-news-about-maxent/

  • Paul

    Is the JoTrPresence txt file available (or some subset of it) for download? I couldn’t find it and would help me work through the example.


    • Sorry—it isn’t posted online. Let me see what I can do about that.

    • It’s not available online, I’m afraid—but let me see what I can do about that.

      • Marcelo

        Can you to available the dataset? Because I’m a beginner in R, and will help me a lot!
        Thanks and congratulations

        • As noted in the update at the beginning of the post, the dataset has been posted to Dryad.

          • Hilde Zenil

            I don’t think the link is working

          • Both the link in my comment and the one in the notice at the top of the post work for me. I’m not sure what to tell you.

          • Hilde Zenil

            Actually it was working today!!

    • Jimmy

      Agreed; a file of even a subset or random points would make it possible to follow along and easier to learn. Thanks for posting this!

  • Pingback: Will climate change be more relentless than evolution? | The Molecular Ecologist()

  • Pingback: By popular request … | The Molecular Ecologist()

  • Pingback: Spatial and movement analyses in R: links | Marco Plebani()

  • Pingback: Crowd-sourcing natural history | The Molecular Ecologist()

  • Ali Tahir

    First of thank you for writing this tutorial, it was really nice to follow through all the steps. Actually I want to Cytotype/Genotype distribution modelling for A Brassicaceae species I am work on. I have few questions, if you could help me:

    # The main methods that are currently used to generate pseudo-absence points are:

    1) randomly generated pseudo-absence locations from background data

    2) pseudo-absence locations generated within a delimited geographical distance from recorded presence points;

    3) pseudo-absence locations selected in areas that are environmentally dissimilar from presence points.

    In your worked example you choose the 2 method to generate pseudo-absences, but if I want to go by third method could suggest me how can I tweak the code, I tried few things but could not make it work.

    # to produce pseudo-absence points that are spatially and ecologically balanced. #
    x = circles(subs[,c(“longitude”,”latitude”)], d=50000, lonlat=T)
    bg = spsample(x@polygons, 1000, type=’random’, iter=1000)

    I first extracted climatic variables, defined the extent and cropped variables and later stacked them to put in vector “BClim”, then I tried this command to produce pseudo absences within set range limits for all the points where climate is dissimilar to the presence points:

    bg=randomPoints(BClim, n=1000,ext=ext,extf=1.00)

    I get the absences, and could plot them on map with presences but get an error when
    I tries to extract climatic variables and make data frame……….. Could you please suggest some thing to get around this issue. Thank you.

    Best regards,


    • Well, of course I don’t know exactly what you want to do with your species distribution model, but I would really recommend against the kind of background sampling you’re describing. For most applications, the whole point of a SDM is to identify the differences between where your species of interest occurs and where it doesn’t. So generating “absence” points by pre-selecting environments that look to you to be different from your presence data would makes the resulting SDM pretty uninformative.

      That said, it’s also not clear from the code you’ve posted why you’d have an issue with extracting climate variables for the absence points you chose. (What code did you use for the actual extraction? What error did you get?) The example code in my post should work for any combination of (x, y) coordinates and raster data, provided that the coordinates given fall within the extent of the raster.

    • Hamed Sangoony

      Dear Ali

      Your problem is a result of your method in creating random points. You used an extent (which practically is a rectangle) and create your random points on it. But your raster files will not (most likely) fill the whole extent, so, when the code is trying to extract values from rasters to points, there are many points that are on “no data” parts of Rasters.

      I hope this could be useful for you, although it passed more than a year from your question!!


      Hamed Sangoony

  • Pingback: Scanning the genome for local adaptation | The Molecular Ecologist()

  • Pingback: This boring-looking grass can occupy an extra 10,000 square miles, thanks to a helpful fungus « Nothing in Biology Makes Sense!()

  • Hilde Zenil

    I have the same question that Paul (below) has: Is the JoTrPresence txt file available (or some subset of it) for download? I couldn’t find it and would help me work through the example.

    • Victor Hugo Gonzalez

      You can download the data by right-clicking the link and saving the file as *.txt, algo you can open the HTML and copy past the data into a *txt file

  • Francisco

    This post made it very simple, clear and practical.
    Thanks a lot!!!!

  • Victor Hugo Gonzalez

    In want to thank you for your great contribution!, just a question: how can i draw the black overlapping circles of the third map? I Cannot get the same effect. Thank you!

  • Pingback: Making Maps in R, volume 2: ggplots |()

  • Francisco Henao Diaz

    Hi Jeremy

    I have a couple of questions:

    1. One should use environmental variables not correlated between them to eliminate redundancy. This can be done with a correlation test over Bioclim’s variables, but I can not modify your code to eliminate (or keep) the chosen variables. Any idea?

    2. I want to create a binary map from the model. I am modifying the graphical parametters over certain probability threshold but, Is there any objective (better) way to do this? or a rule of thumb?

    2. I am doing community models of distribution, but they performed badly (AUC ~.6). how to improve modelling or it should be in that way given the inter-species niche noise?

    Thanks again for this post.

    • Hi, Francisco — Those are all good questions that get into complex issues of SDM construction and interpretation, and I would strongly recommend you consult the resources cited in the post, as well as more recent published work. In particular,

      1) The documentation for the raster and dismo packages should cover how to subset a “brick” (which is the class of object you get when you query WorldClim with getData).

      2) I am not aware of any “objective” way to define a threshold for a “binary” map from a SDM, but you could check various thresholds by witholding a subset of your data points from the SDM estimation procedure and seeing how many are predicted as presence points at any given threshold. The reliability of predictions from any given SDM, and any illustration of those predictions, can really only be judged in relation to direct information about the organism in question.

      3) I have never done anything with distribution models for multiple species, and would not trust myself to comment on them.