One of the major obstacles to turning your sequence data into phylogenetic trees is choosing (and learning) a tree-building program. Confounding this problem is the fact that most researchers will want to perform numerous, complementary analyses, each of which may require finding, downloading, compiling, learning, and running different programs, requiring different data formats and producing non-compatible output files.
This can be — to put it mildly — a headache. Sometimes, the headache is unavoidable, and the nature of our datasets is a limiting factor. For example, modern high-throughput sequencing methods produce data sets that are so huge and computationally-intensive to analyze that there’s little choice but to use programs optimized for handing these data (e.g. ExaML).
But what if you have more modest goals, and are interested in inferring single gene trees or estimating phylogenies using small concatenated alignments? The programming language R provides numerous packages to run basic phylogenetic analyses in a streamlined, consistent pipeline, and can be a good choice for getting a feel for your data as it allows for all the advantages of its host platform.
What follows is a tutorial on implementing basic treebuilding methods in R, using an example single-locus alignment of 40 mammal species from Joe Felsenstein’s website. Once you’ve whet your appetite (or perhaps hit a snag I don’t address), I’d encourage you to leave a question in the comments, or look at the source documentation and more extensive tutorials linked below.
Alignment File Formats and Conversion
To begin, you’ll need to install two packages that provide the basis for manipulating sequence data in R: ape and phangorn.
setwd("~/Desktop/your_folder/") install.packages("ape") install.packages("phangorn") install.packages("seqinr") library(ape) library(phangorn) library(seqinr)
read.dna() function in the package ape, you’ll import your sequence data, choosing between “interleaved,” “sequential,” “clustal,” and “fasta” formats. From there, you’ll want to convert your alignment to a “phyDat” object for use in phangorn, the major package for treebuilding in R. Once it’s imported, it’s easy to subset your data and only include, say, the first ten taxa:
mammals <- read.dna("mammals.dna", format="interleaved") mammals_phyDat <- phyDat(mammals, type = "DNA", levels = NULL) mammals10 <- subset(mammals_phyDat, 1:10) mammals10_phyDat <- phyDat(mammals10, type = "DNA", levels = NULL)
Model Testing and Distance Matrices
Reducing sequence alignments into a matrix of pairwise distances allows for the rapid estimation of phylogenetic trees (at the cost of the additional information those sequences contain). To produce a distance matrix, however, you’ll need to decide on the model of nucleotide (or protein) evolution that best fits your data, performing a likelihood ratio test.
mt <- modelTest(mammals10) print(mt) dna_dist <- dist.ml(mammals10, model="JC69")
Neighbor Joining, UPGMA, and Maximum Parsimony
Once you have a distance matrix, phangorn provides simple, quick functions for estimating trees from distance matrices using neighbor-joining and UPGMA algorithms, which can be visualized using the
mammals_UPGMA <- upgma(dm) mammals_NJ <- NJ(dm) plot(mammals_UPGMA, main="UPGMA") plot(mammalsNJ, main = "Neighbor Joining")
But which of these trees is a better fit for your data? Using the
parsimony() function, you can compare their respective parsimony scores. The function
optim.parsimony() lets you go a step further, searching treespace through nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR). Alternatively,
pratchet() allows you to perform the search with the parsimony ratchet algorithm.
parsimony(mammals_UPGMA) parsimony(mammals_NJ) mammals_optim <- optim.parsimony(mammals_NJ, mammals) mammals_pratchet <- pratchet(mammals10) plot(mammals_optim) plot(mammals_pratchet)
Maximum Likelihood and Bootstrapping
Though more computationally intensive than distance-matrix methods, building trees with maximum likelihood methods allows you to include the full data from your sequence alignment in a statistical framework that estimates model parameters. You can begin by computing the likelihood of a given tree with the function
pml(). Then, the function
optim.pml() will optimize tree topology and branch length for your selected model of nucleotide evolution.
fit <- pml(mammals_NJ, mammals10) print(fit) fitJC <- optim.pml(fit, model = "JC", rearrangement = "stochastic") logLik(fitJC) bs <- bootstrap.pml(fitJC, bs=100, optNni=TRUE, multicore=TRUE, control = pml.control(trace=0)) plotBS(midpoint(fitJC$tree), bs, p = 50, type="p")
Finally, unless you’re analyzing relationships among a very limited number of taxa, the system viewer in R is probably going to be insufficient for viewing and saving your tree. The function
write.tree() allows you to export the output of your analyses in Newick format, useful for further tinkering in programs such as FigTree.
Beyond what’s touched on here, R and the packages I’ve mentioned (as well as some I haven’t) provide many additional options for inferring and exploring phylogenetic trees. This tutorial is mostly based off the documentation for the ape and phangorn packages, and phangorn’s excellent vignette on treebuilding. The full script for my tutorial can be found on GitHub.