Phylogeny-aware comparisons of microbial communities – EdgePCA and Squash Clustering

I’m jumping on the bandwagon with a blog post about this new PLoS ONE paper (taking the lead from the man in charge in my lab) because the algorithms are just so exciting:
Matsen FA IV, Evans SN. (2013) Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLoS ONE. 8(3):e56859.
Our lab works closely with Erick Matsen and his group at the FHCRC – we’ve integrated their software for phylogenetic placement and community comparisons of short read data (pplacer and guppy) into our in-house pipeline for phylogenetic analysis of environmental metagenomes (PhyloSift). The two algorithms they discuss in this new PLoS ONE paper, Edge PCA and Squash Clustering, are implemented within the guppy software package. I can vouch for the usability of the Matsen group’s software – it is well documented and typically pretty easy to install, so I suggest you try it out if you’re as excited as I am by the work described in the above-mentioned paper.
Now for the good stuff – what do EdgePCA and Squash Clustering do? Conceptually, they represent alternatives to traditional PCoA/MDS analysis and UPGMA clustering, respectively. The UniFrac algorithm (as implemented in QIIME) currently represents the default approach for carrying out these traditional ecological analyses on high-throughput rRNA amplicon datasets. However, although UniFrac uses a phylogenetic tree as input, it is still fundamentally a distance-based metric:

Once distances have been computed between samples using UniFrac, these distances are typically fed into general-purpose ordination and clustering methods, such as principal coordinates analysis and UPGMA. Although it is appropriate to apply such techniques to distance matrices of this sort, the classical methods do not use the fact that the underlying distances were calculated in a specific manner, namely, on a phylogenetic tree. Consequently, in an application of principal components analysis, it is difficult to describe what the axes represent. Similarly, in hierarchical clustering, it is unclear what is driving a certain agglomeration step; although it can be explained in terms of an arithmetic operation, a certain amount of interpretability in the original phylogenetic setting is lost. [Matsen & Evans 2013]
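To make the point in that quote concrete, here is a minimal sketch of classical PCoA written with NumPy. It is not QIIME's or guppy's code; the function name `pcoa` and the four-sample distance matrix are made up for illustration. Notice that the only input is a matrix of pairwise distances, so nothing about the tree survives into the axes.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical PCoA (metric MDS) on a square matrix of pairwise distances."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower's transformation).
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Eigen-decompose and keep the axes with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each eigenvector; negative eigenvalues are clipped to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals

# Made-up distances between four samples (NOT real UniFrac output).
toy_dist = np.array([
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 0.9, 0.3, 0.0],
])
coords, eigvals = pcoa(toy_dist)
print(coords)  # one row per sample, one column per axis
```

The axes that come out are combinations of samples, not of branches, which is why it is hard to say what they mean in phylogenetic terms.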

Personally, I find EdgePCA and Squash Clustering to be more intuitive, because you can visualize and explore community patterns on the tree topology itself. In EdgePCA, lineages that drive community differences in each principal component are visualized as colored and fattened branches in the reference tree:

Figure 4. The first principal component for the combined vaginal data, representing about 56 percent of the variance. The reference tree is colored by principal component sign (positive colored orange, negative colored green) and thickened proportional to magnitude. The edges across which maximal between-sample heterogeneity is found are those leading to the Lactobacillus clade and those leading to the Sneathia and Prevotella clade. This axis corresponds to taxa that are important in the diagnosis of bacterial vaginosis, as Sneathia and Prevotella are associated with bacterial vaginosis, while Lactobacillus is associated with a healthy microbiome. [Matsen & Evans 2013]


And different principal components in EdgePCA can tell different stories. Figures 4 & 5 in the paper represent data from a vaginal microbiome study – you’ll notice that the first principal component in Figure 4 (representing 56% of the variance) shows general trends over the whole reference tree, while the second principal component in Figure 5 (below, representing 24% of the variance) nails down more fine-scale differences observed between two lineages in the Lactobacillus clade:
Figure 5. The second principal component for the combined vaginal data, representing about 24 percent of the variance. Low-weight regions of the tree are excluded from the figure. The edges across which maximal between-sample heterogeneity is found are those between two different Lactobacillus clades: L. iners and L. crispatus. Thus, the second important “axis” appears to correspond to the relative levels of these two species. [Matsen & Evans 2013]
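For anyone who wants to see why the principal components map back onto branches, here is a rough sketch of the idea as I read it from the paper: each sample becomes a vector with one entry per edge (the mass on the far side of that edge minus the mass on the near side), and ordinary PCA is run on those vectors. The tree, the node names and the four samples below are entirely made up, mass is placed at nodes rather than along edges, and this is not guppy's implementation.

```python
import numpy as np

# Hypothetical toy reference tree, stored as child -> parent.
PARENT = {
    "lacto": "root", "bv": "root",
    "L_iners": "lacto", "L_crispatus": "lacto",
    "Sneathia": "bv", "Prevotella": "bv",
}
EDGES = list(PARENT)  # each edge is named after the node at its far end

def distal_mass(sample):
    """Mass sitting at or below the far end of every edge."""
    below = dict.fromkeys(EDGES, 0.0)
    for node, mass in sample.items():
        while node != "root":
            below[node] += mass
            node = PARENT[node]
    return below

def edge_vector(sample):
    """One entry per edge: mass on the far side minus mass on the near side."""
    total = sum(sample.values())
    below = distal_mass(sample)
    return np.array([below[e] - (total - below[e]) for e in EDGES])

def edge_pca(samples):
    """Plain PCA on the samples-by-edges matrix."""
    matrix = np.array([edge_vector(s) for s in samples])
    cov = np.cov(matrix, rowvar=False)  # np.cov centres each column itself
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Made-up samples: two dominated by Lactobacillus, two by the other clade.
toy_samples = [
    {"L_iners": 0.7, "L_crispatus": 0.3},
    {"L_iners": 0.2, "L_crispatus": 0.8},
    {"Sneathia": 0.5, "Prevotella": 0.5},
    {"Sneathia": 0.6, "Prevotella": 0.3, "L_iners": 0.1},
]
eigvals, eigvecs = edge_pca(toy_samples)
explained = eigvals / eigvals.sum()
pc1 = dict(zip(EDGES, eigvecs[:, 0]))  # one loading per edge
print(explained[:2])
print(pc1)
```

Because every loading in `pc1` belongs to a specific edge, its sign and size can be drawn directly on the tree as a color and a thickness, which is the kind of picture shown in Figures 4 and 5.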


Squash Clustering, on the other hand, is a way of comparing “phylogenetic fingerprints” of microbial communities, to see how similar or different they may be. Matsen & Evans give a good analogy:

Imagine that the phylogenetic tree is a road network and that each sample is represented by the distribution of a unit of mass into piles of dirt along this road network. The distance between two samples is then defined to be the minimal amount of “work” required to move the dirt in the first configuration to that in the second configuration (in this context the amount of work needed to move an infinitesimal mass d a distance x is defined to be d·x). Thus, similar collections of phylogenetic placements result in similar dirt pile configurations that don’t require much mass movement to transform one into the other, while quite different collections of placements require that significant amounts of mass must move long distances across the tree. This distance is classical, having roots in 18th century mathematics, and is a generalization of the weighted UniFrac distance. [Matsen & Evans 2013]
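The dirt-moving analogy is easy to play with in a few lines of Python. On a tree, the minimal work can be added up edge by edge: for each edge, take the difference between the two samples in how much mass sits below it, and multiply by the branch length. The sketch below uses a made-up four-edge tree and, as a simplification, puts all mass at nodes rather than at positions along edges; `TREE`, `mass_below` and `kr_distance` are names I invented, not part of guppy.

```python
# Hypothetical toy tree: child -> (parent, branch length).
TREE = {
    "A": ("ab", 1.0),
    "B": ("ab", 1.0),
    "ab": ("root", 0.5),
    "C": ("root", 2.0),
}

def mass_below(sample):
    """Mass sitting at or below the far end of every edge."""
    below = dict.fromkeys(TREE, 0.0)
    for node, mass in sample.items():
        while node != "root":
            below[node] += mass
            node = TREE[node][0]
    return below

def kr_distance(p, q):
    """Minimal work to turn dirt configuration p into q on TREE.

    Both samples are assumed to carry the same total mass (one unit).
    """
    below_p, below_q = mass_below(p), mass_below(q)
    return sum(
        length * abs(below_p[node] - below_q[node])
        for node, (_parent, length) in TREE.items()
    )

# All the dirt moves from A to its sister B: two branches of length 1.
print(kr_distance({"A": 1.0}, {"B": 1.0}))            # 2.0
# All the dirt moves from A across the root to C: 1 + 0.5 + 2.
print(kr_distance({"A": 1.0}, {"C": 1.0}))            # 3.5
# Only half of the dirt has to move from A to B.
print(kr_distance({"A": 1.0}, {"A": 0.5, "B": 0.5}))  # 1.0
```

Moving dirt between sister lineages is cheap and moving it across the root is expensive, which is the behaviour the analogy describes.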

In practice, Squash Clustering looks at the placement of reads across the reference tree for sample 1 vs sample 2 (and so on, for as many samples as you have in your dataset):

Figure 2. A visual depiction of the squash clustering algorithm. When two clusters are merged, their mass distributions are combined according to a weighted average. The edges of the reference tree in this figure are thickened in proportion to the mass distribution (for simplicity, just a subtree of the reference tree is shown here). In this example, the lower mass distribution is an equal-proportion average of the upper two mass distributions. Similarities between mass distributions, such as the similarity seen between the two clusters for the G. vaginalis clade shown here, are what cause clusters to be merged. Such similarities between internal nodes can be visualized for the squash clustering algorithm; the software implementation produces such a visualization for every internal node of the clustering tree. Note that in this figure only the number of reads placed on each edge is shown, although each placement has an associated location on each edge when performing computation. [Matsen & Evans 2013]


It then gives you back a tree (where each tip represents a sample) showing how the communities present in each sample are related to each other:
Figure 8. The results of (a) squash clustering as applied to the vaginal data. [Matsen & Evans 2013]
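The clustering loop itself can be sketched in a few lines: repeatedly find the two closest clusters, merge them, and let the merged cluster carry the averaged ("squashed") mass distribution of its two children. This is a toy version under my own assumptions, not guppy's code: the tree and samples are made up, mass sits at nodes only, and I weight the average by the number of samples in each cluster.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length).
TREE = {
    "A": ("ab", 1.0), "B": ("ab", 1.0),
    "ab": ("root", 0.5), "C": ("root", 2.0),
}

def mass_below(sample):
    """Mass sitting at or below the far end of every edge."""
    below = dict.fromkeys(TREE, 0.0)
    for node, mass in sample.items():
        while node != "root":
            below[node] += mass
            node = TREE[node][0]
    return below

def kr_distance(p, q):
    """Earth mover's distance between two unit-mass samples on TREE."""
    below_p, below_q = mass_below(p), mass_below(q)
    return sum(length * abs(below_p[n] - below_q[n])
               for n, (_parent, length) in TREE.items())

def squash(samples):
    """Cluster samples (name -> {node: mass}); returns nested tuples of names."""
    # Every cluster stores its mass distribution and how many samples it holds.
    clusters = {name: (dict(mass), 1) for name, mass in samples.items()}
    while len(clusters) > 1:
        # Pick the pair of clusters whose mass distributions are closest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: kr_distance(clusters[pair[0]][0],
                                         clusters[pair[1]][0]),
        )
        (mass_a, n_a), (mass_b, n_b) = clusters.pop(a), clusters.pop(b)
        n = n_a + n_b
        # "Squash": the merged cluster gets the weighted average of the masses.
        merged = {
            node: (n_a * mass_a.get(node, 0.0) + n_b * mass_b.get(node, 0.0)) / n
            for node in set(mass_a) | set(mass_b)
        }
        clusters[(a, b)] = (merged, n)
    return next(iter(clusters))

# Made-up samples: two sit mostly on A/B, two mostly on C.
toy = {
    "s1": {"A": 0.9, "B": 0.1},
    "s2": {"A": 0.8, "B": 0.2},
    "s3": {"C": 1.0},
    "s4": {"C": 0.9, "A": 0.1},
}
print(squash(toy))  # (('s1', 's2'), ('s3', 's4'))
```

Because each internal node of the resulting tree still has a mass distribution on the reference tree, you can look at any merge and see which lineages the two clusters had in common.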


There are currently two limitations of EdgePCA and Squash Clustering, both related to taxon sampling in your reference phylogeny.

  • First, you lose resolution if a clade (a taxon of interest) is not represented amongst the sequences you use to build your reference tree. In the vaginal dataset presented by Matsen & Evans (Figure 5, above), the second principal component, accounting for 24% of variance between samples, was driven by two species of Lactobacillus (a highly sampled clade in 16S phylogenies, due to the importance of this genus in human health). You wouldn’t see this fine-scale variance in a dataset from a much less characterized environment (the deep sea, for example) because we just don’t have enough representative taxa in public rRNA databases like SILVA.
  • Similarly, the distribution of taxon sampling across a tree will currently bias the computation of community comparisons, because as Matsen & Evans state, “more highly sampled lineages will be assigned comparatively more weight in the PCA analysis than less sampled lineages.” A sparsely sampled clade might have one long, deep branch with one leaf, whereas a well-sampled clade will have many taxa and therefore many leaves and internal branches. Because EdgePCA works on edges (branches), you’ve just got more edges to work with in well-sampled clades.

And finally, the paper hints at other exciting things on the horizon, including:

  • A manuscript in preparation that looks at how reads from different regions of rRNA genes (16S in this case) affect phylogeny-based clustering and ordination methods.
  • Further modifications to the edge PCA algorithm that reduce biases stemming from taxon sampling in your reference tree.

Holly Bik is a postdoctoral researcher in marine genomics, working in Jonathan Eisen’s lab at the University of California Davis.
