Metabarcoding for every body, every habitat, every time

The immediate reason I wanted to write about “Boosting DNA metabarcoding for biomonitoring with phylogenetic estimation of operational taxonomic units’ ecological profiles” is its usefulness for the scientific community and the effort the authors made to make their study reproducible. All their code and data are online and publicly available. They were even accessible before the paper was accepted!

Secondly, I like observing how the field of metabarcoding eukaryotes is developing. I feel at home in the field of metabarcoding prokaryotes, that is, archaea and bacteria, so many of the approaches are familiar. People who study prokaryotes and very small eukaryotes cannot simply go out and observe their study organisms interacting in the wild, because these tiny microorganisms are difficult to see. One very successful way to work around this obstacle has been to use genetics and study microbial DNA instead. The 16S rRNA gene has become extremely useful to bacteriologists. It encodes the RNA component of the small subunit of the prokaryotic ribosome and is essential, so virtually all prokaryotes carry it. Highly conserved regions of this gene serve as binding sites for universal primers, and the variable regions between them can be used to reconstruct bacterial phylogenies. 16S amplicons give us a glance at the distribution of bacteria and archaea, the oldest, most diverse and most abundant organisms on this earth (more about this here). Most environments, from remote volcanic hot springs to human genitalia, have been characterized using next-generation sequencing of 16S amplicons. While most people refer to these as 16S amplicon studies, they are basically a form of metabarcoding. Eukaryotes have a similar gene, the 18S rRNA gene. Hence, people studying small eukaryotes like fungi and protists have been using 18S amplicon studies for the same purpose.

Only recently has it become very fashionable to apply metabarcoding techniques to larger multicellular organisms and entire systems. I am excited to see how this scientific community is adopting methods from 16S amplicon microbiologists and bringing them to a new level. The paper by François Keck et al. that came out in Molecular Ecology Resources is a big step in this direction. I hope that the metabarcoding leaders and whiz kids learn from the successes and failures of microbiologists without re-inventing the wheel.

The application of DNA barcoding to metazoans has fundamentally changed the way we can assess biodiversity. For a more detailed description of this field, read this recent blogpost by Katharine Coykendall. Traditionally, biodiversity assessments were done by taxonomic experts during lengthy field expeditions. However, the development of barcoding methods that identify multiple taxa in a whole community simultaneously by DNA sequencing has made it fast and cost-effective to generate datasets. DNA can be collected from the environment (water, pollen, blood, soil samples, etc.) to find traces of its inhabitants. Collected repeatedly, these data can be used to monitor changes across time and space.

For metabarcoding, a universal gene that can be found in the majority of the taxa of interest is amplified from an environmental sample. Several such samples can be tagged and pooled. After sequencing, all amplified reads are sorted, filtered for quality and length, and then compared to a reference of known sequences. Here lies one of the biggest drawbacks of this technique: a sample is only as good as its reference, and most reference libraries are incomplete. Keck, the lead author of the paper discussed here, has proposed circumventing this problem by working with the reads directly instead of comparing them to a reference and using the assigned taxonomy (Keck, Rimet, Bouchez, et al. 2016; Keck, Rimet, Franc, et al. 2016; Keck et al. 2017). He and his co-authors suggest clustering similar reads into operational taxonomic units (OTUs) and then inserting them into a known phylogenetic tree, a reference phylogeny. This tree contains well-described taxa and their ecological profiles. Hence, short reads from the environment can be placed within a phylogenetic tree, and their ecological roles are inferred from the identities and roles of their closest neighbors. This method is termed OTU-PITI, for OTU phylogenetic insertion and trait imputation (my auto-correct ALWAYS turns ‘OTU’ into ‘out’! Very aggravating during manuscript writing. More about this below.).
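To make the clustering step a bit more tangible, here is a minimal Python sketch of greedy, centroid-based OTU clustering. It is not the pipeline used by Keck et al.; it only illustrates the general idea. The function names, the toy reads and the simplification that all reads have equal length (so identity can be computed position by position, without alignment) are my own assumptions.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads: list[str], threshold: float = 0.97) -> dict[str, list[str]]:
    """Cluster unique sequences around centroids, most abundant sequence first.

    Returns a mapping from each centroid to the unique sequences assigned to it.
    """
    counts = Counter(reads)
    otus: dict[str, list[str]] = {}
    for seq, _ in counts.most_common():
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq)  # close enough: join this OTU
                break
        else:
            otus[seq] = [seq]  # no centroid is close enough: start a new OTU
    return otus


# Hypothetical 100 bp reads (invented for illustration only).
base = "ACGT" * 25
variant = "T" + base[1:]            # one mismatch relative to base
distant = "T" * 10 + base[10:]      # eight mismatches relative to base
reads = [base] * 5 + [variant] * 2 + [distant]

otus = greedy_otus(reads)
for centroid, members in otus.items():
    print(centroid[:12] + "...", len(members), "unique sequence(s)")
```

Because the most abundant sequence is processed first, it becomes the first centroid. This is also why the outcome of such clustering depends on the abundances in the particular dataset at hand.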

Image of diatoms
by Wi Peter from Wikimedia Commons

The new method was tested using eDNA extracted from French river water samples containing diatoms. Diatoms are actually unicellular microorganisms belonging to the eukaryotes. They encompass a large taxonomic diversity and respond strongly to habitat quality, which is why they are often used as indicator species for environmental change. The rbcL gene, which codes for the large subunit of the RuBisCO enzyme, was amplified. This gene is recognized as a powerful marker to differentiate among diatom species and to resolve their phylogenetic relationships. A published phylogeny based on seven genes (including the full-length rbcL) was used as the reference tree. Shorter amplicons (~300 bp) of the rbcL gene were then placed in this tree. As the ecological signal in this study, pollution sensitivity and indicator values (IPSV) of the different diatom species were collated from the literature and the IPS database. Using traditional microscopy as a comparison, the authors found that their method performed substantially better at biomonitoring the environment. Moreover, it was also faster and cheaper.
The method assumes that ecological signals can be estimated from the phylogenetic position in the tree. That means that the similarity of traits among species depends on their phylogenetic relatedness. Here the authors cite Darwin’s principle of descent with modification indirectly through Felsenstein (1985).
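A small sketch may help to show what this assumption buys us. Below, the trait value of a newly inserted OTU is imputed as a weighted mean of the values of reference species, with weights that decay with phylogenetic distance. This is only one simple way to do it and not necessarily the estimator used in the paper; the exponential weighting, the `scale` parameter, the species names and all numbers are assumptions I made up for illustration.

```python
import math


def impute_trait(
    distances: dict[str, float],
    traits: dict[str, float],
    scale: float = 0.1,
) -> float:
    """Impute a trait as a distance-weighted mean of reference trait values.

    distances: phylogenetic distance from the inserted OTU to each reference species
    traits:    known trait value (e.g. a pollution sensitivity score) per species
    scale:     hypothetical tuning parameter; smaller values trust only close relatives
    """
    weights = {
        sp: math.exp(-d / scale) for sp, d in distances.items() if sp in traits
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no reference species with a known trait value")
    return sum(w * traits[sp] for sp, w in weights.items()) / total


# Invented example: two close relatives with high scores, one distant with a low score.
distances = {"species_A": 0.02, "species_B": 0.05, "species_C": 0.60}
traits = {"species_A": 4.8, "species_B": 4.5, "species_C": 1.2}

estimate = impute_trait(distances, traits)
print(round(estimate, 2))
```

The imputed value ends up close to the scores of the two near neighbors, while the distant species contributes almost nothing. If traits were not phylogenetically conserved, such an estimate would be no better than a guess, which is why measuring the phylogenetic signal first matters.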

Reference tree of 236 diatom species for which both phylogenetic position and IPSS values were available.

Keck’s method reminded me strongly of PICRUSt (pronounced ‘pie crust’), a bioinformatic software package designed to predict metagenome functions in bacteria from a marker gene (16S rRNA). PICRUSt also relies on the assumption that phylogeny and function are sufficiently linked to make predictions about a community’s functional potential (Langille et al. 2013). While PICRUSt is restricted to the 16S rRNA gene, Keck’s method can be extended to any known marker gene, depending on which group of organisms one wishes to study. Moreover, Keck and co-authors provide their clean, well-commented code on GitHub.

Their method can be considered taxonomy-free because they do not BLAST their short reads against a database of reference taxa. Instead, they use phylogenetic information. Phylogenetic distances such as UniFrac, which is widely used in microbial ecology and has been adopted in metabarcoding, can greatly improve the accuracy of community similarity and dissimilarity matrices (Lozupone and Knight 2005). Keck et al. have expanded our toolbox: phylogenetic information can now be used not only to compare communities but also to learn about their ecological functions.
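For readers who have not met UniFrac before, here is a minimal sketch of the unweighted version: the fraction of branch length in the tree that leads to members of only one of the two communities. The tree representation (a list of branches, each with its length and the set of tips below it), the tip names and the function name are my own simplifications for illustration; in practice one would use an established implementation.

```python
def unweighted_unifrac(
    branches: list[tuple[float, set[str]]],
    community_a: set[str],
    community_b: set[str],
) -> float:
    """Unweighted UniFrac: unique branch length / total observed branch length.

    branches: (length, tips below this branch) for every branch of the tree
    """
    unique = 0.0
    shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a and in_b:
            shared += length   # branch leads to members of both communities
        elif in_a or in_b:
            unique += length   # branch leads to members of only one community
    total = unique + shared
    if total == 0:
        raise ValueError("neither community has any tip in the tree")
    return unique / total


# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1), written out branch by branch.
tree = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t1", "t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (1.0, {"t3", "t4"}),
]

print(unweighted_unifrac(tree, {"t1", "t2"}, {"t3", "t4"}))  # no shared branches
print(unweighted_unifrac(tree, {"t1", "t2"}, {"t2", "t3"}))  # partly shared
```

Two communities from opposite sides of the tree get the maximum distance, while communities that share a tip (and therefore the branches leading to it) come out as more similar.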

The growing metabarcoding community impresses me with their open, friendly and supportive attitude. Many scientists are now using metabarcoding to answer ecological questions.
I have not seen much discussion about the usefulness of OTUs in this community, and I wonder whether people are aware of the advantages of amplicon sequence variants (ASVs), promoted by Benjamin Callahan et al. (Callahan, McMurdie, and Holmes 2017) but originally proposed as ‘Oligotyping’ by Meren et al. (Eren et al. 2013). They basically suggest working with sequences without clustering them into OTUs. Sequences must first be cleaned to remove sequencing errors. Then, variants can be resolved down to single-nucleotide differences, which gives much greater resolution. Moreover, ASVs have intrinsic biological meaning and are identified independently of any reference database. In contrast, to build OTUs, sequences are either compared to a reference database to assign taxonomy (closed-reference) or first grouped into clusters based on their pairwise sequence similarities (de novo clustering, as used in open-reference approaches). Similarity thresholds are chosen arbitrarily; for common 16S rRNA amplicons it is usually assumed that sequences with 97% similarity belong to the same species. However, when sequences are clustered at this level, two sequences in the same group can differ from each other by 6% while each differs from their reference by only 3%. The construction of open-reference OTUs depends on the relative abundances in the sampled communities. As a result, two open-reference OTU datasets cannot be compared without re-analyzing them and building new open-reference OTUs from both datasets combined. ASVs are reusable among studies, reproducible across datasets, and not limited by incomplete reference databases. They allow for more accurate diversity measures, applications across environments and meta-analyses.
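The 3% versus 6% point is easy to check with a toy example. In the sketch below, two invented 100 bp reads each differ from a reference at three positions, but at different ones, so they differ from each other at six. The sequences and the function name are made up for illustration, and real ASV methods additionally model and remove sequencing errors, which this sketch does not attempt.

```python
def percent_difference(a: str, b: str) -> float:
    """Percentage of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example expects equal-length sequences")
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


# Invented 100 bp sequences.
reference = "A" * 100
read_1 = "C" * 3 + "A" * 97   # differs from the reference at the first 3 positions
read_2 = "A" * 97 + "G" * 3   # differs from the reference at the last 3 positions

d1 = percent_difference(read_1, reference)
d2 = percent_difference(read_2, reference)
d12 = percent_difference(read_1, read_2)
print(d1, d2, d12)

# At a 97% threshold both reads fall into the same closed-reference OTU ...
same_otu = d1 <= 3.0 and d2 <= 3.0
# ... whereas treating exact sequences as the unit keeps them apart.
exact_variants = {read_1, read_2}
print(same_otu, len(exact_variants))
```

The threshold therefore limits the distance of each read to the reference, not the distance between the reads inside one OTU.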


Callahan, Benjamin J., Paul J. McMurdie, and Susan P. Holmes. 2017. “Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker-Gene Data Analysis.” The ISME Journal 11 (12): 2639–43.

Eren, A. Murat, Loïs Maignien, Woo Jun Sul, Leslie G. Murphy, Sharon L. Grim, Hilary G. Morrison, and Mitchell L. Sogin. 2013. “Oligotyping: Differentiating between Closely Related Microbial Taxa Using 16S rRNA Gene Data.” Methods in Ecology and Evolution 4 (12).

Felsenstein, Joseph. 1985. “Phylogenies and the Comparative Method.” The American Naturalist 125 (1): 1–15.

Keck, François, Frédéric Rimet, Agnès Bouchez, and Alain Franc. 2016. “Phylosignal: An R Package to Measure, Test, and Explore the Phylogenetic Signal.” Ecology and Evolution 6 (9): 2774–80.

Keck, François, Frédéric Rimet, Alain Franc, and Agnès Bouchez. 2016. “Phylogenetic Signal in Diatom Ecology: Perspectives for Aquatic Ecosystems Biomonitoring.” Ecological Applications 26 (3): 861–72.

Keck, François, Valentin Vasselon, Kálmán Tapolczai, Frédéric Rimet, and Agnès Bouchez. 2017. “Freshwater Biomonitoring in the Information Age.” Frontiers in Ecology and the Environment 15 (5): 266–74.

Langille, Morgan G. I., Jesse Zaneveld, J. Gregory Caporaso, Daniel McDonald, Dan Knights, Joshua A. Reyes, Jose C. Clemente, et al. 2013. “Predictive Functional Profiling of Microbial Communities Using 16S rRNA Marker Gene Sequences.” Nature Biotechnology 31 (9): 814–21.

Lozupone, Catherine, and Rob Knight. 2005. “UniFrac: A New Phylogenetic Method for Comparing Microbial Communities.” Applied and Environmental Microbiology 71 (12): 8228–35.
