Mixed modeling of methylation measures (increase your power by 60%)

Do you want to increase your power to detect differentially methylated CpG sites by 60%*? Yes?! Then do I have the pre-print for you.

Recently genomic advances, such as reduced representation bisulfite sequencing (RRBS), have allowed researchers to measure genome-wide DNA methylation at base-pair resolution. The statistical tools to analyze these data, however, have been trying to keep up with these technological advances.
DNA methylation data generated by RRBS (or WGBS and BS-seq) come in the form of a count of the number of methylated and unmethylated cytosines at any given position in the genome. Until now, you could analyze these data in one of two ways:
1) Convert the count into a proportion (methylated cytosines / methylated + unmethylated cytosines) and model using linear models.
2) Model the count data directly using a beta-binomial model, which also takes into account the well-known overdispersion of sequencing data.
All else being equal, you should always choose door #2 – the beta-binomial model. This is because it allows you to retain information on how confident you are in your measurement of methylation at any given base. If you convert the all data to proportions and use linear models, all sites are created equal. In other words:

a site at which 5 of 10 reads are designated as methylated (i.e., read as a cytosine) is treated identically to a site at which 50 of 100 reads are designated as methylated.

So modeling using a beta-binomial seems to be the best approach, and a few papers have shown that they afford a more powerful analysis. Unfortunately, what they don’t afford is flexibility. Specifically, the flexibility to control for population structure or relatedness – something that is present in pretty much any study of wild populations:

In humans, methylation levels at more than ten thousand CpG sites are influenced by local genetic variation, and DNA methylation levels in whole blood are 18%-20% heritable on average… As a result, DNA methylation levels will frequently covary with genetic relatedness (either kinship or population structure), and failure to account for this covariance could lead to spurious associations or reduced power to detect true effects.

Lea et al. developed a new binomial mixed model, “MACAU“, that directly models count data while controlling for for relatedness (or population) structure. The key here is the word “mixed” – the authors use random effects to model relatedness (or population) structure in concert with the overdispersion.
The authors compared their model, MACAU, to linear, linear mixed-effects, and beta-binomial models using both simulated and real data sets (from baboons and Arabidopsis). In almost* all of the comparisons their binomial mixed model had both increased sensitivity and specificity relative to the other models.

Taken together, our results indicate that MACAU is an effective tool for analyzing bisulfite sequencing data, with particular salience to analyses of structured populations.

So, if you’re planning on to use any bisulfite-conversion method to measure genome-wide DNA methylation in your study species or population of interest, you should probably get familiar with this method. In fact, MACAU could be useful if you’re planning on generating any sort of count-based, binomial data in your study population of interest:

Although we developed MACAU with the analysis of bisulfite sequencing data in mind, we note that a count-based binomial mixed model may be an appropriate tool in other settings as well. For example, allele-specific gene expression is often measured in RNA-seq data by comparing the number of reads originating from a given variant to the total number of mapped reads for that site.

Lea, A. J., Alberts, S. C., Tung, J. & Zhou, X. 2015. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. Cold Spring Harbor Labs Journals.
MACAU is freely available at www.xzlab.org/software.html.
*Results may vary. Lea et al were able to detect 1.6-fold more age-associated methylated sites than the next-best approach, a beta-binomial model.
**The authors found that “MACAU exhibit[ed] only a slight loss of power relative to a beta-binomial model without random effects when h2 = 0, while conferring better power and better test statistic
calibration when h2 > 0”

This entry was posted in bioinformatics, genomics, methods, next generation sequencing, software and tagged , , . Bookmark the permalink.