dN(eutralist) < dS(electionist) Part 5

The neutral theory predicts that species with small census (and effective) population sizes are subject to greater drift (larger allele frequency fluctuations), whereas species with larger population sizes are expected to maintain more neutral diversity (polymorphisms). Intuitively, then, the greater efficacy of selection in larger populations could constrain neutral genomic diversity there. Little evidence exists, however, that population size (or drift) alone determines levels of neutral diversity: “an old riddle” (Leffler et al. 2012), also termed Lewontin’s Paradox (Lewontin 1974). Today’s discussion of the neutralist-selectionist debate thus borrows from several concepts that I have written about over a series of posts: background selection (negative selection at sites) and hitchhiking (positive selection and selective sweeps at sites), both of which reduce genomic diversity at linked neutral sites. Diversity reduction is also correlated with site-specific recombination rates (as discussed here).

Species with large census sizes (eg. watermelon, silkmoths), versus small census sizes (orange, olive baboon). Image courtesy: http://dx.doi.org/10.1371/journal.pbio.1002113.g001


Positive correlation between the impact of natural selection and the geographic range of a species (A) - Figure 2 from Corbett-Detig et al. (2015)


In a recent publication, quite possibly the largest study of its kind, Corbett-Detig et al. (2015) report strong evidence that natural selection constrains neutral diversity across multiple species. Analyzing variation across genomic windows in 40 species of plants and animals with varying census population sizes, the authors (a) call variants (against reference genomes), (b) estimate recombination rates, (c) fit models of background selection, hitchhiking, and neutrality and compare their likelihoods to determine the genome-wide reduction in polymorphism, and (d) correlate recombination rates and the impact of selection with proxies for census population size (geographic range and body size). Their analyses indicate strong evidence for the impact of natural selection on the reduction of linked neutral diversity in species with large census sizes (e.g. invertebrates, herbaceous plants). Conversely, species with small population sizes (e.g. vertebrates, woody plants) show greater evidence of genetic drift influencing neutral genomic diversity. These findings remain significant when accounting for genome assembly quality, variation in genome size, recombination rates, sampling variance across chromosomes, and polymorphism levels affected by domestication. Their model also predicts that hitchhiking removes more linked neutral diversity than background selection in species with greater census population sizes, although background selection is more prevalent among all species analyzed.
This study, while strengthening the evidence for an explanation of Lewontin’s paradox, also discusses violations of the neutral theory in several species (particularly those with large census population sizes).

It is therefore essential to consider selective processes when studying the distribution of genetic diversity within and between species. Incorporating selection into standard population genetic models of evolution will be a central and important challenge for evolutionary geneticists going forward.

Also see the commentary on this paper by Roland Roberts (2015).
References:
Corbett-Detig RB, Hartl DL, Sackton TB (2015) Natural Selection Constrains Neutral Diversity across A Wide Range of Species. PLoS Biol 13(4):e1002112. doi:10.1371/journal.pbio.1002112
Leffler EM, Bullaughey K, Matute DR, Meyer WK, Ségurel L, et al. (2012) Revisiting an Old Riddle: What Determines Genetic Diversity Levels within Species? PLoS Biol 10(9): e1001388. doi:10.1371/journal.pbio.1001388
Lewontin RC (1974) The genetic basis of evolutionary change. New York: Columbia University Press. xiii, 346 p.
Roberts RG (2015) Lewontin’s Paradox Resolved? In Larger Populations, Stronger Selection Erases More Diversity. PLoS Biol 13(4): e1002113. doi:10.1371/journal.pbio.1002113

Posted in evolution, mutation, natural history, plants, population genetics, theory

Migration on the brain

If you’ve watched any number of nature shows in your lifetime, you’ve seen the astounding migrations made by salmonid fishes. You can count on seeing a shot of salmon darting against the current and catapulting themselves over turbulent falls (like this!). These migrations between freshwater streams and the ocean are spectacular for both their magnitude and difficulty, but the changes that happen within each fish to get them to migrate in the first place might be just as interesting.

The two forms of Oncorhynchus mykiss: the anadromous steelhead (top) and the resident rainbow trout (bottom)


This month’s issue of Molecular Ecology includes a new study from Garrett McKinney and colleagues that compares gene expression patterns in the brains of resident and migratory rainbow trout. The form that completes long migrations to the ocean and back, called steelhead or anadromous, undergoes striking changes in phenotype to make these journeys. These include a different body shape, different coloration, and various physiological changes to deal with saltwater. These developmental changes have previously been associated with genetic differences, but little is known about how and when those genetic differences manifest themselves.

Currently, studies of transcriptome-wide patterns of gene expression in salmonids have largely ignored ontogenetic changes during early development and little is known about the timing of activation of molecular pathways that regulate phenotypic differentiation.

McKinney and colleagues generated transcriptomes from the brain tissue of trout that were migratory or resident types. This sampling happened at multiple points over a year, and the authors showed that major differences in gene expression happen at around eight months, especially in males.

The majority of differentially expressed genes between migrants and residents were unique not only to a single time point but also to a single sex, indicating possible temporal differences in gene expression during development and significant sex differences. This raises the possibility that males and females may be developing at different rates or utilizing different molecular pathways during development.

At eight months old, these fish are still a year away from the big phenotypic differences that aid in migration, but their expression pathways are already cranking up proteins that are specific to those physiological differences. In addition, the authors map these expression differences to previously-documented QTLs and chromosomes that are associated with migration phenotypes.
As with other transcriptome-based research on non-model organisms, the authors are limited in which genes they can actually annotate, so who knows how many undescribed genes are also determining which fish “just keep swimming”.
 
McKinney G.J., Hale M.C., Goetz G., Gribskov M., Thrower F.P. & Nichols K.M. (2015). Ontogenetic changes in embryonic and brain gene expression in progeny produced from migratory and resident Oncorhynchus mykiss. Molecular Ecology, 24 (8) 1792-1809. DOI: http://dx.doi.org/10.1111/mec.13143

Posted in Molecular Ecology, the journal, natural history, RNAseq, transcriptomics

The gopher tortoise gut microbiome

A gopher. Not a gopher tortoise. From the movie Caddyshack.


A few weeks ago I wrote about a study on socially structured gut microbiomes in wild baboons. Well, now I’m here to tell you about a new study that examined the population structure of tortoise gut microbiomes.

Posted in community ecology, genomics, natural history, next generation sequencing, population genetics, Uncategorized

Plastic and evolved responses to host fruit in apple maggot flies

The apple maggot fly, Rhagoletis pomonella, which is so much prettier than its name implies! Photo by Phil Huntley-Franck, bugguide.net


The apple maggot fly, Rhagoletis pomonella, is a prominent system for the study of sympatric speciation. Sister taxa in the R. pomonella species complex, the apple-infesting race of R. pomonella and the snowberry-infesting R. zephyria, have sympatric distributions and the fruiting time of their preferred hosts widely overlaps. However, apple trees and snowberry fruits contain distinct secondary metabolites that have toxic effects on herbivorous insects and may facilitate adaptive divergence in Rhagoletis.
In their new Molecular Ecology paper, Ragland et al. performed reciprocal transplants and measured variation in performance (larval survivorship, larval development time, and pupal mass) and gene expression in fly larvae of the apple-infesting R. pomonella and R. zephyria raised on different host fruit. The aim of the study was to examine the plastic and evolved performance differences between R. pomonella and R. zephyria, which hybridize at low frequency in the field but remain genetically and morphologically distinct.

Posted in adaptation, evolution, speciation, transcriptomics

Gorillas (genomes) in the mist

Mountain gorillas are an endangered great ape subspecies that number around 800 individuals, inhabiting mountain ranges in central Africa. They have been the subject of numerous field studies, but few genetic analyses have been carried out.
Xue et al. (2015) sequenced whole genomes from wild individuals. Unlike other great apes, mountain gorillas had not been previously studied on a genome-wide scale, despite severe population bottlenecks and reports of phenotypic indicators of inbreeding.

© worldwildlife.org


Recent declines in the mountain gorilla population have led to rather extensive inbreeding. Xue et al. found that chromosomes were typically homozygous over 1/3 of their length, much more than in severely inbred human populations.
Concern has mounted about the survival of mountain gorillas, particularly with regard to human encroachment on their habitat. This is all the more troubling as high levels of inbreeding may render populations less resilient to environmental change and pathogens.

However, the origins of this condition [an increased burden of deleterious mutations and low genetic diversity stemming from several recent generations of inbreeding] extend far into their history, because both eastern subspecies have experienced a long decline over tens of millennia.

Xue et al. point to the “unhappy resemblance” between the demographic histories of mountain and eastern lowland gorillas and those inferred for Neandertals before they became extinct.
However, they do discuss the fact that gorilla subspecies have survived for thousands of generations at very low population levels. They may have developed strategies to avoid inbreeding, such as natal dispersal.
Genomic resources will support conservation efforts and future research aimed at preventing mountain gorilla extinction.
References
Xue, Y et al. (2015) Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding. Science 348 (6231): 242-245. doi: 10.1126/science.aaa3952

Posted in bioinformatics, conservation, evolution, genomics, natural history, next generation sequencing, primates

Visualizing Linkage Disequilibrium in R

Patterns of linkage disequilibrium (LD) across a genome have multiple implications for inferring a population’s ancestral demography. For instance, population bottlenecks predictably result in increased LD; LD between SNPs at loci under natural selection affects each other’s rates of adaptive evolution; selfing/inbreeding populations accumulate LD; and so on (for an excellent review, see Slatkin 2008). Patterns of LD thus allow estimation of allele ages, patterns of ancestral admixture, and natural selection, among other demographic processes.

In exploring some quick and dirty ways to understand/determine patterns of LD across the genome, here’s a simple tutorial for plotting LD in R (with a little help from PLINK). As with some earlier posts, I am using the 1000 Genomes data as an example. Here we explore variation in LD between the CEU (Utah residents with Northern and Western European ancestry) and YRI (Yoruba in Ibadan, Nigeria) individuals for whom pedigree information is available.

Patterns of linkage disequilibrium across a section of chromosome 6 between the CEU and YRI populations.

To obtain PED, MAP files

The 1000 Genomes Project has a neat tool to obtain PED (pedigree), and SNP information files from VCF (variant call format) files – more information on this can be found here. I downloaded PED and “.info” files for the example Chromosome 6 files for the CEU and YRI populations, between chromosome coordinates 6:46620015 and 6:46620998. The “.info” file has only two columns (containing the name of the SNP, and the coordinate) and has to be modified to make a MAP file that can then be used with PLINK to obtain LD statistics.

To do this, edit the file: add a first column with the chromosome number (here 6) and a third column of 0’s, so that the four columns are chromosome, SNP name, genetic distance (0 when unknown), and base-pair coordinate. For more information on MAP files, see this link. Now, to calculate LD statistics (here ‘r’), I use PLINK at the command line (in Unix here) with:

$ ./plink --file ceu --noweb --r --allow-no-sex
$ mv plink.ld ceu.ld
$ ./plink --file yri --noweb --r --allow-no-sex
$ mv plink.ld yri.ld

This should produce two files, “ceu.ld” and “yri.ld”, which contain pairwise LD estimates across all SNPs in each MAP file. For more information on the “r” statistic, and how to estimate LD across different window lengths in PLINK, I refer you to this page.
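If you would rather script the MAP-file edit described earlier than do it by hand, here is a minimal sketch with awk. The function name `info_to_map` is my own invention, and it assumes the “.info” file has no header line, is whitespace-separated, and covers a single chromosome, so check your download before relying on it.

```shell
# Build a PLINK MAP file from a two-column .info file (SNP name, bp coordinate).
# Assumptions: no header line, whitespace-separated columns, one chromosome per file.
info_to_map() {
    chrom="$1"   # chromosome code for column 1, e.g. 6
    info="$2"    # path to the .info file
    # MAP columns: chromosome, SNP name, genetic distance (0 = unknown), bp coordinate
    awk -v chr="$chrom" 'NF >= 2 { printf "%s\t%s\t0\t%s\n", chr, $1, $2 }' "$info"
}

# Usage (file names are placeholders for whatever you downloaded):
#   info_to_map 6 ceu.info > ceu.map
```

The MAP file needs to share its prefix with the PED file (e.g. ceu.ped and ceu.map) for `--file ceu` to pick both up.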
Now onto plotting these in R:

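# Read PLINK's pairwise LD tables (one row per SNP pair; R is the signed correlation)
yri <- read.table("yri.ld", header=TRUE)
ceu <- read.table("ceu.ld", header=TRUE)
# Sort SNP pairs by physical distance so each line is drawn from left to right
yri1 <- yri[order(yri$BP_B - yri$BP_A), ]
ceu1 <- ceu[order(ceu$BP_B - ceu$BP_A), ]

# Plot r^2 against distance: YRI in red, CEU overlaid in blue
plot(yri1$BP_B - yri1$BP_A, yri1$R^2, type="l", col="red", ylim=c(0, max(ceu1$R^2, yri1$R^2, na.rm=TRUE)), lwd=2, xlab="Distance between SNPs (bp)", ylab=expression(r^2))
points(ceu1$BP_B - ceu1$BP_A, ceu1$R^2, type="l", col="blue", lwd=2)
legend(500, 0.15, c("CEU","YRI"), lwd=c(2,2), col=c("blue","red"))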

And voilà! A simple LD plot. You should be able to play around with cut-off lines, plotting multiple populations, etc. from here on.
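If the raw pairwise curve looks too jagged, one option is to average r² within distance bins before plotting. The sketch below is only a suggestion: `ld_decay` is a hypothetical helper name, the default bin width of 100 bp is arbitrary, and it assumes the seven-column layout that `--r` writes (CHR_A, BP_A, SNP_A, CHR_B, BP_B, SNP_B, R) with one header line.

```shell
# Summarize a PLINK .ld file (from --r) as mean r^2 per distance bin.
# Output columns: bin start (bp), number of SNP pairs, mean r^2.
ld_decay() {
    ld="$1"            # e.g. ceu.ld
    bin="${2:-100}"    # bin width in bp (arbitrary default)
    awk -v bin="$bin" '
        NR > 1 && tolower($7) != "nan" {   # skip the header and undefined correlations
            d = $5 - $2                    # distance between the two SNPs
            if (d < 0) d = -d
            b = int(d / bin) * bin         # start coordinate of the distance bin
            sum[b] += $7 * $7              # R is signed, so square it
            n[b]++
        }
        END { for (b in n) printf "%d\t%d\t%.4f\n", b, n[b], sum[b] / n[b] }
    ' "$ld" | sort -n
}

# Usage: ld_decay ceu.ld 100 > ceu.decay.txt
```

The resulting table can be read into R with read.table() and plotted in the same way as the raw values.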

References

Slatkin, Montgomery. “Linkage disequilibrium—understanding the evolutionary past and mapping the medical future.” Nature Reviews Genetics 9.6 (2008): 477-485.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

Posted in bioinformatics, howto, population genetics, R

Don't trust your data: reviewing Bioinformatics Data Skills

Image by Tau Zero

The Molecular Ecologist receives a small commission for purchases made on Bookshop.org via links from this post.

There is little debate on the importance of bioinformatics for the present and future of science. As molecular ecologists, we are likely more aware of this than most disciplines due to the data explosion that has accompanied the wide application of next-generation sequencing methods. However, many of you (like me!) might be caught in an awkward area of bioinformatics expertise: too late to have these basics included in your undergraduate/graduate courses and too early to hire a freelance bioinformatician with your fat grant.

So there you are, staring at a bunch of fasta files wondering how to use someone’s poorly-documented python scripts. Maybe you have a tear in your eye and worry in your heart. You want to get your data from point A to B, but you realize that you are trying to use a tool without any understanding of its underlying concepts and the whole thing is written in a foreign language. Imagine implementing an ANOVA without any understanding of the normal distribution and all the software menus are in Russian.

Posted in bioinformatics, book review, genomics, software

A call for statistical editors in ecology

Pop quiz for all you potential “statistical editors” out there: What’s this?


A new article in TREE wants to add a specialized reviewer to the peer review process. von Wehrden, Schultner, and Abson suggest that a statistical editor would expedite* the peer review process:

“The review process of a manuscript with imperfect statistics typically takes several months, while a statistical editor could return the manuscript to authors within days or weeks”

It’s a noble idea, but I don’t think that it will work.

Posted in Uncategorized

Molecular Ecology's best reviewers 2015

A laurel wreath in the middle of nowhere (Flickr: Kathrin & Stefan Marks)
As a continuation of our post from last year, Molecular Ecology is publishing a list of our very best referees from the last two years (2013 and 2014). Our hope is that the people listed below will put ‘Top Reviewer for Molecular Ecology 2015’ on their resume, and that this will highlight to search committees and granting agencies that they have made a significant contribution to the community as a reviewer.
Everyone who completed a review for Molecular Ecology between 1st January 2013 and 31st December 2014 was eligible, and people were ranked by an index that included the number of reviews completed, the proportion of accepted review requests that led to a review being returned (excluding unassignments before the two week deadline), and the average time taken per review if this was over two weeks. The top 300 (~ 8%) are listed below the fold – thank you so much for your efforts!

Posted in housekeeping, Molecular Ecology, the journal, peer review, science publishing

A transcriptomic approach for reduced representation in population genomics

The purple sea urchin, Strongylocentrotus purpuratus, the focal taxon of Pespeni et al. 2013

Many population genomics studies use methods that provide a reduced representation of the genome, for example RADseq or UCEs. Targeting a subset of the genome reduces the cost of sequencing and makes data analyses less computationally intensive. In their recent Molecular Ecology paper, De Wit, Pespeni, and Palumbi (2015) suggest using expressed gene sequences (i.e. transcriptomic data) as a reduced representation method for population genomic studies. The review walks through the strengths (greater accuracy of functional annotation) and weaknesses (potential biases such as allele-specific expression) of this approach.

This review summarizes current methods, identifies ways that using expressed sequence data benefits population genomic inference, and explores how current practitioners evaluate and overcome challenges that are commonly encountered. We focus particularly on the additional power of functional analysis provided by expressed sequence data and how these analyses push beyond allele pattern data available from non-function genomic approaches.

The review contains sections on 1) assembly quality, 2) marker development and genotyping after assembling the transcriptome, and 3) applications of expressed sequence datasets. Based on a summary of the current literature, the authors also provide a “best-practice” pipeline that starts with experimental design and sequencing platform choice and moves to raw data quality control, transcriptome assembly and evaluation, ending with a reference transcriptome ready for SNP discovery. Below I will highlight a few points I found to be interesting or useful from each section and conclude with a few new questions the review raised in my mind.
Assembly quality
Poorly assembled transcriptomes can result in the creation of false SNPs, where nucleotide differences in paralogous sequences are mistaken for polymorphisms, and the omission of real SNPs if allelic differences are considered as belonging to two separate genes instead of one. These errors can be reduced using tBLASTn searches that query a high-quality protein database from a closely related species against your translated sequences, which will reveal incorrectly collapsed contigs when multiple orthologous proteins from the reference match a single collapsed contig. In the absence of a well-annotated reference, the authors suggest a conservative approach that keeps allelic variants as separate “genes.” This will reduce the number of false positives, albeit at the expense of potentially missing real SNPs.
Normalizing libraries to correct for extremely highly expressed genes reduces the number of sequencing errors. Normalization can be done bioinformatically after sequencing with programs such as khmer. However, the effect of digital normalization on qualitative measurements of assembly (such as percent of conserved orthologs identified) has not yet been tested. Normalizing the libraries during the library prep stages will also reduce overly expressed transcripts, thereby increasing the representation of lowly expressed and potentially interesting transcripts. However, normalization at this stage prevents the collection of quantitative information about transcript abundance making expression analyses impossible.
Marker development and genotyping
Population genomics studies often use pooled population-wide data to directly estimate allele frequencies. Ideally, both alleles at a given locus in a heterozygous individual would be equally represented in the data, as would each individual in a pooled sample. However, data are often skewed for technical or biological reasons, such as PCR artifacts or allele-specific expression (when one allele is more highly expressed than the other), respectively. There is currently an active debate about how these factors could mislead genotype estimates. De Wit et al. advocate estimating allele-specific expression before analyzing pooled data and outline several approaches to do so.
Applications

The red abalone, Haliotis rufescens, the focal taxon of De Wit et al. 2014. Photo courtesy of Kevin Lee, asnailodyssey.com

Population genomics is often interested in finding outlier loci under selection. Selection may act on many loci within regulatory or metabolic networks in such a way that individual loci fail to meet the stringent significance thresholds of outlier tests. However, De Wit et al. point out that by testing whether loci with high Fst are non-randomly clustered into functional categories, it is possible to infer the role of selection despite the absence of individually significant loci (for examples, see Pespeni et al. 2013 and De Wit et al. 2014).
A traditional use of transcriptomic data is to examine differences in gene expression among populations, which often play a role in population divergence and local adaptation. Combining expression data (determined from read counts) with SNP genotypes (determined from read sequences) allows for the examination of the functional role of SNPs in gene expression.
Final questions
As a scientist whose past and current research has focused on understanding the mechanisms responsible for allele frequency and gene expression differences among populations, it is extremely appealing to envision generating a single dataset that can be used to test both sets of questions. In their best practices pipeline, De Wit et al. suggest using sequence data collected across many different developmental phases, tissue types, sexes, and physiological stages for transcriptome assembly. They also suggest collecting 30-100 million reads to balance the coverage needed for gene discovery and downstream gene expression analyses with the accumulation of sequencing errors.
Determining how many and which individuals to sequence and the necessary depth of coverage are universal decisions for all NGS projects, and here De Wit et al. outline the gold standard experimental design for projects using transcriptomic data for population genomics. Ideally, all our experiments would meet this gold standard, but resources are often limiting, and this review made me wonder: can a single experimental design provide the data required to test hypotheses about population genomics AND differential gene expression? Or does one question have to take a back seat? How far can we deviate from the best practices outlined here before we lose confidence in our results? Furthermore, for which kind of question (gene expression vs. allele frequency differences) would we lose power first?
De Wit et al. clearly demonstrate that determining SNP genotypes and differential gene expression simultaneously from transcriptomic data shows much promise for evolutionary ecology and I am excited to see how this approach evolves over time.
Reference
De Wit, P., Pespeni, M. H., & Palumbi, S. R. (2015). SNP genotyping and population genomics from expressed sequences – current advances and future possibilities. Molecular Ecology. DOI: 10.1111/mec.13165

Posted in genomics, howto, methods, Molecular Ecology, the journal, next generation sequencing, RNAseq