Plastic and evolved responses to host fruit in apple maggot flies

The apple maggot fly, Rhagoletis pomonella, which is so much prettier than its name implies! Photo by Phil Huntley-Franck, bugguide.net

The apple maggot fly, Rhagoletis pomonella, is a prominent system for the study of sympatric speciation. Sister taxa in the R. pomonella species complex, the apple-infesting race of R. pomonella and the snowberry-infesting R. zephyria, have sympatric distributions, and the fruiting times of their preferred hosts overlap widely. However, apple and snowberry fruits contain distinct secondary metabolites that have toxic effects on herbivorous insects and may facilitate adaptive divergence in Rhagoletis.

In their new Molecular Ecology paper, Ragland et al. performed reciprocal transplants and measured variation in performance (larval survivorship, larval development time, and pupal mass) and gene expression in fly larvae of the apple-infesting R. pomonella and R. zephyria raised on different host fruits. The aim of the study was to examine the plastic and evolved performance differences between R. pomonella and R. zephyria, which hybridize at low frequency in the field but remain genetically and morphologically distinct.

Posted in adaptation, evolution, speciation, transcriptomics | Leave a comment

Gorillas (genomes) in the mist

Mountain gorillas are an endangered great ape subspecies that number around 800 individuals, inhabiting mountain ranges in central Africa. They have been the subject of numerous field studies, but few genetic analyses have been carried out.

Xue et al. (2015) sequenced whole genomes from wild individuals. Unlike other great apes, mountain gorillas had not been previously studied on a genome-wide scale, despite severe population bottlenecks and reports of phenotypic indicators of inbreeding.

© worldwildlife.org

Recent declines in the mountain gorilla population have led to extensive inbreeding. Xue et al. found that chromosomes were typically homozygous over one-third of their length, much more than in even severely inbred human populations.

Concern has mounted about the survival of mountain gorillas, particularly with regard to human encroachment on their habitat. This is all the more troubling as high levels of inbreeding may render populations less resilient to environmental change and pathogens.

However, the origins of this condition [an increased burden of deleterious mutations and low genetic diversity stemming from several recent generations of inbreeding] extend far into their history, because both eastern subspecies have experienced a long decline over tens of millennia.

Xue et al. point to the “unhappy resemblance” between the demographic histories of mountain and eastern lowland gorillas and those inferred for Neandertals before they became extinct.

However, they also note that gorilla subspecies have survived for thousands of generations at very low population sizes, and may have developed strategies to avoid inbreeding, such as natal dispersal.

Genomic resources will aid conservation efforts and future research aimed at preventing mountain gorilla extinction.

References

Xue, Y et al. (2015) Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding. Science 348 (6231): 242-245. doi: 10.1126/science.aaa3952

Posted in bioinformatics, conservation, evolution, genomics, natural history, next generation sequencing, primates | Leave a comment

Visualizing Linkage Disequilibrium in R

Patterns of linkage disequilibrium (LD) across a genome carry information about a population’s demographic and selective history. For instance, population bottlenecks predictably increase LD, LD between SNPs at loci under natural selection affects each other’s rates of adaptive evolution, and selfing or inbreeding populations accumulate LD (for an excellent review, see Slatkin 2008). Patterns of LD thus allow estimation of allele ages, ancestral admixture, and natural selection, among other processes.

In exploring some quick and dirty ways to characterize patterns of LD across the genome, here’s a simple tutorial for plotting LD in R (with a little help from PLINK). As with some earlier posts, I am using the 1000 Genomes data as an example. Here we explore variation in LD between the CEU (Utah residents with Northern and Western European ancestry) and YRI (Yoruba in Ibadan, Nigeria) individuals for whom pedigree information is available.

Patterns of linkage disequilibrium across a section of chromosome 6 between the CEU and YRI populations.

To obtain PED and MAP files: The 1000 Genomes Project has a neat tool for obtaining PED (pedigree) and SNP information files from VCF (variant call format) files; more information on this can be found here. I downloaded PED and “.info” files for the example chromosome 6 data for the CEU and YRI populations, between chromosome coordinates 6:46620015 and 6:46620998. The “.info” file has only two columns (the name of the SNP and its coordinate) and has to be modified to make a MAP file that can then be used with PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) to obtain LD statistics. To do this, edit the file to add a first column with the chromosome number (here 6) and a third column of 0’s, so that the four columns are chromosome, SNP name, genetic distance, and base-pair coordinate. For more information on MAP files, see this link. Now, to calculate LD statistics (here ‘r’), I use PLINK at the command line (in Unix here) with:

$ ./plink --file ceu --noweb --r --allow-no-sex
$ mv plink.ld ceu.ld 
$ ./plink --file yri --noweb --r --allow-no-sex
$ mv plink.ld yri.ld

This should produce two files, “ceu.ld” and “yri.ld”, which contain pairwise LD estimates between the SNPs in each MAP file (within PLINK’s default reporting window). For more information on the “r” statistic, and how to estimate LD across different window lengths in PLINK, I refer you to this page. Now onto plotting these in R:

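# Read the pairwise LD tables written by PLINK
yri <- read.table("yri.ld", header = TRUE)
ceu <- read.table("ceu.ld", header = TRUE)
# Sort SNP pairs by the physical distance separating them
yri1 <- yri[order(yri$BP_B - yri$BP_A), ]
ceu1 <- ceu[order(ceu$BP_B - ceu$BP_A), ]
# Plot r^2 against distance: YRI in red, CEU in blue
plot(yri1$BP_B - yri1$BP_A, yri1$R^2, type = "l", col = "red",
     ylim = c(0, max(ceu1$R^2, yri1$R^2, na.rm = TRUE)), lwd = 2,
     xlab = "Distance between SNPs (bp)", ylab = "LD (r^2)")
lines(ceu1$BP_B - ceu1$BP_A, ceu1$R^2, col = "blue", lwd = 2)
legend("topright", c("CEU", "YRI"), lwd = c(2, 2), col = c("blue", "red"))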

And voilà! A simple LD plot. You should be able to play around with cut-off lines, plotting multiple populations, etc. from here on.
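The manual MAP-file edit described earlier can also be scripted. Here is a minimal sketch in R, assuming the “.info” table has exactly two columns (SNP name, then coordinate). The helper name info_to_map, the made-up SNP names, and the temporary output file are all hypothetical and only there for illustration.

```r
# Hypothetical helper: build a PLINK MAP table from a two-column ".info" table.
# PLINK MAP columns: chromosome, SNP name, genetic distance, base-pair position.
info_to_map <- function(info, chrom) {
  data.frame(chrom = chrom,
             snp   = info[[1]],   # SNP name
             cm    = 0,           # genetic distance unknown, so set to 0
             bp    = info[[2]])   # base-pair coordinate
}

# Small made-up table standing in for a real ".info" file
info <- data.frame(name = c("rsA", "rsB"),
                   pos  = c(46620015L, 46620998L))
map  <- info_to_map(info, chrom = 6)

# Write without header, row names, or quotes
map_file <- tempfile(fileext = ".map")
write.table(map, map_file, quote = FALSE, row.names = FALSE, col.names = FALSE)
```

For a real file you would read the table in with read.table() first; whether it has a header row is something to check rather than assume.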

References:

Slatkin, Montgomery. “Linkage disequilibrium—understanding the evolutionary past and mapping the medical future.” Nature Reviews Genetics 9.6 (2008): 477-485.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

Posted in bioinformatics, howto, population genetics, R | 1 Comment

Don’t trust your data: reviewing Bioinformatics Data Skills

Image by Tau Zero

There is little debate on the importance of bioinformatics for the present and future of science. As molecular ecologists, we are likely more aware of this than researchers in most disciplines, given the data explosion that has accompanied the wide application of next-generation sequencing methods. However, many of you (like me!) might be caught in an awkward area of bioinformatics expertise: too late to have had these basics included in your undergraduate/graduate courses and too early to hire a freelance bioinformatician with your fat grant.

So there you are, staring at a bunch of FASTA files wondering how to use someone’s poorly documented Python scripts. Maybe you have a tear in your eye and worry in your heart. You want to get your data from point A to B, but you realize that you are trying to use a tool without any understanding of its underlying concepts and the whole thing is written in a foreign language. Imagine implementing an ANOVA without any understanding of the normal distribution while all the software menus are in Russian.

Posted in bioinformatics, book review, genomics, software | 1 Comment

A call for statistical editors in ecology

Pop quiz for all you potential “statistical editors” out there: What’s this?

A new article in TREE wants to add a specialized reviewer to the peer review process. von Wehrden, Schultner, and Abson suggest that a statistical editor would expedite* the peer review process:

“The review process of a manuscript with imperfect statistics typically takes several months, while a statistical editor could return the manuscript to authors within days or weeks”

It’s a noble idea, but I don’t think that it will work.

Posted in Uncategorized | 3 Comments

Molecular Ecology’s best reviewers 2015

A laurel wreath in the middle of nowhere (Flickr: Kathrin & Stefan Marks)

As a continuation of our post from last year, Molecular Ecology is publishing a list of our very best referees from the last two years (2013 and 2014). Our hope is that the people listed below will put ‘Top Reviewer for Molecular Ecology 2015’ on their resume, and that this will highlight to search committees and granting agencies that they have made a significant contribution to the community as a reviewer.

Everyone who completed a review for Molecular Ecology between 1st January 2013 and 31st December 2014 was eligible, and people were ranked by an index that included the number of reviews completed, the proportion of accepted review requests that led to a review being returned (excluding unassignments before the two week deadline), and the average time taken per review if this was over two weeks. The top 300 (~ 8%) are listed below the fold – thank you so much for your efforts!

Posted in housekeeping, Molecular Ecology, the journal, peer review, science publishing | Leave a comment

A transcriptomic approach for reduced representation in population genomics

The purple sea urchin, Strongylocentrotus purpuratus, the focal taxon of Pespeni et al. 2013

Many population genomics studies use methods that provide a reduced representation of the genome, for example RADseq or UCEs. Targeting a subset of the genome reduces the cost of sequencing and makes data analyses less computationally intensive. In their recent Molecular Ecology paper, De Wit, Pespeni, and Palumbi (2015) suggest using expressed gene sequences (i.e. transcriptomic data) as a reduced representation method for population genomic studies. The review walks through the strengths (greater accuracy of functional annotation) and weaknesses (potential biases such as allele-specific expression) of this approach.

This review summarizes current methods, identifies ways that using expressed sequence data benefits population genomic inference, and explores how current practitioners evaluate and overcome challenges that are commonly encountered. We focus particularly on the additional power of functional analysis provided by expressed sequence data and how these analyses push beyond allele pattern data available from non-function genomic approaches.

The review contains sections on 1) assembly quality, 2) marker development and genotyping after assembling the transcriptome, and 3) applications of expressed sequence datasets. Based on a summary of the current literature, the authors also provide a “best-practice” pipeline that starts with experimental design and sequencing platform choice and moves to raw data quality control, transcriptome assembly and evaluation, ending with a reference transcriptome ready for SNP discovery. Below I will highlight a few points I found to be interesting or useful from each section and conclude with a few new questions the review raised in my mind.

Assembly quality

Poorly assembled transcriptomes can create false SNPs, where nucleotide differences between paralogous sequences are mistaken for polymorphisms, and can omit real SNPs when allelic differences are treated as belonging to two separate genes instead of one. These errors can be reduced using tBLASTn searches of a high-quality protein database from a closely related species against your translated sequences, which reveal incorrectly collapsed contigs when multiple orthologous proteins from the reference match a single collapsed contig. In the absence of a well-annotated reference, the authors suggest a conservative approach that keeps allelic variants as separate “genes.” This will reduce the number of false positives, albeit at the expense of potentially missing real SNPs.

Normalizing libraries to correct for extremely highly expressed genes reduces the number of sequencing errors. Normalization can be done bioinformatically after sequencing with programs such as khmer. However, the effect of digital normalization on qualitative measurements of assembly (such as percent of conserved orthologs identified) has not yet been tested. Normalizing the libraries during the library prep stages will also reduce the most abundant transcripts, thereby increasing the representation of lowly expressed and potentially interesting transcripts. However, normalization at this stage prevents the collection of quantitative information about transcript abundance, making expression analyses impossible.

Marker development and genotyping

Population genomics studies often use pooled population-wide data to directly estimate allele frequencies. Ideally, both alleles at a given locus in a heterozygous individual would be equally represented in the data, as would each individual in a pooled sample. However, data are often skewed for technical or biological reasons, such as PCR artifacts or allele-specific expression (when one allele is more highly expressed than the other), respectively. There is currently an active debate about how these factors could mislead genotype estimates. De Wit et al. advocate estimating allele-specific expression before analyzing pooled data and outline several approaches to do so.
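To make the idea of allele-specific expression concrete, here is a minimal sketch in R. It is not necessarily one of the approaches De Wit et al. outline: it assumes you already have read counts for each allele at known heterozygous sites in one individual, and it simply tests each site against the 50:50 expectation with an exact binomial test. The site names and counts are made up.

```r
# Hypothetical read counts for the two alleles at heterozygous sites
counts <- data.frame(site = c("snp1", "snp2", "snp3"),
                     ref  = c(48, 90, 15),
                     alt  = c(52, 10, 17))

# Exact binomial test of each site against the 50:50 expectation
counts$p <- mapply(function(r, a) binom.test(r, r + a, p = 0.5)$p.value,
                   counts$ref, counts$alt)

# Correct for testing many sites (Benjamini-Hochberg false discovery rate)
counts$q <- p.adjust(counts$p, method = "BH")

# Flag sites whose allelic ratio departs from 1:1
counts$imbalanced <- counts$q < 0.05
print(counts)
```

A plain binomial test ignores technical overdispersion and mapping bias toward the reference allele, so treat it as a first-pass screen only.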

Applications

The red abalone, Haliotis rufescens, the focal taxon of De Wit et al. 2014. Photo courtesy of Kevin Lee, asnailodyssey.com

Population genomics is often interested in finding outlier loci under selection. Selection may act on many loci within regulatory or metabolic networks in such a way that individual loci fail to meet the stringent significance thresholds of outlier tests. However, De Wit et al. point out that by testing whether loci with high Fst are non-randomly clustered into functional categories, it is possible to infer the role of selection despite the absence of individually significant loci (for examples, see Pespeni et al. 2013 and De Wit et al. 2014).
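The clustering test can be sketched with a 2x2 contingency table. The example below is an illustration under my own assumptions, not the procedure used in the cited papers: loci are flagged as “high Fst” by some cutoff, each locus either belongs to a functional category or does not, and a one-sided Fisher’s exact test asks whether the category is over-represented among high-Fst loci. All counts are made up.

```r
# Made-up counts: 1000 loci, 40 annotated to one functional category,
# and 50 loci flagged as having high Fst
in_category <- rep(c(TRUE, FALSE), times = c(40, 960))
high_fst    <- c(rep(c(TRUE, FALSE), times = c(8, 32)),    # inside the category
                 rep(c(TRUE, FALSE), times = c(42, 918)))  # all other loci

# 2x2 table of category membership against high-Fst status
tab <- table(in_category = in_category, high_fst = high_fst)
print(tab)

# One-sided Fisher's exact test: is the category over-represented
# among high-Fst loci?
enrich <- fisher.test(tab, alternative = "greater")
print(enrich$p.value)
```

In a real analysis the test would be repeated for every category, so the resulting P values need a multiple-testing correction such as p.adjust().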

A traditional use of transcriptomic data is to examine differences in gene expression among populations, which often play a role in population divergence and local adaptation. Combining expression data (determined from read counts) with SNP genotypes (determined from read sequences) allows for the examination of the functional role of SNPs in gene expression.

Final questions

As a scientist whose past and current research has focused on understanding the mechanisms responsible for allele frequency and gene expression differences among populations, it is extremely appealing to envision generating a single dataset that can be used to test both sets of questions. In their best practices pipeline, De Wit et al. suggest using sequence data collected across many different developmental phases, tissue types, sexes, and physiological stages for transcriptome assembly. They also suggest collecting 30-100 million reads to balance the coverage needed for gene discovery and downstream gene expression analyses with the accumulation of sequencing errors.

Determining how many and which individuals to sequence and the necessary depth of coverage are universal decisions for all NGS projects, and here De Wit et al. outline the gold standard experimental design for projects using transcriptomic data for population genomics. Ideally, all our experiments would meet this gold standard, but resources are often limiting, and this review made me wonder: can a single experimental design provide the data required to test hypotheses about population genomics AND differential gene expression? Or does one question have to take a back seat? How far can we deviate from the best practices outlined here before we lose confidence in our results? Furthermore, for which question (gene expression vs. allele frequency differences) would we lose power first?

De Wit et al. clearly demonstrate that determining SNP genotypes and differential gene expression simultaneously from transcriptomic data shows much promise for evolutionary ecology and I am excited to see how this approach evolves over time.

Reference

De Wit, P., Pespeni, M. H., & Palumbi, S. R. (2015). SNP genotyping and population genomics from expressed sequences – current advances and future possibilities. Molecular Ecology. DOI: 10.1111/mec.13165

Posted in genomics, howto, methods, Molecular Ecology, the journal, next generation sequencing, RNAseq | Leave a comment

The fickleness of P?

Halsey and colleagues (2015) raise an important issue regarding a certain letter with which we all are familiar:

© flickr

They describe the sample-to-sample variability in the P value as a major cause of lack of repeatability that is not generally considered. They explain

why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.

In their estimation, the omission of this variability reflects a general lack of awareness.

The statistical power of a test dramatically affects our capacity to interpret a P value and, as a consequence, the result of the test.

I’ve been thinking more about power, with specific regard to molecular ecology and accurate sampling of organisms with complex life cycles (also see my interview with Sean Hoban and some of his work, also highlighted here).

The authors provide some background on the misunderstandings about P:

If statistical power is limited, regardless of whether the P value returned from a statistical test is low or high, a repeat of the same experiment will likely result in a substantially different P value and thus suggest a very different level of evidence against the null hypothesis.

To demonstrate this, they draw samples from two normally distributed populations that they know to differ. Over replicate experiments on such subsamples (in practice, we would likely perform only one experiment), the P values vary quite a bit (see Figure 2, Figure 4)!

Only when the statistical power is at least 90% is a repeat experiment likely to return a similar P value, such that interpretation of P for a single experiment is reliable.
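The same kind of exercise is easy to try yourself. Here is a minimal sketch in R, not the authors’ code: the true difference of 0.5 standard deviations, the sample sizes, and the number of replicates are arbitrary choices for illustration.

```r
set.seed(42)

# One "experiment": two samples of size n from normal populations whose
# means truly differ by `delta` standard deviations; returns the t-test P value
one_experiment <- function(n, delta) {
  x <- rnorm(n, mean = 0)
  y <- rnorm(n, mean = delta)
  t.test(x, y)$p.value
}

# Repeat the same experiment 1000 times at a small and a large sample size
p_small <- replicate(1000, one_experiment(n = 10,  delta = 0.5))
p_large <- replicate(1000, one_experiment(n = 100, delta = 0.5))

# Spread of P across replicates, and the share below 0.05 (empirical power)
print(quantile(p_small, c(0.1, 0.5, 0.9)))
print(quantile(p_large, c(0.1, 0.5, 0.9)))
print(c(power_small = mean(p_small < 0.05),
        power_large = mean(p_large < 0.05)))

# Analytical power for comparison, from base R's power.t.test()
print(power.t.test(n = 10,  delta = 0.5, sd = 1)$power)
print(power.t.test(n = 100, delta = 0.5, sd = 1)$power)
```

Comparing the two sets of quantiles shows how much more the P value bounces around between replicates when the sample, and therefore the power, is small.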

We usually want to find the direction of an effect, as well as its size and also its precision. Halsey et al. (2015) advocate for the increased use of effect size and its 95% CIs.
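As a sketch of what reporting an effect size with its 95% CI can look like in R, using made-up data: the difference in means and its confidence interval come straight from t.test(). The standardized effect size (Cohen’s d with a pooled standard deviation) is my own addition for illustration, not something the post attributes to the paper.

```r
set.seed(7)

# Made-up measurements for two groups (assumed roughly normal)
a <- rnorm(30, mean = 10, sd = 2)
b <- rnorm(30, mean = 11, sd = 2)

tt <- t.test(b, a)   # Welch two-sample t-test

# Effect size on the original scale: difference in means with its 95% CI
diff_means <- unname(tt$estimate[1] - tt$estimate[2])
ci         <- tt$conf.int

# Standardized effect size (Cohen's d with a pooled standard deviation)
sd_pooled <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                  (length(a) + length(b) - 2))
cohen_d   <- diff_means / sd_pooled

print(c(difference = diff_means, lower = ci[1], upper = ci[2], d = cohen_d))
```

The interval conveys direction, size, and precision in one statement, which a lone P value does not.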

Discovering that P is flawed will leave many scientists uneasy. As we have demonstrated, however, unless statistical power is very high (and much higher than in most experiments), the P value should be interpreted tentatively at best. Data analysis and interpretation must incorporate the uncertainty embedded in a P value.

References

Halsey LG, Curran-Everett D, Vowler SL, Drummond GB (2015) The fickle P value generates irreproducible results. Nature Methods 12, 179–185. doi:10.1038/nmeth.3288

Posted in Uncategorized | 1 Comment

d(N)eutralist < d(S)electionist Part 4

Continuing our discussion of the neutralist-selectionist debate, recent findings by Schrider et al. (2015) bring us to the topic of selective sweeps and their genomic signatures in a population. As we have discussed in previous posts, numerous studies since the proposal of the neutral theory (Kimura 1968) have provided evidence for the fixation of beneficial mutations by positive selection and for their roles in adaptive evolution. While several mechanisms have been proposed for driving positively selected alleles to fixation (see my previous post here for some thoughts on the effects of recombination in adaptive evolution), a very plausible one, with evidence growing by the day, is the selective sweep: the quick rise to fixation of a beneficial allele in a population (due to positive selection), and the accompanying depletion of linked neutral diversity around the allele (due to genetic hitchhiking). Sweeps are classified as hard (initial frequency of the beneficial allele = 1/2N), soft (initial frequency > 1/2N, because the allele was present near neutrality in the population until some perturbation, often environmental, set off the sweep), or partial (or incomplete, wherein the beneficial allele has yet to reach fixation in the population). Their detection has been used extensively in recent years to describe signatures of selection across the genome.

Reduction in heterozygosity at a hitchhiking neutral locus – from a now classic manuscript by Maynard-Smith and Haigh (1974). Image courtesy: http://dx.doi.org/10.1017/S0016672308009579

Signatures of selection can be described using several summary statistics, including polymorphism levels, site-specific diversity, haplotype diversity, Tajima’s D, LD-based statistics, etc. Schrider et al. (2015) use simulations to assess how well these summary statistics characterize selective sweeps. In short, all of the summary statistics rely on (a) the depletion of genomic diversity around a selected site (e.g., see Figure 2 from Maynard Smith and Haigh 1974 above), and (b) haplotype diversity: a recent hard sweep should produce one “fixed” haplotype at high frequency around the selected site, whereas soft or incomplete sweeps should result in multiple haplotypes at intermediate frequencies around the selected site. But through recombination between the selected allele and a neutral allele, a not-so-recent hard sweep can still produce multiple haplotypes at intermediate frequencies. Methods to detect sweeps would thus wrongly classify these as soft or partial sweeps, a phenomenon the authors term the “soft shoulder” effect. To describe this effect, the authors perform coalescent simulations under different sweep scenarios, varying (a) the initial frequency of the sweeping allele, (b) the time(s) of the sweeps, and (c) the selection coefficients. Analyses of several summary statistics give unanimous support for the “soft shoulder” effect, with numerous false positives for soft/partial sweeps at sites linked to hard sweeping alleles. The authors thus recommend interpreting genome-wide scans for positively selected sites (and sweeps) with care, and propose several suggestions:

  1. Analyzing flanking regions to detect selection (and sweeps), rather than only the region immediately surrounding the selected site.
  2. Applying methods that account for polymorphism, allele frequency, haplotype diversity, and LD-based statistics.
  3. Accounting for gene conversion rates.
  4. Importantly, checking for evidence of a nearby hard sweep whenever a soft/partial sweep is found, to rule out the “soft shoulder” effect.

Reference:

Schrider, Daniel R., et al. “Soft shoulders ahead: spurious signatures of soft and partial selective sweeps result from linked hard sweeps.” Genetics (2015): genetics-115. http://dx.doi.org/10.1534/genetics.115.174912

Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23-35.

Kimura, Motoo. “Evolutionary rate at the molecular level.” Nature 217.5129 (1968): 624-626.

Posted in adaptation, evolution, mutation, population genetics, selection, theory | 1 Comment

Live fast and reproduce young

Here is one for the “simple, elegant science” folder: a new paper in PNAS by Julia Schroeder and colleagues that demonstrates a fitness disadvantage in offspring from older parents. While there are a multitude of papers out there showing that gametes have reduced quality as an organism ages, this new work is the first to demonstrate this phenomenon in a natural system.

Schroeder et al. show that a parent’s age has no effect on the longevity of their offspring, but the offspring of older parents have lower reproductive success over their lifetime. In addition, these effects are sex-specific: older males negatively affected their sons and older females negatively affected their daughters. To ensure that these effects weren’t primarily caused by the rearing environment, some of the offspring were moved to different parents before hatching.

Our results challenge the currently favored hypothesis in evolutionary biology and behavioral ecology that old age signals high quality in mating partners. Our results imply a substantial cost of reproducing with older, rather than younger, partners. The results inform increasing concern about delayed reproduction in medicine, sociology, and conservation biology.

Schroeder J., Nakagawa S., Rees M., Mannarelli M.E. & Burke T. (2015). Reduced fitness in progeny from old parents in a natural population, Proceedings of the National Academy of Sciences, 201422715. DOI: http://dx.doi.org/10.1073/pnas.1422715112

Posted in pedigree, population genetics, societal structure | Leave a comment