Identifying and correcting errors in draft genomes

Cumulative number of genomes sequenced over the past 3 decades (figure by Greg Zynda http://gregoryzynda.com/)


Over the past decade we have seen an exponential increase in the number of sequenced, assembled, and annotated genomes. These genomes are essential for pretty much any genomics research. If you want to sequence the genome, transcriptome, epigenome, or whatever-ome of your super-special study species and population, you’ll need (or at least want!) a pretty solid (read: well-annotated) reference genome to which to align your sequence data.
Fortunately for you, genomicists have been sequencing pretty much any genome that they can get their hands on. Unfortunately, these genomes are first published in “draft” form and come with a multitude of potential errors. These errors are highlighted in a recent paper by James Denton and colleagues. Here’s the one-sentence summary of their paper:

Low-quality assemblies result in low-quality annotations, and these annotation errors cause both the over- and under-estimation of gene numbers.

The good news is that:

many genome assemblies and annotations have improved over time due to further efforts aimed at both increasing sequence contiguity and adding functional data (e.g. RNA-seq) in order to correct gene models.

… but the bad news is that:

it is often the case that a great deal of research will be based upon the draft assembly before it has reached a finished state, and erroneous conclusions may result.

More specifically, in this paper the authors compared the most up-to-date genomes (from fruit flies to chickens to chimpanzees) to their draft-genome predecessors. What they found was that:

low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved”* gene models)… Upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes.

(*”cleaved” gene models are those in which multiple genes are estimated from sequences that actually came from just one gene.)
Their findings make sense. If you are sequencing fragments of the genome then the prediction algorithms will be more likely to assign fragments from different exons, which may be far apart, to different genes. These cleaved gene models lead to an overestimation of single-exon genes and a depletion of multi-exon genes.
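To make the cleaving idea concrete, here is a minimal Python sketch. It is not the method used by Denton and colleagues; it assumes a toy predictor that can only join exons lying on the same contig, and the gene coordinates, breakpoints, and function names are all made up for illustration.

```python
from bisect import bisect_right

def predicted_models(genes, breakpoints):
    """Return one list of exons per predicted gene model.

    genes: list of genes, each a list of (start, end) exon intervals.
    breakpoints: sorted coordinates where the draft assembly is broken
    into separate contigs. Assumption: no exon spans a breakpoint, and
    exons on different contigs can never be joined into one model.
    """
    models = []
    for exons in genes:
        by_contig = {}
        for start, end in exons:
            contig = bisect_right(breakpoints, start)  # index of the contig holding this exon
            by_contig.setdefault(contig, []).append((start, end))
        # Each contig that holds part of the gene yields its own ("cleaved") model.
        models.extend(by_contig.values())
    return models

def summarize(models):
    single = sum(1 for m in models if len(m) == 1)
    return {"genes": len(models), "single_exon": single, "multi_exon": len(models) - single}

# Two hypothetical genes with three exons each.
genes = [
    [(100, 200), (5_000, 5_100), (12_000, 12_150)],
    [(20_000, 20_300), (21_000, 21_100), (30_000, 30_200)],
]

finished = summarize(predicted_models(genes, breakpoints=[]))
draft = summarize(predicted_models(genes, breakpoints=[3_000, 10_000, 25_000]))

print(finished)  # {'genes': 2, 'single_exon': 0, 'multi_exon': 2}
print(draft)     # {'genes': 5, 'single_exon': 4, 'multi_exon': 1}
```

In this toy case, three assembly breaks turn two multi-exon genes into five predicted genes, four of them single-exon, which is the direction of bias described above.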
Alas, there is hope, and this hope comes in the form of RNA-sequencing. The authors found that paired-end RNA-sequencing improves the annotation of genomes by connecting the cleaved genes.
Overall, this suggests that draft genomes should be used and interpreted with caution. If you can, improve the annotation by sequencing your organism’s transcriptome.
Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, et al. (2014) Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies. PLoS Comput Biol 10(12): e1003998. doi: 10.1371/journal.pcbi.1003998

Posted in Uncategorized

Et tu, Brute? Black-legged ticks use genes co-opted from bacteria to fight bacterial infection

Female black-legged tick, Ixodes scapularis. Photo courtesy of Brian Leydet.


Horizontal gene transfer occurs when genes are passed between individuals by mechanisms other than reproduction. It is common in bacteria and occasionally happens between highly divergent groups (for example, monocot genes transferred to eudicots, fungal genes transferred to aphids, bacterial genes transferred to cnidarians, and my personal favorite, algal genes transferred to a sea slug, making it photosynthetic!).
A possible benefit for a eukaryotic recipient of a horizontally transferred bacterial gene is the ability to produce antibacterial compounds, which have evolved in prokaryotes through competition. In their recent Nature paper, Chou et al. showed type VI secretion amidase effector (tae) genes have been transferred from bacteria to eukaryotes at least six times and that domesticated versions of the gene (dae) have been selectively retained in eukaryotes for millions of years.
The authors also presented several lines of evidence supporting the adaptive function of dae genes transferred to the black-legged tick* (Ixodes scapularis), a vector for Borrelia burgdorferi, which causes Lyme disease: i) the dae2 gene is expressed in the nymph and adult phases of the black-legged tick, ii) Dae2, a protein isolated from the tick, is found in tick salivary glands and the mid-gut, locations where B. burgdorferi would be encountered by the tick during transmission of the bacteria via a blood meal, and iii) Dae2 proteins isolated from the tick have antimicrobial properties (more specifically, the proteins can degrade bacterial cell walls). Essentially, dae2 contributes to the innate ability of the tick to control B. burgdorferi levels after infection.
Taken together, these results suggest dae2 protects ticks from infection by the very organism from which the gene was horizontally transferred.
*Just a little taxonomic aside for your Monday morning: Ixodes scapularis isn’t actually the deer tick! According to Brian Leydet, a postdoctoral fellow at the Trudeau Institute who studies I. scapularis and its associated pathogens, “black-legged tick is the accepted common name for I. scapularis as determined by the Entomological Society of America. Deer tick was the common name for Ixodes dammini, once considered a separate species, but now a junior synonym of Ixodes scapularis.” Beware, nonetheless. Both I. scapularis and I. dammini are vectors for Lyme disease.
Chou, Seemay, et al. “Transferred interbacterial antagonism genes augment eukaryotic innate immune function.” Nature (2014). DOI: 10.1038/nature13965

Posted in adaptation, genomics, horizontal gene transfer, microbiology

"Hurrah! Hurrah!" DNA barcoding and the lost story of Darwin's meadow

Charles Darwin’s famous “thinking path” at Down House, left, is adjacent to Great Pucklands Meadow, right, where Darwin carried out what is perhaps among the first intentional, comprehensive species counts in a geographically defined area in history. A team from the Natural History Museum in London resurveyed the meadow in 2007. Photo: Karen James


Five years ago, I was a co-author on a consortium paper in PNAS that recommended two genes to serve as universal markers for DNA-based identification (DNA barcoding*) of plants. Five years ago, the world celebrated Charles Darwin’s 200th birthday. You might not think those two events are related, but they are. Here’s how.
The paper marked the end of a very long road. DNA barcoding as an international, standardized endeavor got underway in earnest around 2005, but the gene chosen and officially endorsed by the Consortium for the Barcode of Life (CBOL) to serve as the DNA barcode for animals, CO1, is not variable enough to discriminate among plant species. The CBOL Plant Working Group was formed to solve this problem by finding an alternative gene or genes (or between-gene region or regions) for plants. We spent four years generating, pooling, analyzing, and, hardest of all, making a decision based on a very large data set of PCR amplification success rates and DNA sequences from seven candidate genes (or regions) from 907 specimens from 550 species representing the major groups of land plants. You can learn more about the study, the results, and our recommendation to the DNA barcoding community in the (open access) paper in PNAS.
What you can’t learn more about in the paper is the story of the 138 specimens that a group of us from the Natural History Museum in London, where I was a postdoctoral researcher at the time, contributed towards that 907-specimen total. Because the paper was about what the combined data from multiple researchers could tell us about prospects for DNA barcoding in plants, it wasn’t the time or place to tell the stories of researchers’ individual projects. And because the narrative of a project can’t really be published on its own without (as yet unpublished) data to go along with it, I was afraid this meant the story of our project would go unpublished, which would be a real shame because it’s a good story.

Posted in DNA barcoding, natural history

Week in review, 6 December 2014


It’s been a busy week at The Molecular Ecologist! Here’s a roundup of our latest posts:
Melissa pointed out a study of compensatory evolution in yeast, in which natural selection found a way around the loss of many different genes.
Noah described how genetic pedigree reconstruction illuminates the social structure of hamadryas baboons.
Rob pointed out a cool new review of isolation-by-environment, the adaptationist cousin of isolation-by-distance.
Arun went in depth to explain some new data about the evolution of recombination in the recent ancestors of modern humans.
Stacy described methods for genetic detection of exotic gene flow from forest plantations into populations of native trees.

Posted in linkfest

Exotic gene flow surveillance

Exotic forest plantations often cover large areas and, as such, may contribute female gametes, male gametes and/or zygotes to native stands. In seed plants, these three components of exotic gene flow have not been distinguished, though they will have different genetic and demographic consequences. For example, zygotic gene flow, in which exotic mothers are pollinated by exotic fathers, may result in heterozygote deficiency. In contrast, male (in which native mothers are pollinated by exotic fathers) and female gametic (in which exotic mothers are pollinated by native fathers) gene flow will generate heterozygote excesses.
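The opposite signs of these two effects can be checked with a little arithmetic. The sketch below is not the ESPM model of Unger et al.; it is a one-locus, two-allele illustration that assumes random mating within each cross type, and the allele frequencies and migration fractions are invented for the example.

```python
def inbreeding_coefficient(p_native, p_exotic, zygotic=0.0, gametic=0.0):
    """F = 1 - H_obs / H_exp for a seed cohort sampled in the native stand.

    p_native, p_exotic: frequency of one allele in each source (hypothetical).
    zygotic: fraction of seeds with two exotic parents.
    gametic: fraction of seeds with one native and one exotic parent.
    The remaining seeds have two native parents.
    F > 0 means heterozygote deficiency; F < 0 means heterozygote excess.
    """
    local = 1.0 - zygotic - gametic
    if local < 0:
        raise ValueError("zygotic + gametic must not exceed 1")

    # Heterozygote frequency expected within each cross type.
    het_native = 2 * p_native * (1 - p_native)
    het_exotic = 2 * p_exotic * (1 - p_exotic)
    het_hybrid = p_native * (1 - p_exotic) + p_exotic * (1 - p_native)

    h_obs = local * het_native + zygotic * het_exotic + gametic * het_hybrid

    # Allele frequency in the pooled cohort, and the heterozygosity it predicts.
    p_pooled = local * p_native + zygotic * p_exotic + gametic * (p_native + p_exotic) / 2
    h_exp = 2 * p_pooled * (1 - p_pooled)
    return 1 - h_obs / h_exp

f_zygotic = inbreeding_coefficient(0.9, 0.1, zygotic=0.2)  # positive: deficiency
f_gametic = inbreeding_coefficient(0.9, 0.1, gametic=0.2)  # negative: excess
print(f_zygotic, f_gametic)
```

Under these assumptions, zygotic immigrants create a Wahlund-type deficit, because pure exotic and pure native genotypes are pooled, while gametic gene flow produces first-generation hybrids that carry more heterozygotes than the pooled allele frequency predicts.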

Maritime pine. © mdo.emecstudios.com


Unger et al. (2014) present an approach in which uni- and biparentally inherited markers (a set of chloroplast and nuclear microsatellites) were used to estimate contemporary exotic gene flow into two relict pine stands, Pinus pinaster and P. sylvestris. The authors built upon the Estimation of Seed and Pollen Migration Rates model (ESPM) from Robledo-Arnuncio (2012) by combining the two types of markers to estimate the three types of gene flow.

The method [of decomposing the total gene flow rate into the zygotic and gametic components] should thus be useful for plant ecologist and ecosystem managers … In [their] two-population scenarios, these double-migration events translate into the arrival of seeds born to an external mother and a local father (i.e., female gametic gene flow only), while in systems with three or more populations to which the model could be generalized, they could also involve immigration of seeds born to two external parents from different populations (i.e., female and male gametic gene flow from different populations).

The results of this study represent the first steps toward genetically monitoring exotic forest plantations and their genetic and demographic consequences for native stands. Future work in native and exotic stands is necessary in order to measure levels of inbreeding depression (particularly in native stands), genetic variation at ecologically relevant traits, and fitness differences between native, exotic and putative hybrids.
Robledo-Arnuncio JJ (2012) Joint estimation of contemporary seed and pollen dispersal rates among plant populations. Molecular Ecology Resources 12, 299-311. doi: 10.1111/j.1755-0998.2011.03092.x
Unger GM, Vendramin GG, Robledo-Arnuncio JJ (2014) Estimating exotic gene flow into native pine stands: zygotic versus gametic components. Molecular Ecology 23: 5435-5447. doi: 10.1111/mec.12946

Posted in conservation, Molecular Ecology, the journal, population genetics, Uncategorized

The Evolution of Recombination

In a recent publication, Lesecque et al. (2014) provide key evidence that fills in some of the blanks in an age-old question: how do recombination hotspots evolve? Their analyses of major target sites of PRDM9 (a polymorphic zinc finger protein whose DNA-binding domain binds at recombination hotspots) across hominid genomes indicate several interesting patterns: (a) human hotspots are relatively young, (b) they have enhanced levels of GC-biased gene conversion, (c) Denisovans have different hotspots than humans, despite relatively recent divergence and shared PRDM9 target motifs, and evidently, (d) recombination hotspots have a fast turnover rate, indicating strong support for what’s come to be known as the ‘Red Queen Theory’ of the evolution of recombination hotspots.
In light of these neat findings, I thought it would be interesting to revisit some classical theoretical considerations on the bigger picture – the evolution of recombination itself.

 Rapid turnover of recombination hotspots: the comparative analysis of modern and archaic human genomes supports a model of Red Queen evolution.

Image Credit: Pauline Sémon doi:10.1371/image.pgen.v10.i11.g001


The primary evolutionary advantage of recombination has to do with the propagation and eventual fixation of new and/or recurrent favorable mutations (and maintaining diversity at the population level) at multiple loci, something that is slow and less likely in a non-recombinant population (where two favorable mutations can be fixed together only if the second arises in a descendant of an individual that already carries the first). Thus a sexually recombining population should predictably evolve faster, calling for selection for the recombination machinery (Fisher (1932), Muller (1932)). On the flip side, non-recombining populations also accumulate deleterious mutations faster, in a phenomenon commonly called ‘Muller’s ratchet’ (1964).
Consequently, the rate of evolutionary change prompted by recombination depends on (a) initial frequency of mutant alleles – Crow and Kimura (1965), (b) linkage between loci – Maynard Smith (1968), (c) effective size of the population (and drift) – Otto and Barton (2007), and (d) strength of selection due to environmental change – explained in some very interesting theory by Hill and Robertson in what’s come to be known eponymously as the ‘Hill-Robertson Effect’ (1966) – see Felsenstein (1974) for review and references.
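Muller’s ratchet, mentioned above, is easy to watch in a toy simulation. The sketch below is only an illustration under simple assumptions (a Wright-Fisher asexual population, multiplicative fitness, at most one new deleterious mutation per offspring, no back mutation); the function name and parameter values are arbitrary choices, not taken from any of the papers cited here.

```python
import random

def mullers_ratchet(pop_size=50, mut_prob=0.5, sel_coeff=0.01, generations=200, seed=1):
    """Track the smallest mutation load in an asexual population over time."""
    rng = random.Random(seed)
    loads = [0] * pop_size  # deleterious mutations carried by each individual
    min_load_history = [0]
    for _ in range(generations):
        # Fitness declines multiplicatively with each mutation carried.
        weights = [(1 - sel_coeff) ** k for k in loads]
        # Each offspring copies one parent, chosen in proportion to fitness.
        parents = rng.choices(loads, weights=weights, k=pop_size)
        # No recombination: offspring inherit the parental load, plus
        # possibly one new mutation.
        loads = [k + (rng.random() < mut_prob) for k in parents]
        min_load_history.append(min(loads))
    return min_load_history

history = mullers_ratchet()
print(history[::50])
```

Because offspring can never carry fewer mutations than their single parent, the minimum load in the population can only stay put or "click" upward once the least-loaded class is lost by drift. Recombination could, in principle, rebuild a less-loaded genotype from two parents, which is the contrast the ratchet argument relies on.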

Posted in adaptation, genomics, mutation, population genetics, theory

Isolation by environment explains why the grass isn't always greener


Ever since Sewall Wright introduced isolation by distance in 1943, the interplay between genetic differentiation and geographic distance has been a foundational, sometimes frustrating, aspect of population genetics studies.
But distance isn’t just distance. The walk to my car isn’t any longer when rain is pouring down, but it sure feels that way.

Posted in methods, Molecular Ecology, the journal, population genetics

The genetics of another multi-level society

Hamadryas baboon female and infant
Photo by Noah Snyder-Mackler


Long-time readers (i.e., “for more than one week”) of The Molecular Ecologist will notice that this is the second post on the socio-genetics of a primate multi-level society. The first was Melissa’s post last week, which covered my recent paper on the genetics of the multi-level society of the gelada monkey. Now, there’s a new paper on the multi-level genetic structure of the gelada monkey’s close relative, the hamadryas baboon.
Hamadryas social structure
Multilevel societies are identified by two or more nested levels of organization. Hamadryas baboon society has four levels of organization. The smallest level, and core group, is the one-male unit (OMU), which is composed of (you guessed it) one male and multiple females. OMUs aggregate to form groups of increasing size from clans, the smallest aggregation of OMUs, to bands to troops, the largest aggregations of OMUs, which can contain several hundred baboons. So, at least on the surface of things, this society appears to be strikingly similar to that of the gelada monkeys. But are the genetic underpinnings the same?
Researchers at the Max Planck Institute for Evolutionary Anthropology and the Filoha Hamadryas Project set out to address this question. They genotyped 244 baboons at 1 Y-linked and 23 microsatellite loci and also sequenced part of the mitochondrial d-loop (HVR-1). These genetic data, in combination with years of behavioral data on association patterns, revealed that females dispersed farther than males, which suggests that it is closely related males that are the “glue” keeping clans together. Interestingly, they also found some evidence for limited dispersal among females, suggesting that they maintain bonds with close kin before and after dispersal.

We speculate that male philopatry at the clan level and female dispersal across one-male units and clans may enable both kin-based cooperation among males and the maintenance of kin bonds among females after dispersal.

So it appears that the closely related geladas and hamadryas baboons have converged on superficially similar, but fundamentally different multilevel societies.
Fun fact: Geladas and hamadryas baboons are sympatric in the highlands of Ethiopia. Here’s a photo that I took of a mixed herd of geladas and hamadryas baboons. A male hamadryas is walking away in the background and two female geladas are in the foreground:

Hamadryas baboon and gelada monkey foraging together
Photo by Noah Snyder-Mackler


I wonder what their hybrid socio-genetic structure would be?
References:
Städele V, Van Doren V, Pines M, Swedell L & Vigilant L (2014) Fine-scale genetic assessment of sex-specific dispersal patterns in a multilevel primate society. J. Hum. Evol.
Schreier AL & Swedell L (2009) The fourth level of social structure in a multi-level society: ecological and social functions of clans in hamadryas baboons. Am. J. Primatol. 71, 948–955.
Jolly CJ, Woolley-Barker T, Beyene S, Disotell TR & Phillips-Conroy JE (1997) Intergeneric hybrid baboons. Int. J. Primatol. 18, 597–627.

Posted in pedigree, societal structure

Compensatory evolution: a possible mechanism of population divergence

Dr. Horrible. Photo courtesy of io9.com


After spending my graduate career using genetic data to reconstruct historical demographic events, one of the things that excite me the most about my postdoc work is the opportunity to use experimental methods to make evolution happen (insert mad scientist laugh here). Manipulative experiments on organisms with short generation times are a great way to study how populations and their genomes adapt in response to mutation, selection, and/or environmental change (for a review see Barrick and Lenski 2013).

Posted in adaptation, genomics, mutation, yeast

Caught sweeping 'cross the sea

 

Sea lice are regularly monitored and counted on fish at a salmon farm. © cermaq.com


The salmon louse Lepeophtheirus salmonis is an ectoparasite linked to declines in wild salmonid populations as well as causing huge economic losses in salmon farms. Previous studies, using a variety of molecular markers, yielded conflicting results ranging from strong genetic differentiation among nearby farms to no structure across the entire North Atlantic.
Besnier et al. (2014) investigated the effects of anthropogenically driven rapid evolution of pesticide resistance using a SNP array. Emamectin benzoate, or EMB, is the most commonly used pesticide to control L. salmonis in the Atlantic.

 [EMB] resistance developed at a single source, and rapidly spread across the Atlantic [within a decade] … and importantly demonstrates that alleles conveying resistance to pesticides may be quickly spread over very large areas in the marine environment.

From a management perspective, this study demonstrates the necessity of ocean-wide policies, rather than management at the regional level.
From a population genetic perspective, the seascape in which L. salmonis is evolving is extremely heterogeneous:

 with patches of high host density in salmon farms and coastal areas, and large areas of low host density in the offshore regions.

Thus, standard population genetic protocols for investigating evolutionary dynamics may be difficult to apply, because they can yield biased estimates in these sea louse population(s).
Finally, from an evolutionary perspective, the strong selective sweeps detected in this study suggest that:

L. salmonis has a high capacity to spread new advantageous mutations across [ocean basins] in the time scale of just a few generations [max 11 years] … thus corroborating concerns that pesticide resistance can develop and rapidly spread over large areas on an ecological time-scale.

 
 
Besnier F, M Kent, R Skern-Mauritzen et al. (2014) Human-induced evolution caught in action: SNP-array reveals rapid amphi-atlantic spread of pesticide resistance in the salmon ectoparasite Lepeophtheirus salmonis. BMC Genomics 15: 937. http://www.biomedcentral.com/1471-2164/15/937
 
 

Posted in adaptation, genomics, mutation, next generation sequencing