A solution to the N50 misassembly problem

This is the fifth in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it, give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

The misassembly problem of N50 that we described in post #3 relates to the erroneous joining of small contigs into larger ones. This problem is more complicated and harder to solve than the filtering problem. Several solutions were proposed in the early 2010s:

  • corrected Nx (Salzberg et al. 2011)
  • contig path Nx over alignment graph (Earl et al. 2011)
  • normalized N50 (Mäkinen et al. 2012)
  • NAx (where A stands for Aligned) (Gurevich et al. 2013)

All these approaches are pretty similar and require a high-quality reference genome of the sequenced organism. Here we take a look at the definition of NAx.

To compute the NA50 of an assembly with respect to a reference genome, we align the contigs to the genome. Contigs containing erroneous joins (misassemblies) are split into aligned blocks, each of which aligns independently to a distinct fragment of the genome. If we compute the N50 statistic for the set of aligned blocks (rather than the initial set of contigs), we obtain the NA50 value. Note that even if some of the contigs fail to align, NA50 is still computed with respect to 50% of the total assembly length (which includes both aligned and unaligned contigs) to be compatible with N50.

Let’s try the NA50 approach using our hypothetical assembly from post #3. We had a Correct Assembly consisting of four contigs of 1 Mbp each (Fig. 6a), and an Incorrect Assembly where these four contigs had been misassembled into a single one (Fig. 6b). Their N50 values are 1 Mbp and 4 Mbp, respectively. Since the N50 of the Incorrect Assembly is four times larger, one can easily be fooled into thinking it’s better, but we will now evaluate this using NA50 instead.

Fig. 6a. Correct Assembly with N50 = 1 Mbp and 0 misassemblies.

Fig. 6b. Incorrect Assembly obtained by merging the contigs of the Correct Assembly. N50 = 4 Mbp, 3 misassemblies.
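The NA50 definition can be sketched in a few lines of Python. This is only an illustration, not the code of any particular tool: the function name `na50` is made up for this example, and we assume an aligner has already split the misassembled contig at its three breakpoints into four 1 Mbp aligned blocks.

```python
def na50(aligned_blocks, assembly_length):
    """Sketch of NA50: N50 over aligned blocks, with the 50% threshold
    taken from the total assembly length (aligned + unaligned contigs)."""
    threshold = 0.5 * assembly_length
    running = 0
    for length in sorted(aligned_blocks, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None  # aligned blocks cover less than half of the assembly


# Correct Assembly: four 1 Mbp contigs, each aligning as a single block.
correct_contigs = [1_000_000] * 4
correct_blocks = [1_000_000] * 4

# Incorrect Assembly: one 4 Mbp contig. ASSUMPTION: the aligner breaks it
# at the 3 misassembly points, giving four 1 Mbp aligned blocks.
incorrect_contigs = [4_000_000]
incorrect_blocks = [1_000_000] * 4

na50_correct = na50(correct_blocks, sum(correct_contigs))
na50_incorrect = na50(incorrect_blocks, sum(incorrect_contigs))
print(na50_correct, na50_incorrect)
```

Under these assumptions both assemblies end up with the same NA50 of 1 Mbp, so the fourfold N50 advantage of the Incorrect Assembly disappears once misassemblies are taken into account.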

Posted in genomics

Friday Action Item: Your #MarchForScience checklist

Sign at the Boston Rally for Science, back in February. (Flickr: AnubisAbyss)

On Fridays while the current administration is in office we’re posting small, concrete things you can do to help make things better. Got a suggestion for an Action Item? E-mail us!

Tomorrow, scientists and science supporters around the world will rally in support of science’s role in society. The flagship March for Science in Washington, D.C. has been plagued by confused messaging and failures to include the full diversity of people working in and interested in science; many of our readers may be going to other Earth Day events or to better-organized satellite marches. Still, I’m hopeful that tomorrow can be the start of a scientific community that is better engaged with the rest of society. Several of us at TME, including me, will be on the National Mall for the march in Washington, and others will be at their local satellite marches. If you’re planning to participate, here are a few things you’ll want to think about before tomorrow morning:

  • Check the event details at the March (or satellite/alternative event) website — what’s the march route and the start time? What will you be able to bring with you?
  • Check the weather forecast, and dress accordingly. Wear comfortable shoes!
  • Organize to meet up with like-minded folks within larger events — for instance, members of the American Society of Naturalists, Society of Systematic Biology, and Society for the Study of Evolution are going to try to meet before the D.C. march, at Federal Triangle.
  • Make your sign. There’s lots of scope for clever, science-y slogans. “Science not Silence” looks like it’s popular, but I also like “Everybody needs science/ Science needs everybody”, and “We’re here, we’re peer-reviewed, get used to it” — there’s more inspiration in this Twitter thread and this Flickr album from the Boston Rally for Science in February. Bonus points if you can tie your slogan into your own research.
  • Speaking of your own research, have an elevator-pitch version of it ready to go — you will (hopefully) be meeting members of the science-supporting public, and maybe even talking to journalists covering the event, so be prepared to explain what your role in science is, and why it’s important for society.

See you in the streets!

Posted in Action Item, politics, United States

The Hype Cycle of Ancient DNA

Recently I saw a graph that I’ve learnt is called the Hype Cycle, a methodology used to assess new technologies and their marketing. What strikes me about it is how well it fits my own research field, paleogenetics, or ancient DNA research.

The Hype Cycle is a graphical tool developed by Gartner, an information technology research and advisory company based in Connecticut. The Hype Cycle depicts five phases of evolution of a new technology, concentrating on the relationship between hype and real adoption of the technology.

Phase I: Innovation Trigger

A potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable products exist and commercial viability is unproven. (© Gartner)

The good old times when anything containing the words “DNA” and “ancient/extinct” got to Nature.

It’s not difficult to identify what the Innovation Trigger was in ancient DNA. The origins of the field trace back to the mid-1980s, when reports of DNA from the quagga, an extinct equid, and from Egyptian mummies were published.

While the first study, on a 120-year-old quagga (Higuchi et al. 1984), came from the lab of the prominent evolutionary biologist Allan Wilson, the work on Egyptian mummies (Pääbo 1985) was done by a then PhD student, Svante Pääbo, as a secret side-project. Later, Pääbo went to Wilson’s lab, and the collaboration of these two men yielded some outstanding studies that regularly appeared in Nature.

Phase II: Peak of Inflated Expectations

Early publicity produces a number of success stories — often accompanied by scores of failures. Some companies take action; many do not. (© Gartner)

In the following years, the world saw DNA sequences of the Tasmanian wolf (Thomas et al. 1989), New Zealand moa (Cooper et al. 1992), and the woolly mammoth (Hagelberg et al. 1994) resurrected, but the scientists didn’t restrict themselves to studying animals. Ancient DNA was also retrieved from plants, e.g. maize (Rollo et al. 1988), and (fanfare) humans (Hagelberg et al. 1989).

Posted in evolution, natural history, Paleogenomics, phylogenetics, population genetics, theory

A solution to the N50 filtering problem

This is the fourth in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it (1, 2), give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

In the two previous posts we described how the N50 metric can be easily manipulated in two common but different ways. The first problem is related to the filtering of contigs. This N50 filtering problem can be easily solved if the approximate genome length of the organism is known. In this case, we can compute something called NG50 (where G stands for Genome) instead of N50. This statistic is defined similarly to N50, but instead of reaching 50% of the total assembly length, we try to reach 50% of the genome length (see example in Fig. 5).

Note that NG50 may be larger than N50 (if the assembly length is larger than the genome length), may be equal (if both lengths are somewhat similar), may be less (if the genome is larger than the assembly), and may even be undefined (if the total assembly length is less than half of the genome length).

Fig. 5. Example assembly of a 500 kbp genome consisting of seven contigs. NG50 = 50 kbp, N50 = 60 kbp.
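The difference between N50 and NG50 comes down to which length the 50% threshold is taken from, which a short Python sketch makes concrete. The helper name `nx_length` and the seven contig lengths are assumptions for illustration: the lengths were picked so that they reproduce the N50 and NG50 values quoted in the caption, and are not read off the actual figure.

```python
def nx_length(contigs, target_length, fraction=0.5):
    """Length of the contig at which the running sum of contigs, taken
    from longest to shortest, first reaches fraction * target_length.
    Returns None if the contigs never reach that threshold."""
    threshold = fraction * target_length
    running = 0
    for length in sorted(contigs, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


# HYPOTHETICAL contig lengths (kbp) for a 500 kbp genome.
contigs_kbp = [100, 80, 60, 50, 40, 30, 20]
genome_kbp = 500

n50 = nx_length(contigs_kbp, sum(contigs_kbp))  # threshold: half the assembly
ng50 = nx_length(contigs_kbp, genome_kbp)       # threshold: half the genome

# Filtering out contigs shorter than 50 kbp changes the assembly length,
# so N50 moves, while the genome length (and thus the NG50 threshold) does not.
filtered = [c for c in contigs_kbp if c >= 50]
n50_filtered = nx_length(filtered, sum(filtered))
ng50_filtered = nx_length(filtered, genome_kbp)

print(n50, ng50, n50_filtered, ng50_filtered)
```

With these made-up lengths, filtering raises N50 from 60 to 80 kbp while NG50 stays at 50 kbp. If the contigs sum to less than half of the genome length, the function returns `None`, which corresponds to the undefined case mentioned above.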

Posted in genomics

To RADseq or not to RADseq?

In the end, we all want to do the best science we can, on the budget we have.

It’s a cliche to say that we live in a moment of unprecedented possibility for molecular ecology, as high-throughput sequencing methods drive the cost of collecting DNA sequence data ever lower. But at the same time, it’s a tricky moment, because the future — in which population genomic data for any species is within, say, the scope of a standard NSF grant proposal — is still unevenly distributed. For study species with small genomes and established resources like high-quality reference assemblies and deep annotation databases, the future is now. For species with large and complex genomes, or without good “infrastructure” to build on, it can still be challenging to obtain useful population-scale data without spending hundreds of thousands of dollars.

For going on a decade, now, the go-to solution for this problem has been reduced-representation sequencing. Led by RADseq, or restriction site-associated DNA sequencing, these methods solve the problem of genomes that are too big to easily sequence by, as it says on the tin, reducing them. Reduced representation offers us an accessible means to identify parts of the genome that are involved in species’ adaptation to different environments and, ultimately, the formation of new species, which is one of the key questions of evolutionary ecology. So it’s no surprise that RADseq and its relatives have been hugely popular. The method was name-checked in the 2010 “Breakthrough of the Year” feature in Science, and the original RADseq papers, published in 2007 and 2008, have almost 2000 citations, as counted by Google Scholar.

So any paper that proposes there may be some problems with RADseq is bound to be controversial. An article published in Molecular Ecology Resources back in December leaned into that controversy right from its title: “Breaking RAD: An evaluation of the utility of restriction site associated DNA sequencing for genome scans of adaptation.” MER has now published the second of two response articles, and a response from the authors of “Breaking RAD” to those responses, so it seems like a good time to break down the reasoning for, and against, RADseq.

Posted in adaptation, association genetics, genomics, methods, next generation sequencing, selection

You can call her queen bee: the role of epigenetics in honeybee development

Insects have social lifestyles that are often organized in castes. Within the insect community, different individuals specialize, each having a unique role. This efficient method of doling out the workload, ultimately, is believed to be why social insect lifestyles are successful. However, how it’s determined who does what is really pretty cool.

Posted in genomics, haploid-diploid, Molecular Ecology, the journal, next generation sequencing, RNAseq

The N50 misassembly problem

This is the third in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it, give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

In our previous post, we highlighted one problem with N50 and showed a common and easy way to inflate this metric by filtering out shorter contigs. There is, however, a second problem with the N50 metric: it does not consider the correctness of an assembly at all. You can therefore easily increase your N50 by using an assembler that incorrectly joins contigs together. Let’s consider a trivial example.

You perform a de novo genome assembly of a 4 Mbp genome (Fig. 4a) and end up with four contigs of length 1 Mbp each (Fig. 4b). Let’s also assume these contigs are correct with respect to the reference genome. The N50 of your assembly will be 1 Mbp. However, you can easily create a new assembly with a four times higher N50 by simply merging the contigs into one (Fig. 4c). Your new assembly will now be much worse than your previous one, with incorrect merging points (misassemblies), but it will have a much higher N50.

Fig. 4a. Hypothetical reference genome.

Fig. 4b. Correct assembly with N50 = 1 Mbp and 0 misassemblies.

Fig. 4c. Incorrect assembly obtained by merging of contigs. N50 = 4 Mbp, 3 misassemblies.
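The arithmetic of this example is easy to check with a minimal Python sketch. The function name `n50` is just a label for this illustration, and the code looks only at contig lengths, which is exactly why the metric cannot notice the misassemblies.

```python
def n50(contigs):
    """Length of the contig at which the running sum of contigs, taken
    from longest to shortest, first reaches half the total assembly length."""
    threshold = 0.5 * sum(contigs)
    running = 0
    for length in sorted(contigs, reverse=True):
        running += length
        if running >= threshold:
            return length


# Correct assembly: four contigs of 1 Mbp each.
correct = [1_000_000] * 4

# Incorrect assembly: the same four contigs merged into a single contig.
merged = [sum(correct)]

n50_correct = n50(correct)
n50_merged = n50(merged)
print(n50_correct, n50_merged)
```

The merged assembly scores 4 Mbp against 1 Mbp for the correct one, even though nothing about the underlying sequence has improved.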

Posted in genomics