The N50 misassembly problem

4a. Hypothetical reference genome

This is the third in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it, give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

In our previous post, we highlighted one problem with N50 and showed a common and easy way to inflate this metric by filtering out shorter contigs. There is, however, a second problem with the N50 metric: it does not consider the correctness of an assembly at all. You can therefore easily increase your N50 by using an assembler that incorrectly joins contigs together. Let’s consider a trivial example.

You perform a de novo genome assembly of a 4 Mbp genome (Fig. 4a) and end up with four contigs of length 1 Mbp each (Fig. 4b). Let’s also assume these contigs are correct with respect to the reference genome. The N50 of your assembly will be 1 Mbp. However, you can easily create a new assembly with a four times higher N50 by simply merging all four contigs into one (Fig. 4c). Your new assembly will be much worse than your previous one, because it contains incorrect merging points (misassemblies), but it will have a much higher N50.

4b. Correct assembly with N50 = 1 Mbp and 0 misassemblies.
4c. Incorrect assembly obtained by merging contigs. N50 = 4 Mbp, 3 misassemblies.

As you can see, the N50 metric is very easy to manipulate and can therefore be highly misleading.
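
The arithmetic behind this example can be checked with a minimal Python sketch. The function name `n50` and the variable names are our own illustrative choices, and the sketch assumes the usual definition of N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. Note that it only looks at contig lengths, so it cannot tell a correct assembly from a misassembled one.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given only its contig lengths."""
    # Walk through the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half of the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


MBP = 1_000_000

# Fig. 4b: four correct contigs of 1 Mbp each.
correct_assembly = [MBP] * 4

# Fig. 4c: the same sequence incorrectly merged into a single contig.
merged_assembly = [sum(correct_assembly)]

print(n50(correct_assembly))  # 1000000
print(n50(merged_assembly))   # 4000000
```

Nothing in the calculation refers to the reference genome, which is exactly why the merged assembly scores four times higher despite its misassemblies.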

Now, put yourself in the shoes of a developer who is writing assembly software. You can configure the assembler to be more or less sensitive when joining contigs and reads together. Making the assembler slightly more prone to join sequences together will lead to assemblies with higher N50. This might make your assembler look favorable compared to a competing lab’s software, it might attract more users who are pleased about their higher N50, and you might get more citations. But a wrongly assembled genome with lots of chimeras is not better than one with multiple contigs and low N50, so keep this in mind whenever you compare assemblers or assemblies by N50 alone.

Next week, we’ll start to look at some of the available solutions to the two N50 problems we’ve discussed.

This post has been jointly written by Elin Videvall, Andrey Prjibelski, and Alexey Gurevich.
