The N50 filtering problem

This is the second in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it, give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

The problem with N50 (or Nx in general) is not the number itself but that it can be highly misleading when used to describe sequence assemblies. In an ideal world, a genome assembly would consist of a few contigs representing entire chromosome sequences, giving a high N50 value. A poor assembly would instead consist of a massive number of tiny, fragmented contigs, giving a low contig N50. This is why people generally view larger N50 values as indicators of better assemblies. However, this is not always the case: the N50 number can be manipulated in various ways, falsely giving the impression that the assembly is of higher quality than it is.

To illustrate how easy it is to increase one’s contig N50 value, we will borrow a filtering example from the great ACGT blog by Keith Bradnam (unfortunately no longer updated).
Imagine that you perform a de novo genome assembly and end up with the following contig distribution:

  • 5,000 contigs of 100 bp length
  • 100 contigs of 10 kbp
  • 10 contigs of 1 Mbp
  • 1 contig of 10 Mbp

A common strategy for filtering genome assemblies is to remove very short contigs, since these may not be biologically meaningful or useful in your analyses anyway. But where should we draw the line between contigs to keep and contigs to filter out? Let’s say that we produce three assembly versions from the example assembly above, each with a different filtering criterion. Assembly1 includes all contigs, Assembly2 filters out the shortest (100 bp) contigs, and Assembly3 removes both the shortest and the second shortest (10 kbp) contigs.

Let’s start by calculating the average contig lengths of the three assemblies.

Mean contig lengths:

  • Assembly1 = 4,207 bp
  • Assembly2 = 189,189 bp
  • Assembly3 = 1,818,182 bp
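
The means above can be checked with a few lines of Python. This is a minimal sketch: the `CONTIGS` table simply restates the example distribution, while the function name `mean_length` and the cutoff values in `CUTOFFS` are illustrative assumptions chosen to reproduce the three filtering criteria.

```python
# Contig distribution from the example, as (count, length in bp) pairs.
CONTIGS = [(5_000, 100), (100, 10_000), (10, 1_000_000), (1, 10_000_000)]

# Hypothetical minimum-length cutoffs that reproduce the three versions:
# keep everything, drop the 100 bp contigs, drop the 100 bp and 10 kbp contigs.
CUTOFFS = {"Assembly1": 0, "Assembly2": 1_000, "Assembly3": 100_000}


def mean_length(contigs, min_length=0):
    """Mean contig length after dropping contigs shorter than min_length."""
    kept = [(count, length) for count, length in contigs if length >= min_length]
    total_bp = sum(count * length for count, length in kept)
    total_contigs = sum(count for count, _ in kept)
    return total_bp / total_contigs


for name, cutoff in CUTOFFS.items():
    print(f"{name}: {mean_length(CONTIGS, cutoff):,.0f} bp")
```

Assembly1 contains 5,111 contigs totalling 21.5 Mbp, whereas Assembly3 contains only 11 contigs totalling 20 Mbp. Almost all of the change in the mean comes from the shrinking contig count, not from any change in the sequence that was assembled.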

Not surprisingly, the average contig length increases drastically when we remove the shorter contigs. Now, let’s see what the assemblies’ respective N50 values tell us.

N50 contig lengths:

  • Assembly1 = 1 Mbp
  • Assembly2 = 1 Mbp
  • Assembly3 = 10 Mbp
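
The N50 values above can be reproduced with a short sketch. It assumes the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches at least half of the total assembly size. The function name `n50` and the list names are illustrative, not taken from any particular tool.

```python
def n50(lengths):
    """Length of the contig at which the running total, counted from the
    longest contig downwards, first reaches at least half the assembly size."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids any floating-point rounding at the midpoint.
        if 2 * running >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# The example distribution, expanded to one entry per contig.
assembly1 = [100] * 5_000 + [10_000] * 100 + [1_000_000] * 10 + [10_000_000]
assembly2 = [length for length in assembly1 if length > 100]      # drop 100 bp contigs
assembly3 = [length for length in assembly1 if length > 10_000]   # also drop 10 kbp contigs

for name, assembly in [("Assembly1", assembly1),
                       ("Assembly2", assembly2),
                       ("Assembly3", assembly3)]:
    print(f"{name}: N50 = {n50(assembly):,} bp")
```

Note how close Assembly3 sits to the boundary: its total size is exactly 20 Mbp, so the single 10 Mbp contig covers exactly half of it. With the "at least half" convention used here, that contig alone sets the N50. Removing just 1.5 Mbp of short contigs out of 21.5 Mbp is enough to move the N50 from 1 Mbp to 10 Mbp.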

The N50 now tells us that Assembly1 and Assembly2 are equally good. And if we are naive enough, we might conclude that Assembly3 is the best version: its N50 is ten times higher than that of the other two assemblies! Yet all three are derived from the exact same initial assembly and have only been filtered more or less stringently. In this way, the N50 number can greatly mislead the reader.

Next week, we’ll continue to describe another problem with the N50 metric.

This post has been jointly written by Elin Videvall, Andrey Prjibelski, and Alexey Gurevich.
