A solution to the N50 filtering problem

This is the fourth in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it (1, 2), give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

In the two previous posts we described how the N50 metric can be easily manipulated in two common but different ways. The first problem is related to the filtering of contigs. This N50 filtering problem can be easily solved if the approximate genome length of the organism is known. In this case, we can compute NG50 (where G stands for Genome) instead of N50. This statistic is defined similarly to N50, but instead of reaching 50% of the total assembly length, we add up contigs from longest to shortest until we reach 50% of the genome length; NG50 is the length of the contig at which that happens (see example in Fig. 5).

Note that NG50 may be larger than N50 (if the assembly length is larger than the genome length), may be equal (if both lengths are somewhat similar), may be less (if the genome is larger than the assembly), and may even be undefined (if the total assembly length is less than half of the genome length).
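To make the definition concrete, here is a minimal Python sketch of how NG50 (and NGx in general) could be computed. The function names `ngx` and `nx` and the toy contig lengths are our own illustrative choices, not taken from any particular tool, and the genome length is assumed to be known in advance.

```python
from typing import Optional, Sequence


def ngx(contig_lengths: Sequence[int], genome_length: int, x: float = 50.0) -> Optional[int]:
    """Length of the contig at which the longest-first running total
    reaches x% of genome_length, or None if it never does."""
    cumulative = 0
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative * 100 >= x * genome_length:
            return length
    return None  # assembly covers less than x% of the genome: NGx is undefined


def nx(contig_lengths: Sequence[int], x: float = 50.0) -> Optional[int]:
    """Regular Nx: the same walk, with the assembly length as the reference."""
    return ngx(contig_lengths, sum(contig_lengths), x)


# Made-up contig lengths in kbp (290 kbp in total), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]

print(nx(toy_contigs))          # N50
print(ngx(toy_contigs, 100))    # genome shorter than the assembly
print(ngx(toy_contigs, 400))    # genome longer than the assembly
print(ngx(toy_contigs, 1000))   # assembly shorter than half the genome
```

With these toy numbers N50 is 70 kbp, while NG50 comes out as 80 kbp, 50 kbp, or undefined (`None`), depending on the assumed genome length. These are the larger, smaller, and undefined cases described above.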

Fig. 5. Example assembly of a 500 kbp genome consisting of seven contigs. NG50 = 50 kbp, N50 = 60 kbp.

Let’s recall our first assembly example from the N50 filtering problem, this time looking at each assembly’s NG50 instead of its N50. As a reminder, Assembly1 consisted of the following contigs:

  • 5,000 contigs of 100 bp length
  • 100 contigs of 10 kbp
  • 10 contigs of 1 Mbp
  • 1 contig of 10 Mbp

We also constructed Assembly2 by keeping only the contigs longer than 100 bp, and Assembly3 by keeping only the contigs longer than 10 kbp. Let’s also imagine that these assemblies correspond to an organism with a genome length of 22 Mbp.

Now, let’s see what the assemblies’ respective NG50 values are:

  • Assembly1 = 1 Mbp
  • Assembly2 = 1 Mbp
  • Assembly3 = 1 Mbp

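All of them are now equal! In every case, half of the genome length (11 Mbp) is reached by the 10 Mbp contig plus one of the 1 Mbp contigs, no matter how many short contigs were removed. This is as it should be: all of these assemblies were constructed from the same initial assembly, and filtering out short contigs should not affect assembly quality (in terms of long contigs).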

If we created an additional Assembly4 by keeping only contigs longer than 1 Mbp (a single 10 Mbp contig, in fact), we would end up with an undefined NG50. This happens because Assembly4 would contain less than 50% of the genome length (22/2 = 11 Mbp). It also means that our filtering strategy was too strict and we should consider a lower threshold.
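The numbers above can be checked with a short Python sketch. It stores each assembly as (contig length, count) pairs and assumes the 22 Mbp genome length from our example; the helper names `keep_longer_than` and `ng50` are hypothetical and used here only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Groups = List[Tuple[int, int]]  # (contig length in bp, number of contigs)

# Assembly1 as listed in the example above.
ASSEMBLY1: Groups = [
    (100, 5_000),
    (10_000, 100),
    (1_000_000, 10),
    (10_000_000, 1),
]
GENOME_LENGTH = 22_000_000  # the genome length assumed in the example


def keep_longer_than(groups: Groups, min_length: int) -> Groups:
    """Filter an assembly, keeping only contigs strictly longer than min_length."""
    return [(length, count) for length, count in groups if length > min_length]


def ng50(groups: Groups, genome_length: int) -> Optional[int]:
    """NG50 of an assembly given as (length, count) groups, or None if undefined."""
    cumulative = 0
    for length, count in sorted(groups, reverse=True):  # longest contigs first
        cumulative += length * count
        # All contigs in a group share one length, so if half of the genome
        # is reached anywhere inside the group, that length is the NG50.
        if 2 * cumulative >= genome_length:
            return length
    return None


assemblies: Dict[str, Groups] = {
    "Assembly1": ASSEMBLY1,
    "Assembly2": keep_longer_than(ASSEMBLY1, 100),
    "Assembly3": keep_longer_than(ASSEMBLY1, 10_000),
    "Assembly4": keep_longer_than(ASSEMBLY1, 1_000_000),
}
results = {name: ng50(groups, GENOME_LENGTH) for name, groups in assemblies.items()}

for name, value in results.items():
    print(name, "undefined" if value is None else f"{value:,} bp")
```

Running it reports 1,000,000 bp for Assembly1 through Assembly3 and "undefined" for Assembly4, in line with the values discussed above.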

The NG50 metric was first defined and used in the Assemblathon 1 paper (Earl et al. 2011). Note that NG25, NG90, and NGx in general can be defined alongside NG50, just as we do for derivatives of the regular N50 statistic.

Next week, we’ll take a look at some of the available solutions to the N50 misassembly problem.

This post has been jointly written by Elin Videvall, Andrey Prjibelski, and Alexey Gurevich.

References

Earl et al. 2011. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research. doi: 10.1101/gr.126599.111
