This is the fifth in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it, give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.
The misassembly problem of N50 that we described in post #3 relates to the erroneous joining of small contigs into larger ones. This problem is more complicated and harder to solve than the filtering problem. Several solutions were proposed in the early 2010s:
- corrected Nx (Salzberg et al. 2011)
- contig path Nx over alignment graph (Earl et al. 2011)
- normalized N50 (Mäkinen et al. 2012)
- NAx (where A stands for Aligned) (Gurevich et al. 2013)
All these approaches are similar, and all require a high-quality reference genome of the sequenced organism. Here we take a look at the definition of NAx.
To compute the NA50 of an assembly with respect to a reference genome, we align the contigs to the genome. Contigs containing erroneous joins (misassemblies) are split into aligned blocks, which are then aligned independently to distinct fragments of the genome. If we compute the N50 statistic for the set of aligned blocks (rather than the initial set of contigs), we obtain the NA50 value. Note that even if some of the contigs fail to align, NA50 is still computed with respect to 50% of the total assembly length (which includes both aligned and unaligned contigs) to be compatible with N50.
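The definition above can be sketched in a few lines of Python. This is a minimal illustration and not the code of any particular tool: the function name `nax`, its parameters, and the toy lengths are all made up for this example, and the aligned block lengths are assumed to come from an aligner that has already been run.

```python
from typing import Optional, Sequence


def nax(block_lengths: Sequence[int],
        total_assembly_length: int,
        x: float = 50.0) -> Optional[int]:
    """Return the Nx-style value of `block_lengths`, measured against
    x% of `total_assembly_length` (hypothetical helper for illustration).

    Passing the contig lengths gives Nx; passing the aligned block
    lengths gives NAx. Returns None if the blocks never reach the
    threshold, for example when too much of the assembly is unaligned.
    """
    threshold = total_assembly_length * x / 100.0
    running = 0
    # Walk from the longest block to the shortest, accumulating length.
    for length in sorted(block_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


# Toy assembly (made-up numbers): contigs of 5, 3 and 2 kbp.
contigs = [5000, 3000, 2000]
total = sum(contigs)  # includes aligned and unaligned contigs

# Assumed alignment result: the 5 kbp contig contains one misassembly
# and is split into 3 kbp + 2 kbp blocks; the 3 kbp contig aligns as a
# single block; the 2 kbp contig fails to align and yields no block.
aligned_blocks = [3000, 2000, 3000]

n50 = nax(contigs, total)
na50 = nax(aligned_blocks, total)
print(n50, na50)
```

In this toy case the N50 is 5000 bp while the NA50 drops to 3000 bp, because the longest contig no longer counts as a single piece. The threshold stays at 50% of the full 10 kbp assembly even though only 8 kbp aligned.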
Let’s try the NA50 approach using our hypothetical assembly from post #3. We had a Correct Assembly consisting of four contigs of 1 Mbp each (Fig. 6a), and an Incorrect Assembly where these four contigs had been misassembled into a single one (Fig. 6b). The N50 values are 1 Mbp and 4 Mbp, respectively. Since the N50 of the Incorrect Assembly is four times larger, one can easily be fooled into thinking it’s better, but we will now evaluate this using NA50 instead.
Fig 6a. Correct Assembly with N50 = 1 Mbp and 0 misassemblies.
Fig 6b. Incorrect Assembly obtained by merging the contigs of the Correct Assembly. N50 = 4 Mbp, 3 misassemblies.
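The two assemblies in Fig. 6 can be run through the same procedure as a rough sketch. The helper names below are hypothetical, and the breakpoint positions (1, 2 and 3 Mbp) are an assumption that follows from joining four 1 Mbp contigs end to end; the sketch also assumes that every resulting block aligns to the reference.

```python
MBP = 1_000_000


def n50_of(lengths, total_length):
    """N50-style statistic of `lengths` against half of `total_length`."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total_length:
            return length
    return None


def split_at_breakpoints(contig_length, breakpoints):
    """Split one contig into aligned blocks at misassembly breakpoints."""
    edges = [0] + sorted(breakpoints) + [contig_length]
    return [end - start for start, end in zip(edges, edges[1:])]


# Fig. 6a: four correct 1 Mbp contigs, so the blocks equal the contigs.
correct_contigs = [MBP] * 4
correct_blocks = list(correct_contigs)

# Fig. 6b: one 4 Mbp contig with three misassemblies (assumed positions).
incorrect_contigs = [4 * MBP]
incorrect_blocks = split_at_breakpoints(4 * MBP, [1 * MBP, 2 * MBP, 3 * MBP])

total = 4 * MBP  # both assemblies have the same total length

for name, contigs, blocks in [
    ("Correct", correct_contigs, correct_blocks),
    ("Incorrect", incorrect_contigs, incorrect_blocks),
]:
    print(name, "N50 =", n50_of(contigs, total), "NA50 =", n50_of(blocks, total))
```

Under these assumptions both assemblies end up with an NA50 of 1 Mbp, so the fourfold N50 advantage of the Incorrect Assembly disappears once its misassemblies are taken into account.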