A solution to the N50 misassembly problem

This is the fifth in a series of posts where we explain the N50 (Nx) metric, discuss the problems surrounding it, give solutions to those problems, and suggest an alternative N50 metric for transcriptome assemblies.

The misassembly problem of N50 that we described in post #3 relates to the erroneous joining of small contigs into larger ones. This problem is more complicated and harder to solve than the filtering problem. Various solutions were proposed in the early 2010s:

  • corrected Nx (Salzberg et al. 2011)
  • contig path Nx over alignment graph (Earl et al. 2011)
  • normalized N50 (Mäkinen et al. 2012)
  • NAx (where A stands for Aligned) (Gurevich et al. 2013)

All these approaches are quite similar, and all of them require a high-quality reference genome of the sequenced organism. Here we will take a look at the definition of NAx.

To compute the NA50 of an assembly with respect to a reference genome, we first align the contigs to the genome. Contigs containing erroneous joins (misassemblies) are split into aligned blocks, each of which aligns independently to a distinct fragment of the genome. If we compute the N50 statistic for the set of aligned blocks (rather than the initial set of contigs), we obtain the NA50 value. Note that even if some of the contigs fail to align, NA50 is still computed with respect to 50% of the total assembly length (which includes both aligned and unaligned contigs), so that it remains compatible with N50.
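The definition above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `nx` and `na50` and the example lengths are hypothetical, and the aligned block lengths are given directly rather than derived from a real alignment.

```python
def nx(lengths, total, x=50):
    """Return the largest length L such that pieces of length >= L
    cover at least x% of `total`, or None if that is never reached."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        # Integer comparison avoids floating-point rounding issues.
        if covered * 100 >= total * x:
            return length
    return None  # metric is undefined


def na50(aligned_blocks, contig_lengths):
    """N50 over the aligned blocks, measured against the total assembly
    length (aligned and unaligned contigs alike)."""
    return nx(aligned_blocks, total=sum(contig_lengths), x=50)


# Hypothetical assembly of three contigs. We assume the 500 kbp contig
# contains one misassembly and splits into blocks of 300 and 200 kbp.
contigs = [500_000, 300_000, 200_000]
blocks = [300_000, 200_000, 300_000, 200_000]

print(nx(contigs, sum(contigs)))  # N50  -> 500000
print(na50(blocks, contigs))      # NA50 -> 300000
```

Because the threshold is taken from the contigs while the lengths come from the blocks, `na50` returns `None` whenever the aligned blocks together cover less than half of the assembly.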

Let’s try the NA50 approach using our hypothetical assembly from post #3. We had a Correct Assembly consisting of four contigs of 1 Mbp each (Fig. 6a), and an Incorrect Assembly where these four contigs had been misassembled into a single one (Fig. 6b). The N50 values are 1 Mbp and 4 Mbp, respectively. Since the N50 of the Incorrect Assembly is four times larger, one can easily be fooled into thinking it’s better, but we will now evaluate this using NA50 instead.

Fig 6a. Correct Assembly with N50 = 1 Mbp and 0 misassemblies.

Fig 6b. Incorrect Assembly obtained by merging the contigs of the Correct Assembly. N50 = 4 Mbp, 3 misassemblies.

We first perform sequence alignment and then calculate N50 for the aligned blocks (instead of the contigs themselves). The N50 value for the aligned blocks in the Correct Assembly is 1 Mbp, because all contigs correctly align to the reference (Fig. 6c). For the Incorrect Assembly, the single 4 Mbp contig splits into four blocks during alignment (Fig. 6d), so the N50 of these blocks is also 1 Mbp.

Therefore, the NA50 is 1 Mbp for both the Correct and the Incorrect Assembly, while the number of misassemblies is 0 and 3, respectively. The inflated N50 of the Incorrect Assembly disappears, and the misassembly count allows us to clearly identify the better assembly of the two.

 

Fig 6c. Alignment of the Correct Assembly to the reference genome provides four alignment blocks, 1 Mbp each, resulting in NA50 = 1 Mbp.

Fig 6d. Alignment of the Incorrect Assembly to the reference genome provides the same four alignment blocks of 1 Mbp each, resulting in NA50 = 1 Mbp.
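The numbers in Fig. 6a to 6d can be reproduced with a short sketch. The names `n50`, `summarize`, `correct`, and `incorrect` are made up for this illustration, and counting misassemblies as "blocks minus contigs" is an assumption that only holds in this toy case, where every contig aligns and each extra block comes from exactly one breakpoint.

```python
MBP = 1_000_000


def n50(lengths, total):
    """Largest length L such that pieces of length >= L cover >= 50% of total."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return None  # undefined: pieces cover less than half of total


# Contig lengths and the aligned blocks they split into (from Fig. 6a-6d).
correct = {"contigs": [MBP] * 4, "blocks": [MBP] * 4}
incorrect = {"contigs": [4 * MBP], "blocks": [MBP] * 4}


def summarize(assembly):
    total = sum(assembly["contigs"])  # NA50 uses the assembly length too
    return {
        "N50": n50(assembly["contigs"], total),
        "NA50": n50(assembly["blocks"], total),
        # Toy-case assumption: one misassembly per extra block.
        "misassemblies": len(assembly["blocks"]) - len(assembly["contigs"]),
    }


print(summarize(correct))    # N50 = 1 Mbp, NA50 = 1 Mbp, 0 misassemblies
print(summarize(incorrect))  # N50 = 4 Mbp, NA50 = 1 Mbp, 3 misassemblies
```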

Note that NA50 is always less than or equal to the N50 of the same assembly, because an aligned block can never be longer than the contig it comes from. Just like NG50 (which we discussed last week), the NA50 metric may be undefined. This happens when the total length of all aligned blocks is less than 50% of the total assembly length.

Now we can try to solve both of the N50 problems we’ve discussed at the same time, by combining the two solutions, NG50 and NA50. The resulting metric is logically called NGA50. It is computed just like NA50, except that the threshold is 50% of the reference genome size instead of 50% of the total assembly length.
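As a rough sketch, NGA50 only changes the denominator of the NA50 calculation. The function name `nga50` and the reference genome sizes below are hypothetical values chosen for illustration; the block lengths are those of the toy assembly from Fig. 6c and 6d.

```python
MBP = 1_000_000


def nga50(aligned_blocks, genome_size):
    """N50 over aligned blocks, measured against the reference genome size.

    Returns None when the blocks cover less than half of the genome,
    in which case the metric is undefined.
    """
    covered = 0
    for length in sorted(aligned_blocks, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return None


blocks = [MBP] * 4  # four aligned blocks of 1 Mbp each

# Assumed reference of 4 Mbp: two blocks reach the 2 Mbp threshold.
print(nga50(blocks, genome_size=4 * MBP))   # -> 1000000

# Assumed reference of 10 Mbp: 4 Mbp of blocks never reach 5 Mbp.
print(nga50(blocks, genome_size=10 * MBP))  # -> None
```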

QUAST (Gurevich et al. 2013) is a software tool that calculates the Nx, NGx, NAx, and NGAx statistics at the commonly used levels x=50 and x=75, and also plots the distribution for all x from 0 to 100. QUAST is available both as a command-line tool and as a web server, which makes it convenient for all users.

Next week, we’ll take a look at an alternative to the N50 metric that has been proposed for transcriptome assemblies.

This post has been jointly written by Elin Videvall, Andrey Prjibelski, and Alexey Gurevich.

Disclosure: Andrey and Alexey are developers of the QUAST software.


References

Salzberg SL, et al. (2011) GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Research 22:557–567.

Earl D, et al. (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research. doi: 10.1101/gr.126599.111

Mäkinen V, et al. (2012) Normalized N50 assembly metric using gap-restricted co-linear chaining. BMC Bioinformatics 13:255.

Gurevich A, et al. (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8):1072–1075.

 

About Elin Videvall

Elin is a PhD candidate in the Molecular Ecology and Evolution Lab, Lund University, Sweden. She studies birds and their microbes by analysing genomes, transcriptomes, and microbiomes. You can find her on Twitter: @ElinVidevall