The trouble with PCR duplicates

The sequencing center just sent your lane of Illumina data. You’re excited. Life is great. You begin to process the data. You align the data. You check for PCR duplicates. 50 percent. Half of your data is garbage. Everything is horrible. Life is horrible. How did this happen!?!

PCR duplicates are a headache… if a headache were costing you hundreds/thousands of dollars in wasted sequencing. However, they’re an inevitable part of life when using PCR during Illumina library prep. We can define a PCR duplicate as any two reads that came from the same original DNA fragment. These are a problem because they falsely increase homozygosity. I’ve recently spent way too much time thinking about how these duplicates arise, how we might minimize them, and generally trying to understand what the heck is going on during library prep and sequencing.

In this post I’ll be walking through some RAD data I’ve recently generated (some of these findings could apply to whole genome sequencing, though most of the issues would be far less likely). I’ll focus on the original RAD approach and will assume everyone is somewhat familiar with this method, but see Andrews et al., 2016 for an overview. Hopefully some of this will be useful to others.

There are two main steps in RAD prep that are important in generating PCR duplicates: fragmentation and PCR amplification. During fragmentation DNA is sheared randomly. PCR is then performed on these fragments to incorporate the correct adaptors for sequencing and to generate sufficient library quantity. Typical library prep uses around 10 PCR cycles. This means that the majority of your library consists of reads generated via PCR. That is, basically all of the fragments are identical to at least one other fragment in the library. This means that we will almost certainly sequence some fragments that came from the same starting DNA.

Moreover, the more cycles of PCR, the higher percentage of PCR duplicates you’ll likely see.

With RAD data, these duplicates can be identified after alignment by using the 5’ position of the reverse read. At one restriction site for one individual forward reads are anchored to the restriction site and guaranteed to be identical, regardless of whether or not the read came from different starting fragments. However, the random fragmentation means that reads will be of slightly different lengths. Therefore, reverse reads from unique starting fragments will start (and end) at different locations. Reads with matching forward and reverse reads probably arose from the same starting DNA fragment. This is one big reason I prefer paired end sequencing over single end.

From this point forward, I’m going to talk about RAD data from three different libraries generated using SbfI and the method described in Ali et al. 2016. This method is analogous to the “original RAD” approach, so much of this won’t apply to ezRAD, ddRAD, 2bRAD, etc. These libraries, each with 96 individuals, were ligated with RAD adapters then split into two separate library preps for sonication and Illumina prep (this was done to normalize reads across individuals between two sequencing runs, a detail that isn’t important for the data shown here). I’ll refer to these libraries as library 1, 2, and 3 and each prep as A and B. Library 1A, 1B, 2A, 2B, etc. Prep A used 11 PCR cycles, Prep 2 used 9 PCR cycles. PCR duplicates were marked with samblaster.

Number of reads and PCR duplicates

In theory, if you did one PCR cycle and sequenced every single fragment in your library, 50 percent of your reads would be PCR duplicates. In practice, we don’t sequence every read in our library. But we may expect that the higher the proportion of our reads we sequence, the higher rates of PCR duplicates we may see. This is, indeed, the case. Looking at Figure 1, we see that individuals with more reads have higher levels of PCR duplicates. Similarly, looking between libraries (Figure 2) we see a weak correlation between number of reads and number of duplicates (this effect is not significant; p = 0.13).

Percentage of PCR duplicates for three libraries of 96 individuals with two replicate preps. — Figure 1: Percentage of PCR duplicates for three libraries (1, 2, 3) of 96 individuals with two replicate preps (A and B).

However, between prep A and B, the PCR cycles have less of an effect (Figure 2). Prep A (11 cycles) has higher rates of duplication than prep B (9 cycles), but this is confounded with higher reads. When taking the high number of reads into account, there is no effect of cycle number on duplicate rate.

Accumulation of PCR duplicates during sequencing

I also wanted to understand more about the frequency of the accumulation of PCR duplicates during sequencing. The first read can’t be a duplicate. The second is very unlikely to be a duplicate and each subsequent read will have some increasing probability of being a duplicate (I realize that sequencing is done in parallel, so I’m really thinking more of the order they’re assessed during analysis).

Using data from three individuals, I sorted the aligned bam files by name and marked PCR duplicates. I then went through the reads in order using a 1000 read sliding window to ask what proportion of each 1000 reads were PCR duplicates. This is basically the probability that any new read will be a duplicate of any read that came before it. I then plotted these probabilities to observe how the rate of the accumulation of PCR duplicates changes as reads are assessed.

Looking at Figure 3, you can see that duplicates increase, more or less, linearly as more reads are interrogated. While the total proportion of duplicates is around 0.2, the last read assessed has a much higher probability of being a duplicate than the early reads.

Figure 3: Probability of a read being marked duplicate. Red dashed line shows the proportion of duplicates for each individual.

False positive duplicate identification

Finally, programs to identify PCR duplicates are not designed specifically for RAD-seq. These methods are assuming that unique reads won’t be identical by chance. This assumption is likely to be violated with RAD data, especially because we’re relying on the position of the reverse read alone. It really isn’t all that unlikely that two fragments from the same restriction site will be of the same length, thus falsely looking like duplicates. This isn’t a problem with whole genome sequencing because reads aren’t anchored to specific restriction sites.

I wanted to see how often unique reads were falsely marked as duplicates (false positives). To do this, I used my replicates libraries, removed duplicates within a library, then merged each replicate (so library 1 A and B were merged, etc.). I then marked duplicates again. With this merged dataset, I know that the fragmentation and PCR steps were done independently and all true duplicates have been removed. Because of this, any marked duplicates in the merged library are false positives. I find that an average of 3.8% of reads in the combined libraries are falsely marked as duplicates. I’m not confident that this number is an accurate representation of what actual false positive rates are, but I am confident that false positives are present to some degree. Again, I think this is occurring because of how fragments are anchored to restriction sites; duplicate marking programs just aren’t designed for the lack of variation present in these data.

Conclusions

There are a number of important things that I think can be taken away from these data in regard to PCR duplicates and RAD-seq:

Sequencing efforts have diminishing returns. If you want high depth, it might be worth doing multiple library preps as sequencing will quickly start to produce more duplicates than unique reads.
The number of PCR cycles used probably isn’t all that important. Especially if you’re sequencing at low depths. For most applications I wouldn’t worry too much about minimizing cycle number (though, of course, I’d always prefer fewer).
Some of your PCR duplicates are false positives. To me it seems better to have a low rate of true homozygous reads removed (perhaps increasing false heterozygote genotype calls) than to include many true PCR duplicates (falsely increasing homozygosity). I make this statement based solely on the relative rates of these phenomenon. To solve this, you could use a PCR free method such as ezRAD (with the appropriate PCR free kit) or take advantage of methods that use degenerate bases in adaptors to identify duplicates (for example, Tin et al., 2015). Alternatively, maybe this post will inspire someone to come up with a RAD specific duplicate marking method?

References

Ali, O.A., O’Rourke, S.M., Amish, S.J., Meek, M.H., Luikart, G., Jeffres, C. and Miller, M.R., 2016. RAD capture (Rapture): flexible and efficient sequence-based genotyping. Genetics, 202(2), pp.389-400. doi: 10.1534/genetics.115.183665

Andrews, K.R., Good, J.M., Miller, M.R., Luikart, G. and Hohenlohe, P.A., 2016. Harnessing the power of RADseq for ecological and evolutionary genomics. Nature Reviews Genetics, 17(2), pp.81-92. doi: 10.1038/nrg.2015.28

Tin, M.M.Y., Rheindt, F.E., Cros, E. and Mikheyev, A.S., 2015. Degenerate adaptor sequences for detecting PCR duplicates in reduced representation sequencing data improve genotype calling accuracy. Molecular Ecology Resources, 15(2), pp.329-336. doi: 10.1111/1755-0998.12314