Commanding around a 70% share of a 1.3 billion USD market, Illumina is the major player in next-generation sequencing (NGS) technology. More likely than not, if you’re a molecular ecologist working with NGS data, you’ve run your samples on an Illumina platform. Until recently, this was probably a HiSeq 1500 or 2500, standard equipment for larger university-based and commercial sequencing facilities. Since its introduction in 2015, however, more and more users have switched to the HiSeq 4000, citing its increased data output, efficiency, lower cost per run, and the inevitable obsolescence of earlier entries in the HiSeq series. That is why a preprint posted Sunday, alleging that this new equipment has a flaw that could result in misidentified sequencing reads, spread like wildfire on biology Twitter earlier this week. (As Gavin Sherlock put it: “I think this is a genuine cluster fuck.”)
So, what’s the issue, and how much should you worry? It appears to come down to a change in cluster generation on the new instrument, the step in the Illumina workflow prior to sequencing where libraries are captured and amplified into clonal “clusters.” Originally, Illumina achieved cluster generation by bridge amplification. In this procedure, single-stranded DNA libraries bind at one end to oligonucleotide primers on the glass “flow cell” surface, while nonbinding fragments (such as stray primers) are washed free. DNA polymerase then creates a complementary sequence hybridized to the original template, which itself contains a sequence complementary to a second oligonucleotide primer fixed elsewhere on the flow cell. When these regions hybridize, they form an arching “bridge,” and the second primer facilitates synthesis of a copy of the original library molecule. The process repeats through multiple cycles of denaturation and replication until each cluster contains a suitable quantity of DNA for “sequencing by synthesis” (incorporating fluorescent dyes), bound covalently to the flow cell. Because each flow cell contains numerous oligos, multiple libraries can be sequenced simultaneously (or multiplexed), so long as each strand contains a unique index or combination of indices that allows it to be traced back to the proper sample during data processing.
For the HiSeq 4000, the cluster generation step no longer features the bind-and-wash procedure. Instead, single-stranded DNA libraries are mixed with proprietary reagents on a flow cell patterned with “nanowells,” and subjected to rapid isothermal amplification. And therein lies the problem, Sinha et al. claim: “[If] free index primers are present during this procedure, they can prime the library fragments and get extended by the active DNA polymerase, forming a new library molecule with a different index.” In other words, because free index primers are no longer washed off the flow cell, they can end up being incorporated into copies of the wrong library when samples are multiplexed. The alarming outcome, according to the authors, is that as many as 5 to 7% of all sequencing reads can be misassigned during data processing. (After determining that the signal from a highly expressed gene in their RNAseq data was spreading to wells with a shared row or column, Sinha et al. performed a series of follow-up experiments to demonstrate that empty wells with reagents ended up with high-quality reads assigned to them, that excess free index primers were likely to blame, and that the problem wasn’t restricted to a single HiSeq 4000 instrument.)
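To see why a switched index would show up as signal "spreading" along rows and columns, here is a minimal Python sketch. It assumes a hypothetical 3 x 3 plate in which every row shares one index and every column shares the other; the index names and the layout are invented for illustration and are not taken from the preprint.

```python
from itertools import product

# Hypothetical combinatorial layout (illustrative names only):
# every plate row shares one i7 index, every plate column shares one i5 index.
ROW_INDEX = {"A": "i7_A", "B": "i7_B", "C": "i7_C"}
COL_INDEX = {1: "i5_1", 2: "i5_2", 3: "i5_3"}

# Sample sheet: (i7, i5) index pair -> well name.
SAMPLE_SHEET = {
    (ROW_INDEX[row], COL_INDEX[col]): f"{row}{col}"
    for row, col in product(ROW_INDEX, COL_INDEX)
}


def wells_after_single_switch(well):
    """List the wells a read from `well` could be assigned to if exactly
    one of its two indices is replaced by another index in the pool."""
    row, col = well[0], int(well[1:])
    i7, i5 = ROW_INDEX[row], COL_INDEX[col]
    reachable = set()
    # Switched i7: the read keeps its column index, so it lands in the same column.
    for other_i7 in ROW_INDEX.values():
        if other_i7 != i7:
            reachable.add(SAMPLE_SHEET[(other_i7, i5)])
    # Switched i5: the read keeps its row index, so it lands in the same row.
    for other_i5 in COL_INDEX.values():
        if other_i5 != i5:
            reachable.add(SAMPLE_SHEET[(i7, other_i5)])
    return sorted(reachable)


print(wells_after_single_switch("B2"))  # ['A2', 'B1', 'B3', 'C2']
```

Under these assumptions, a read from well B2 can only wander into wells that share its row or its column, which matches the pattern the authors describe.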
Before you purge your hard drive of the past two years of your career and break out your old mtDNA primer set, some caveats. First, as with any preprint, it’s worth paying attention to see when and if their research makes it through peer review. (And if other labs can replicate their results.) Second, not all studies will be affected equally. Non-multiplexed samples will be fine, and dual-indexed samples where each end is unique should also be fine. For studies with standard dual-indexed samples where one end is shared, the read misassignment is probably happening, but is unlikely to affect your conclusions in most cases. The problem is likely most acute in RNAseq studies where some genes are highly expressed and show a lot of signal (as in the work presented in the preprint), and in studies attempting extremely low frequency variant detection, where a handful of erroneously indexed reads could have a big impact on inferences. (Blog posts here and here offer additional perspective on how big a deal this might turn out to be.) We’ll follow up with more information as it becomes available; so far, Illumina has said they’re aware of the problem and working to fix it, but little else.
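The difference between the two dual-indexing schemes can be sketched in a few lines of Python. This is a toy demultiplexer with made-up index and sample names, not the logic of any real Illumina software: it simply looks up each read's index pair in a sample sheet and sets aside any pair it doesn't recognize.

```python
def demultiplex(reads, sample_sheet):
    """Count reads per sample; index pairs missing from the sample sheet
    are counted as 'undetermined' instead of being assigned to a sample."""
    counts = {}
    for index_pair in reads:
        sample = sample_sheet.get(index_pair, "undetermined")
        counts[sample] = counts.get(sample, 0) + 1
    return counts


# Hypothetical sample sheets (index and sample names are illustrative).
# Shared-end layout: indices are reused across samples, so every pairing is valid.
shared_end = {
    ("i7_A", "i5_1"): "s1",
    ("i7_A", "i5_2"): "s2",
    ("i7_B", "i5_1"): "s3",
    ("i7_B", "i5_2"): "s4",
}
# Unique dual indexing: no index appears in more than one sample.
unique_dual = {
    ("i7_A", "i5_1"): "s1",
    ("i7_B", "i5_2"): "s2",
}

# Three genuine s1 reads plus one s1 read whose i5 index switched to i5_2.
reads = [("i7_A", "i5_1")] * 3 + [("i7_A", "i5_2")]

print(demultiplex(reads, shared_end))   # {'s1': 3, 's2': 1}
print(demultiplex(reads, unique_dual))  # {'s1': 3, 'undetermined': 1}
```

With shared indices the switched read is silently counted toward the wrong sample; with unique pairs, a single switch produces a combination that belongs to no sample and can be discarded. A read in which both indices switched to another sample's pair would still slip through, but that requires two independent events.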
Sinha, R., et al. 2017. Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing. bioRxiv. DOI: 10.1101/125724