Right reads, wrong index? Concerns with data from Illumina’s HiSeq 4000

Commanding around a 70% share of a 1.3 billion USD market, Illumina is the major player in next-generation sequencing (NGS) technology. More likely than not, if you’re a molecular ecologist working with NGS data, you’ve run your samples on an Illumina platform. Until recently, this was probably a HiSeq 1500 or 2500, standard equipment for larger university-based and commercial sequencing facilities. Since its introduction in 2015, however, more and more users have switched to the HiSeq 4000, citing its increased data output, efficiency, lower cost per run, and the inevitable obsolescence of earlier entries in the HiSeq series. That is why a preprint posted Sunday, alleging that the new instrument has a flaw that can result in misassigned sequencing reads, spread like wildfire on biology Twitter earlier this week. (As Gavin Sherlock put it: “I think this is a genuine cluster fuck.”)

So, what’s the issue, and how much should you worry? It appears to come down to a change in cluster generation on the new instrument, the step in the Illumina workflow prior to sequencing where libraries are captured and amplified into clonal “clusters.” Originally, Illumina achieved cluster generation by bridge amplification. In this procedure, single-stranded DNA libraries bind at one end to oligonucleotide primers on the glass “flow cell” surface, while nonbinding fragments (such as stray primers) are washed free. DNA polymerase then creates a complementary sequence hybridized to the original template, which itself contains a sequence complementary to a second oligonucleotide primer fixed elsewhere on the flow cell. When these regions hybridize, they form an arching “bridge,” and the second primer facilitates synthesis of a copy of the original library molecule. The process repeats through multiple cycles of denaturation and replication until each cluster contains a suitable quantity of DNA for “sequencing by synthesis” (incorporating fluorescent dyes), bound covalently to the flow cell. Because each flow cell contains numerous oligos, multiple libraries can be sequenced simultaneously (or multiplexed), so long as each strand contains a unique index or combination of indices that allows its reads to be assigned back to the proper sample during data processing.

Figure 1 from Sinha et al. 2017, illustrating the mechanism for index misassignment during ExAmp cluster generation.

For the HiSeq 4000, the cluster generation step no longer features the bind-and-wash procedure. Instead, single-stranded DNA libraries are mixed with proprietary reagents on a flow cell patterned with “nanowells,” and subjected to rapid isothermal amplification. And therein lies the problem, Sinha et al. claim: “[If] free index primers are present during this procedure, they can prime the library fragments and get extended by the active DNA polymerase, forming a new library molecule with a different index.” In other words, because free index primers are no longer washed off the flow cell, they can end up being incorporated in copies of the wrong library when samples are multiplexed. The alarming outcome, according to the authors, is that as many as 5–7% of all sequencing reads can be misassigned during data processing. (After determining that the signal from a highly expressed gene in their RNAseq data was spreading to wells with a shared row or column, Sinha et al. performed a series of follow-up experiments to demonstrate that otherwise empty wells containing reagents ended up with high-quality reads assigned to them, that excess free index primers were likely to blame, and that the problem wasn’t restricted to a single HiSeq 4000 instrument.)
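To get a feel for why a rate of a few percent matters more for some samples than others, here is a rough back-of-the-envelope sketch in Python. It is not the model used by Sinha et al.: it assumes, purely for illustration, that every read has the same chance of being misassigned and that misassigned reads are spread evenly across the other samples in the pool. The function name and all read counts are hypothetical.

```python
def contaminant_fraction(own_reads, other_reads_total, n_samples, swap_rate):
    """Fraction of reads assigned to one sample that originated elsewhere.

    Toy assumptions (not from the preprint): every read swaps with the same
    probability, and swapped reads land evenly on the other samples.
    """
    # Reads from this sample that keep their correct index
    retained = own_reads * (1 - swap_rate)
    # Reads from all other samples that end up carrying this sample's index
    incoming = swap_rate * other_reads_total / (n_samples - 1)
    return incoming / (incoming + retained)


# Hypothetical pool of 100 samples, 5% of reads misassigned
even_pool = contaminant_fraction(1_000_000, 99_000_000, 100, 0.05)
low_input = contaminant_fraction(10_000, 99_000_000, 100, 0.05)
print(f"evenly pooled sample: {even_pool:.2f}")
print(f"low-input sample:     {low_input:.2f}")
```

Under these assumptions, a sample sequenced to the same depth as its neighbors ends up with roughly the pool-wide misassignment rate, while a sample contributing only a small share of the pool can end up with more foreign reads than its own.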

Before you purge your hard drive of the past two years of your career and break out your old mtDNA primer set, some caveats. First, as with any preprint, it’s worth paying attention to see when and if the research makes it through peer review. (And if other labs can replicate the results.) Second, not all studies will be affected equally. Non-multiplexed samples will be fine, and dual-indexed samples where each end is unique should also be fine. For studies with standard dual-indexed samples where one end is shared, the read misassignment is probably happening, but is unlikely to affect your conclusions in most cases. The problem is likely most acute in RNAseq studies where some genes are highly expressed and show a lot of signal (as in the work presented in the preprint), and in studies attempting extremely low-frequency variant detection, where a handful of erroneously indexed reads could have a big impact on inferences. (Blog posts here and here offer additional perspective on how big a deal this might turn out to be.) We’ll follow up with more information as it becomes available; so far, Illumina has said they’re aware of the problem and working to fix it, but little else.
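The difference between unique and shared dual indexes is easy to see in a toy demultiplexing sketch. The Python below is only an illustration, not any real demultiplexer: the four-base index sequences, sample names, and the classify_read function are all made up, and real software also has to tolerate sequencing errors in the index reads.

```python
def classify_read(i7, i5, sample_sheet):
    """Assign a read to a sample from its (i7, i5) index pair.

    sample_sheet maps (i7, i5) tuples to sample names. Hypothetical helper.
    """
    if (i7, i5) in sample_sheet:
        return sample_sheet[(i7, i5)]
    known_i7 = {pair[0] for pair in sample_sheet}
    known_i5 = {pair[1] for pair in sample_sheet}
    if i7 in known_i7 and i5 in known_i5:
        # Both indices exist in the pool, but never in this combination
        return "SWAPPED"
    return "UNKNOWN"


# Unique dual indexes: no index sequence is shared between samples
unique_sheet = {("AAAA", "CCCC"): "sample_A", ("GGGG", "TTTT"): "sample_B"}
# Shared i5: both samples carry the same index on one end
shared_sheet = {("AAAA", "CCCC"): "sample_A", ("GGGG", "CCCC"): "sample_B"}

# A sample_A molecule whose i7 index was replaced by sample_B's i7
swapped_read = ("GGGG", "CCCC")
print(classify_read(*swapped_read, unique_sheet))  # flagged and discarded
print(classify_read(*swapped_read, shared_sheet))  # silently lands in sample_B
```

With unique indexes on both ends, a single swapped index produces a pair that matches no sample, so the read can be thrown out. With a shared index, the same swap produces a perfectly valid pair for the wrong sample.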


Sinha, R., et al. 2017. Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing. bioRxiv. DOI: 10.1101/125724


About Ethan Linck

I’m a Ph.D. Candidate at the Department of Biology and the Burke Museum of Natural History, University of Washington, Seattle. My research uses museum specimens and genomic data to analyze and archive avian biodiversity and evolution, particularly in western North America and Melanesia.

  • Lutz Froenicke

    I have posted my thoughts on the problem here:

    It is great that the bioRxiv manuscript from the Weissman lab has
    hopefully identified the cause of the problem which has been reported
    previously for some cases (the agent being “free barcoded primers”
    present in the library).

    However, the scary index-swapping rates were seemingly generated from
    low-quality library preps with high primer contamination. I am aware
    that for their specific application (single-cell Smart-seq) one can’t be
    choosy with regards to the library quality. For any other application
    these libraries should have failed the QC. The free primers could and
    should have been easily drastically reduced by an additional magnetic
    bead cleanup for their first experiment. It is nice that we get to
    benefit from this mishap now.

    Obviously these data ask for caution for any multiplexed sequencing projects and for protocol adjustments.

    To some degree the manuscript shows: Ugly things can happen if one sequences really ugly libraries.

    How relevant is this to the sequencing of high-quality and clean
    libraries? The observed linear correlation between primer
    spike-ins and artifact rate indicates, to me, that there is no reason to

  • Nathan R Campbell

    Seems like most libraries would be purified and diluted after quantitation anyway which would wash away and dilute any free unincorporated index primers. Perhaps this is a problem only for specific types of library preps? An additional step of treating with exo1 would remedy the problem for cheap.

    • drphil2

      We’ve been seeing a large increase in “index hopping” over the last 2 years, even though our library prep methods haven’t changed since about 2011 and include multiple washes and gel purification of the pooled library. Sometimes >50% of the reads from an index are misassigned (when the sample is one of >100, and a small fraction of the total reads). We’re starting double indexing, but a good portion of our data generated over the last year may be lost, not to mention the time we will spend trying to figure out how bad the problem is and whether it is due exclusively to Illumina chemistry issues or to other issues as well (some of which cannot be changed, like the quality of DNA from ancient or field-collected samples).
