The next, next generation: long reads facilitate assembly & annotation in large genome species

Delicious wheat bread. This photograph is not paleo approved. Photo credit Mireya Merritt

The typical procedure for constructing a draft genome or transcriptome using current second generation, high-throughput sequencing platforms involves generating short reads about 150 base pairs long, assembling those short reads into larger contigs, putting the contigs in the correct order to create chromosome sequences, and finally annotating protein-coding genes and other elements (for example, introns and transposons). The assembly of contigs can be complicated by a number of factors, particularly if the genome of the species of interest is very large (perhaps due to past genome duplication events), if there are many highly repetitive regions, and/or if there are many highly similar members of multigene families. Ideally, generating full-length reads (as opposed to short reads) would help improve assembly of problematic genomic regions, but generating very long sequences is labor intensive.
Enter third generation sequencing! New technology by Pacific Biosciences (PacBio) performs single molecule real-time (SMRT) sequencing, generating reads up to 20 kb long. There are some concerns about high error rates in PacBio sequences, but these errors can be corrected by constructing consensus reads from raw PacBio subreads and/or by aligning PacBio reads to sequences generated with second generation methods.
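
To make the consensus idea concrete, here is a toy sketch in Python. It assumes the subreads have already been aligned to each other and padded to the same length with "-" for gaps, and it simply takes a majority vote at each position. The function name and the example sequences are made up for illustration; real PacBio consensus calling uses far more sophisticated alignment and error models than this.

```python
from collections import Counter


def majority_consensus(aligned_subreads):
    """Majority-vote consensus over pre-aligned, equal-length subreads.

    Assumption: '-' marks a gap. Columns where a gap wins the vote are
    dropped from the consensus.
    """
    if not aligned_subreads:
        return ""
    length = len(aligned_subreads[0])
    if any(len(s) != length for s in aligned_subreads):
        raise ValueError("subreads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_subreads):
        # Most frequent character in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Hypothetical subreads of the same molecule, each with a different error.
subreads = [
    "ACGTTAGC",
    "ACGATAGC",  # substitution at position 4
    "ACGTTAGC",
    "ACG-TAGC",  # deletion at position 4
]
print(majority_consensus(subreads))
```

Because random errors rarely hit the same position in every pass, the vote recovers the underlying sequence even though no single subread is guaranteed to be correct.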

In their recent BMC Genomics paper, Dong et al. (2015) used the PacBio SMRT platform to sequence full-length transcripts (cDNA) from Triticum aestivum, the common wheat. Common wheat is the most widely cultivated and consumed staple food crop and accounts for 95% of global wheat production. Despite this importance, a complete reference genome for the species has not been available, largely because its genome is large, complex, polyploid, and contains up to 80% repetitive DNA (80%!). An incomplete draft genome based on Illumina HiSeq sequencing and covering about 60% of the genome was recently published (see the paper by The International Wheat Genome Sequencing Consortium). Dong et al. set out to improve upon this draft genome…

[The authors] first identified a population of full-length non-chimeric (FLNC) SMRT cDNA reads from a pooled sample of unfertilized caryopses* and developing grains using Pac-Bio sequencing. Then [they] mapped the reads to the draft genome sequence, and performed an in-depth analysis of the high-quality reads. Finally, [the authors] examined the value of the FLNC reads for finding full-length transcript sequence of the genes encoding three complex families of gluten proteins.
*caryopsis: the fruit of plants in the family Poaceae, which includes wheat, rice, and corn.

The sequencing effort resulted in 197,709 error-corrected FLNC reads, 74.6% of which were estimated to carry a complete open reading frame (that is, they contained both start and stop codons). About 10,000 reads could not be mapped to the previously published draft genome, and the remaining reads fell into groups that varied in their mapping efficiencies. In total, 134,204 reads (67.88% of the total) could be mapped to one unique location with higher than 90% coverage and identity. Further quality control (for example, removing reads missing 5′ exons and singleton reads not supported by supplemental RNA-seq data) resulted in 91,881 reads that mapped to 16,188 loci distributed on 21 wheat chromosomes. By searching the 197,709 FLNC reads, Dong et al. identified full-length transcripts for 72 transcribed gluten gene members belonging to three complex gene families; the proportion of these FLNC reads with a complete gluten gene ORF was 76.8%. In terms of improving the previously sequenced wheat draft genome, the data collected by Dong et al. allowed them to annotate 3,026 new chromosomal loci. Furthermore, 290 FLNC reads that each mapped to two draft genome contigs from the same chromosome arm can be used for refining chromosomal contig assembly. Taken together, these results demonstrate that combining current short-read sequencing technologies with third generation long-read sequencing platforms can facilitate the assembly and annotation of complex plant genomes.
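
As a rough illustration of what "carrying a complete open reading frame" means, the Python sketch below checks whether a read contains an ATG start codon followed, in the same frame, by a stop codon. It is a simplification and not the method Dong et al. used: it assumes the read is already oriented 5′ to 3′, uses the standard stop codons, ignores any minimum ORF length, and the function name and test sequences are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_complete_orf(seq):
    """Return True if any forward frame has an ATG followed in-frame by a stop."""
    seq = seq.upper()
    for frame in range(3):
        in_orf = False
        # Walk the read codon by codon in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf and codon == "ATG":
                in_orf = True
            elif in_orf and codon in STOP_CODONS:
                return True
    return False


# Hypothetical reads: one complete, one truncated before the stop codon,
# and one missing the start codon.
complete = "GGATGAAATTTTAGCC"
truncated_3prime = "GGATGAAATTT"
missing_start = "GGAAATTTTAGCC"

for name, read in [("complete", complete),
                   ("truncated", truncated_3prime),
                   ("missing start", missing_start)]:
    print(name, has_complete_orf(read))
```

A read that passes this kind of check spans the whole coding region, which is what makes full-length reads so useful for annotating genes that short reads can only cover in pieces.
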
Reference:
Lingli Dong, Hongfang Liu, Juncheng Zhang, Shuangjuan Yang, Guanyi Kong, Jeffrey S. C. Chu, Nansheng Chen, and Daowen Wang (2015) Single-molecule real-time transcript sequencing facilitates common wheat genome annotation and grain transcriptome research. BMC Genomics DOI: 10.1186/s12864-015-2257-y
