Hybrid sequencing on hybrids

Today I want to talk about an article by Buggs et al. 2010 (Mol Ecol 19: 132-146), recently published in a special issue of Molecular Ecology. In this paper, the authors used two different next generation methods, pyrosequencing and cyclic reversible terminator sequencing, to understand the fate of duplicated genes in an allopolyploid.
Allopolyploidy occurs when whole genomes are duplicated following hybridization between two species. Genes duplicated as a result of allopolyploidy (homeologs) are thought to meet one of the following evolutionary fates:

  1. One copy is silenced because it accumulates mutations, while the other maintains its function.
  2. One copy can acquire a novel function and natural selection will preserve this new function, while the other maintains the original function.
  3. Both copies accumulate mutations that reduce their total functioning capacity to that of a single copy (Lynch and Conery 2000).

Three plants were used in the study: Tragopogon miscellus, the allotetraploid, and its two diploid parental species, Tragopogon dubius and Tragopogon pratensis. Although it has previously been shown that particular parental genes are silenced in T. miscellus, the question here was whether that pattern scales to the genomic level.
The best way to answer this would be to examine the entire set of functional sequences in all three species and ask which parental species’ genes were primarily lost in the allopolyploid (T. miscellus). Because there is no sequenced genome for any of these three species, the authors used what is called hybrid sequencing.
In hybrid sequencing, Roche 454 is used to generate the de novo transcriptome, or expressed genomic sequences, from a normalized library of one species (T. dubius).
A side note for those who don’t know what a normalized library means: typically, if you extract total RNA from a cell, a few genes are abundantly expressed while many more are expressed at very low levels. Normalizing a library means reducing the sampling of the overabundant transcripts, so that more of the total expressed genome is captured in the sequencing process.
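To make the effect of normalization concrete, here is a toy simulation in Python. Everything in it is an assumption made for illustration: the gene names, the abundances, and especially the `normalize` function, which simply caps each transcript's copy number and is not how a real normalization protocol works. It only shows why damping the abundant transcripts lets the same number of reads reach more genes.

```python
import random

def genes_detected(abundances, n_reads, seed=0):
    """Count distinct genes seen when n_reads are sampled in proportion to abundance."""
    rng = random.Random(seed)
    genes = list(abundances)
    weights = [abundances[g] for g in genes]
    sampled = rng.choices(genes, weights=weights, k=n_reads)
    return len(set(sampled))

def normalize(abundances, cap):
    """Toy normalization (an assumption): cap every transcript at `cap` copies."""
    return {gene: min(copies, cap) for gene, copies in abundances.items()}

# Hypothetical cell: 10 highly expressed genes and 990 rare ones.
raw = {f"abundant_{i}": 10_000 for i in range(10)}
raw.update({f"rare_{i}": 1 for i in range(990)})

raw_detected = genes_detected(raw, n_reads=500)
normalized_detected = genes_detected(normalize(raw, cap=10), n_reads=500)
print("genes detected without normalization:", raw_detected)
print("genes detected with normalization:   ", normalized_detected)
```

With the same 500 reads, the normalized library should recover far more of the rare genes, because the reads are no longer soaked up by the ten abundant transcripts.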
The T. dubius expressed sequences are then used as a “scaffold” or reference sequence. Once the scaffold from one species is assembled, total mRNA is collected from all three species (T. dubius, T. pratensis, T. miscellus), non-normalized cDNA libraries are created, and short reads are generated for all three species using Illumina sequencing. These short reads are then aligned to the reference transcriptome, or scaffold.
Let’s explain the sequencing technology in lay terms. Both Roche 454 and Illumina take the DNA, break it up into smaller fragments, add adapters onto the ends of the fragments, and then immobilize them on a solid surface or support; this represents “the library”. Because the starting material here is mRNA, it needs to be reverse transcribed to cDNA before creating the library. Older methods of library making involved cloning DNA fragments into bacteria, but sometimes these fragments would be lost. This doesn’t happen in these “cell-free” systems.
In Roche 454, 1 DNA template (yes, I mean 1) is attached to 1 bead in a droplet of water contained within oil. Inside the droplet are the reagents needed to amplify (via emulsion PCR) a single molecule. Emulsion PCR results in a population of clonally amplified copies of that single initial template on each bead. The beads carrying these populations of templates are then placed in the wells of a picotiter plate, packed with enzymes that enable sequencing reactions.
So now imagine each bead has a population of clonally amplified templates and the picotiter plate or glass slide has thousands upon thousands of these beads, each simultaneously undergoing a sequencing reaction. This is what is meant by massively parallel sequencing. Nucleotides are washed over the plate one type at a time, and when the complementary dNTP flows into a well, the polymerase extends the primer and light is emitted and recorded. The order and intensity of the light peaks are recorded, and this is what reveals the underlying DNA sequence. Per experiment, at last count, 40Mb of data are produced and read lengths average 400bp. Homopolymer runs (series of As or Ts) and hairpin loops in the DNA cause the most errors in the sequencing reads.
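A small Python sketch may help show how a sequence falls out of the order and intensity of the light peaks, and why homopolymers are error prone. It is a simplified model, not the instrument's actual base caller: the flow order "TACG", the idealized signal (exactly proportional to run length), and the function names are all assumptions made for illustration.

```python
def simulate_flowgram(read, flow_order="TACG"):
    """Return (nucleotide, signal) pairs for synthesizing `read` with ideal chemistry.

    Each flow washes one nucleotide over the well; the signal equals the number
    of identical bases incorporated in a row (0 if that nucleotide does not match).
    """
    if set(read) - set(flow_order):
        raise ValueError("read contains bases that are never flowed")
    signals, pos, flow = [], 0, 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, float(run)))
        flow += 1
    return signals

def call_bases(signals):
    """Turn light intensities back into sequence by rounding each signal."""
    return "".join(base * round(signal) for base, signal in signals)

ideal = simulate_flowgram("TTACGGGA")
print(ideal)
print(call_bases(ideal))

# A noisy reading of the GGG run (3.6 instead of 3.0) is rounded to four Gs.
noisy = [(base, 3.6 if base == "G" else signal) for base, signal in ideal]
print(call_bases(noisy))
```

The longer the run of identical bases, the smaller the relative difference between neighbouring intensities (3 versus 4 is harder to tell apart than 1 versus 2), which is why homopolymers produce insertion and deletion errors.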
Using Roche 454, Buggs et al. (2010) obtained 822 594 reads that were on average 237bp long. They assembled these reads into 33 515 contigs that were on average 439bp long (14.6Mb in total). These contigs represent sets of overlapping DNA segments of the T. dubius transcriptome, the expressed portion of the genome. To provide some context, of the 125 Mb genome of Arabidopsis, 25,540 genes have been annotated as protein coding (Yamada et al. 2003). If the average gene length is 2000bp, then this suggests that about 40% of the genome is transcriptionally active. (This is surprisingly low to me, especially since 74% of the nonrepetitive sequence of the yeast genome is transcribed.) How much of the T. dubius transcriptome, then, did the 33 515 contigs capture? And if there is missing data, wouldn’t using this as a reference lead to a fair number of unmapped reads (from the Illumina data) when examining the other species?
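The back-of-the-envelope numbers above can be checked in a few lines of Python. All the inputs are the figures quoted in this post; the 2000bp average gene length is the same rough assumption made in the text, not a measured value.

```python
# Assembly size implied by the contig count and mean contig length
n_contigs = 33_515
mean_contig_bp = 439
assembled_mb = n_contigs * mean_contig_bp / 1e6

# Rough share of the Arabidopsis genome occupied by annotated genes
arabidopsis_genome_bp = 125e6
arabidopsis_genes = 25_540
assumed_gene_bp = 2_000  # assumption: average gene length used in the post
genic_fraction = arabidopsis_genes * assumed_gene_bp / arabidopsis_genome_bp

print(f"assembled transcriptome: {assembled_mb:.1f} Mb")
print(f"Arabidopsis genome in annotated genes: {genic_fraction:.1%}")
```

The product of the rounded figures comes out at about 14.7 Mb, slightly above the 14.6Mb quoted, presumably because the mean contig length is itself rounded. The Arabidopsis estimate lands at roughly 41%, in line with the "about 40%" above.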
It’s hard to know when there is no genomic data to aid assembly. Buggs et al. did try to map this de novo transcriptome to other species in the Asteraceae and found that 64% of the T. dubius sequences matched previously characterized EST assemblies, largely from Lactuca sativa. Another possible issue with cDNA sequences is the presence of variation created by alternative splicing. Furthermore, Buggs et al. only used basal leaf tissue and are unlikely to have captured transcripts expressed in other tissues such as flowers and roots. Lastly, the coverage and depth of their 454 sequencing was quite low.
In previous work, Vera et al. (2008) sequenced the transcriptome of the fritillary butterfly (Melitaea cinxia) using Roche 454 and compared it to the genomic data from Bombyx mori. They found that the ratio of the length of their individual contigs to the length of the coding region of B. mori increased to 1 (both were the same length) with increasing depth of coverage. Contigs with at least 10X coverage were the same length as the orthologous coding region. 10X coverage simply means that any one base pair is covered, on average, by 10 different reads. These authors also suggest that, despite normalizing the library, some of the variation in average coverage depth may be due to poorer coverage of the rarer transcripts. These rarer transcripts may be represented by what are called singleton reads, those that don’t assemble into any particular contig.
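For readers who like to see coverage as a calculation, here is a minimal Python sketch. The contig length and the read positions are made up for illustration, and reads are given as hypothetical (start, end) intervals on the contig, with a 0-based start and an exclusive end.

```python
def per_base_depth(contig_length, alignments):
    """Count how many aligned reads cover each position of a contig."""
    depth = [0] * contig_length
    for start, end in alignments:
        # Clip each read to the contig so overhanging reads do not cause errors.
        for i in range(max(start, 0), min(end, contig_length)):
            depth[i] += 1
    return depth

def mean_depth(depth):
    """Average coverage across the contig (the 'X' in 10X)."""
    return sum(depth) / len(depth) if depth else 0.0

# Three hypothetical reads aligned to a 10 bp contig.
alignments = [(0, 5), (3, 10), (4, 6)]
depth = per_base_depth(10, alignments)
print(depth)
print(mean_depth(depth))
```

A contig reported at 10X would have a mean of 10 here; individual positions can still sit well above or below that average.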
In the Buggs et al. work, coverage depended on contig length. Some contigs had a depth of 10-100X while others had only 1-9X coverage. Deep coverage is an advantage of parallel sequencing of multiple DNA fragments: you have a way of checking any errors that might be introduced by the sequencing or PCR reactions.
Okay, so although we know the reference may not cover everything, it’s probably sufficient for the question at hand: does the silencing of homeologs scale up? To get to the answer, they used Illumina sequencing to see what happened to the transcripts from all three species. Because my next post will be on RNA sequencing using Illumina technology, I don’t want to describe cyclic reversible terminator sequencing here, except to say: way cool.
Buggs et al. (2010) generated 7 128 226, 6 840 425, and 6 729 215 reads from T. dubius, T. pratensis, and T. miscellus, respectively. Of the pooled reads, 53.4% aligned. Why so few? What happened to the other reads? Is this typical? Well, in a model organism like yeast, only 56% of the Illumina sequence reads were mapped to unique genomic regions (Nagalakshmi et al. 2008). So their reads seem to be matching what is expected. But what I would want to know is what proportion of that 53.4% was T. dubius, T. pratensis, or T. miscellus. Unfortunately, because the authors pooled the reads, we don’t know.
Here’s where the technology lends a hand to the evolutionary question: they used these reads to identify thousands of SNPs between the two parental species, 7782 within 2885 contigs, with each SNP having at least 10X coverage. Of those, 2989 had representation in the T. miscellus reads. It turns out that 69% of SNPs showed equal expression in T. miscellus, 22% showed differential expression, and 8.5% showed no expression of one parental copy, or potential homeolog loss. Most of the loss or silencing of homeologs came from T. dubius. Paradoxically, of the SNPs that were differentially expressed in T. miscellus, there was a bias toward expression of the T. dubius homeolog.
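To give a feel for how reads at a parental SNP could be sorted into those three categories, here is a hedged Python sketch. It is not the authors' method: the exact binomial test against a 50:50 expectation, the 0.05 threshold, the minimum depth of 10, and the function names are all assumptions chosen for illustration. The inputs are the numbers of T. miscellus reads carrying the T. dubius allele and the T. pratensis allele at one SNP.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n trials when p = 0.5."""
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def classify_snp(dubius_reads, pratensis_reads, min_depth=10, alpha=0.05):
    """Label one SNP by how the allopolyploid's reads split between parental alleles."""
    total = dubius_reads + pratensis_reads
    if total < min_depth:
        return "insufficient coverage"
    if dubius_reads == 0 or pratensis_reads == 0:
        # Only one parental allele is seen: silencing or loss of the other homeolog.
        return "one homeolog not detected"
    if two_sided_binomial_p(dubius_reads, total) < alpha:
        return "differential expression"
    return "equal expression"

# Hypothetical read counts (T. dubius allele, T. pratensis allele)
for counts in [(12, 10), (30, 5), (0, 25), (3, 2)]:
    print(counts, classify_snp(*counts))
```

One caveat on the design: testing against an even split assumes the two parental copies would be equally expressed to begin with, which need not be true if the diploid parents themselves differ in expression.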
This study, in a matter of a few months, generated 7782 homeolog specific SNP markers within 2885 unique contigs.   Using next generation technology increased the number of genes available to study by two orders of magnitude compared to the “one gene at a time” approach.
In the past, we had a wealth of genetic data on a few laboratory based organisms and a wealth of ecological data on so many other non-model species. But if it’s possible to generate several thousand SNPs in multiple coding regions across the genome in a matter of a few months, then it means that we have the tools to answer questions that could previously be answered only using lab based model systems. One possibility that comes to mind immediately is examining the underlying genetic basis of adaptation and ecological interactions in natural populations of non-model organisms.
I feel, however, that one challenge with next generation sequencing technologies, and the enormous wave of data they generate, is to remain firmly grounded in the ecological and evolutionary questions that piqued our curiosity in the first place. It’s one thing to ride a wave, but a completely different thing to get swept out to sea.
NEXT TIME:  RNA-seq and Illumina’s reversible terminator
Works Cited:
Buggs RJ, Chamala S, Wu W, Gao L, May GD, Schnable PS, Soltis DE, Soltis PS, & Barbazuk WB (2010). Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping. Molecular Ecology, 19 Suppl 1, 132-46 PMID: 20331776
Tautz et al. 2010. Next generation Molecular Ecology. Molecular Ecology 19,s1: 1-3.
Vera et al. 2008. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molecular Ecology 17: 1636-1647.
Yamada et al. 2003. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842-846.
Nagalakshmi et al. 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320: 1344-1349.
Margulies et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.
Metzker, M. 2010. Sequencing technologies – the next generation. Nature Reviews Genetics 11: 31-46.
Lynch and Conery 2000. The evolutionary fate and consequences of duplicated genes. Science 290: 1151-1155.
Molecular Ecology has a special supplemental issue called Next Generation Molecular Ecology.  Contributors examine how next generation sequencing is used in the fields of:  taxonomy, single nucleotide polymorphism (SNP) detection and analysis, expression studies, and tracing adaptations and bioinformatic developments.
