Finding homologous genetic regions (let’s ignore the homolog, ortholog, paralog distinction) across “genome-enabled” organisms is a handy thing to know how to do. Yet this task sometimes appears harder than it really is, particularly given the exceptional resources that exist to make life easier… like the UCSC Genome Browser. Below, I’ll lay out, in a set of screenshots, how you might go about such a task.
Let’s assume the following scenario: you have a gene in which you are interested (myh6 in zebrafish, perhaps). You would like to find the putative location of this gene in another organism using sequence similarity searches – perhaps for the purpose of aligning homologous regions across several organisms for any number of purposes.
So, the question is, how do you do that? The answers are many. Below is one.
Because we’re talking about genes, we need to remember that genes consist of introns and exons. Exons are essentially what make up the expressed portion of the gene. This expressed portion, for simplicity, we’ll call mRNA. Because changes to the expressed sequence have important consequences, exons tend to be more conserved across organisms than introns or intergenic regions. Thus, since exons are the “business-end” of genes and exons make up mRNA, we are going to proceed by finding the mRNA sequence of myh6.
Then, we are going to use a gapped alignment to align this mRNA sequence to the genomic DNA sequence of another organism (stickleback!). The alignment needs to be gapped because the mRNA contains only exons, while the genomic DNA also contains the introns that sit between them. Finally, we’re going to grab this DNA region so we can do stuff with it (what you do is entirely your business – maybe I’ll cover some options in the future).
Get thee to the genome browser…
Now that you’re there, let’s find myh6 by filling in the “gene” box with “myh6”:
If all went well, the genome browser should fire you on over to the genomic position of myh6 in the zebrafish genome:
Note that the browser shows you where the gene is positioned within the genome. Feel free to zoom in and out, check out different options, and generally poke around. You can always return here by following the above steps.
Get the mRNA sequence
As I mentioned above, we’re going to use the mRNA sequence that is associated with myh6 as the means to locate myh6 in other organisms. So, we should probably get the mRNA sequence. As the above screenshot shows, just click on the purple color that represents the myh6 gene. This should take you to the following page:
Conveniently located 4/5ths of the way down the page is the link to the myh6 mRNA sequence. Click that and BEHOLD (!):
Copy this sequence somewhere (a text file, a word document, etc.).
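If you saved the sequence to a text file, a few lines of Python are enough to read it back in and check that you copied the whole thing. This is only a minimal sketch: it assumes the sequence is in FASTA format (a ">" header line followed by sequence lines), and the record in the example is made up, not the real myh6 sequence.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the identifier and the bases are invented for illustration.
example = """>example_mrna_id made-up description
ATGGCTGATG
CCGAAATGTA
"""

records = read_fasta(example)
header, seq = records[0]
seq_id = header.split()[0]  # first word of the header, usually the accession
print(seq_id, len(seq))
```

The first word of the header is usually the accession, so pulling it out this way also gives you something to search with later.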
Digressions are the spice of life
Because you get the NCBI Accession/RefSeq ID in the header of the mRNA you just requested, you can head over to NCBI Entrez and punch that bad-boy into the box (meaning type the accession number into the box):
If you click on the “Nucleotide” link, you can view the actual nucleotide record for the reference sequence (RefSeq):
And, if you click on the “Gene” link (back at NCBI Entrez), you can view handy annotation data related to myh6 including GO terms, citations, similar mRNAs and ESTs, etc.:
Searching for… something
Now that we have the mRNA sequence for myh6, we are going to try and align that to the genomic DNA of another organism, in this case another fish: stickleback (lovely creature). To do this, we are going to run a gapped alignment using BLAT. Go to the BLAT page at the UCSC Genome Browser. Then, paste the mRNA sequence from earlier into the box, and make sure that “Stickleback” is selected in the pulldown to the left, under “Genome:” (you can certainly try other creatures, but you might want to stick with fish):
Run the search by clicking “Submit”. You should get the following output:
There are several options and lots of data here. Feel free to play around. More information on BLAT and several of its parameters is in the BLAT FAQ (FAQblat). Bottom line is that the best match here, by far, is the first match in the list. For the sake of simplicity, I’ll assume that this match is both “real” and “correct” – meaning that this first match is the homolog of myh6 in stickleback. You should never assume this without giving some thought to your assumptions.
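If you would rather rank the matches yourself, BLAT results can also be obtained as PSL, a tab-separated text format with 21 columns per hit. The sketch below is an assumption-laden illustration, not the browser's own code: the score formula is my reading of the one described in the UCSC BLAT FAQ for nucleotide searches, the function names are hypothetical, and the two hits are invented numbers rather than real myh6 results.

```python
def psl_score(fields):
    """Nucleotide BLAT score (assumed formula, per my reading of the UCSC FAQ)."""
    matches = int(fields[0])
    mismatches = int(fields[1])
    rep_matches = int(fields[2])
    q_num_insert = int(fields[4])
    t_num_insert = int(fields[6])
    return matches + rep_matches // 2 - mismatches - q_num_insert - t_num_insert


def rank_hits(psl_text):
    """Return hits from PSL text, sorted from highest to lowest score."""
    hits = []
    for line in psl_text.splitlines():
        fields = line.split("\t")
        # Header lines do not have 21 columns starting with a number.
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        hits.append({
            "score": psl_score(fields),
            "strand": fields[8],
            "target": fields[13],
            "t_start": int(fields[15]),
            "t_end": int(fields[16]),
        })
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)


# Two made-up hits; the last three columns (block lists) are placeholders.
weak = ["200", "40", "10", "0", "2", "12", "4", "900", "-", "query_mrna",
        "1000", "300", "562", "groupY", "300000", "2000", "3150", "5",
        "0,", "0,", "0,"]
strong = ["900", "50", "0", "0", "3", "30", "10", "4000", "+", "query_mrna",
          "1000", "10", "990", "groupX", "500000", "10000", "14950", "11",
          "0,", "0,", "0,"]
psl_text = "\n".join("\t".join(row) for row in (weak, strong))

hits = rank_hits(psl_text)
for hit in hits:
    print(hit["target"], hit["t_start"], hit["t_end"], hit["score"])
```

A big gap between the top score and the rest, as in this invented example, is the pattern that makes the first match look like the one to chase. It is still not proof of homology.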
Map it out
If you select the “browser” link to the left of the search results, you’ll see a visual representation of the search results (here, I have selected the first, and best, result):
The gray blocks (that actually appear mostly red) show the alignment blocks of zebrafish mRNA to stickleback DNA, and the red hash-marks show mismatched bases (zoom in to see this better).
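Those alignment blocks are also written out in the last columns of a PSL hit (blockSizes and tStarts), so you can work out where the blocks land on the genomic DNA and how big the gaps between them are. Since we aligned mRNA to genomic DNA, the blocks roughly correspond to exons and the gaps to introns. This is a minimal sketch with hypothetical function names and invented numbers; it assumes a nucleotide query, where the target starts are 0-based positions on the forward strand of the target.

```python
def parse_list(field):
    """Turn a PSL comma-separated list such as '100,150,' into integers."""
    return [int(item) for item in field.split(",") if item]


def target_blocks(block_sizes, t_starts):
    """Return (start, end) pairs for each alignment block on the target."""
    sizes = parse_list(block_sizes)
    starts = parse_list(t_starts)
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and tStarts must have the same length")
    return [(start, start + size) for start, size in zip(starts, sizes)]


def gap_sizes(blocks):
    """Return the size of each gap between consecutive blocks."""
    return [nxt[0] - cur[1] for cur, nxt in zip(blocks, blocks[1:])]


# Made-up values: three blocks separated by two gaps.
blocks = target_blocks("100,150,200,", "1000,1500,2200,")
gaps = gap_sizes(blocks)
print(blocks)
print(gaps)
```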
Grab what you need
At the outset, I said we were interested in grabbing the DNA sequence of this putative homolog – perhaps for alignment, perhaps for primer design, perhaps for fun. In order to quickly grab the DNA sequence corresponding to this match, find the DNA link:
Once you click on that, you’ll be taken to the “Get DNA for” page. On this page, you can extract a piece of genomic DNA (the coordinates should be filled, automagically, to your current view), pad the region you are excising (I like to pad all “genes” by 100 bp or so, particularly if I’m going to align the resulting DNA, anyway), and choose whether or not you want the result to be masked (you probably do):
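The padding itself is simple arithmetic, and it is worth seeing what it does at the edges of a chromosome or scaffold. Here is a small sketch; the function name, the scaffold name and the coordinates are all made up, and it assumes 1-based, inclusive positions of the kind you type into the browser's position box.

```python
def pad_region(chrom, start, end, pad=100, chrom_size=None):
    """Pad a 1-based inclusive region on both sides, clamped to the sequence."""
    if start > end:
        raise ValueError("start must not be greater than end")
    padded_start = max(1, start - pad)  # never run off the left edge
    padded_end = end + pad
    if chrom_size is not None:
        padded_end = min(chrom_size, padded_end)  # nor off the right edge
    return f"{chrom}:{padded_start}-{padded_end}"


# Hypothetical scaffold name and coordinates.
print(pad_region("groupX", 10001, 14950))
print(pad_region("groupX", 50, 400, chrom_size=450))
```

The clamping only matters for hits near the ends of a sequence, which is more common than you might expect in genomes assembled into many small scaffolds.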
Once you click “get DNA”, you’ll be presented with the following (your DNA!):
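If you asked for repeats to be masked to lower case, the repeats are still readable in the sequence, just lower-cased ("soft" masking). Some downstream tools want them replaced with N instead ("hard" masking). The sketch below assumes that lower case marks repeats, which is the convention I am relying on here; the function names and the sequence are made up.

```python
def hard_mask(seq):
    """Replace every lower-case (soft-masked) base with N."""
    return "".join("N" if base.islower() else base for base in seq)


def masked_fraction(seq):
    """Return the fraction of bases that are soft-masked (lower case)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


# Invented sequence: four of the ten bases are soft-masked.
soft = "ACGTacgtAC"
hard = hard_mask(soft)
fraction = masked_fraction(soft)
print(hard, fraction)
```

A region that is mostly masked is a hint that your "gene" sits in repetitive sequence, which is worth knowing before you try to align it or design primers.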
If you are pulling homologous sequences for alignment, you basically rinse and repeat – starting a new search in a new organism using the mRNA sequence from zebrafish and pasting the resulting DNA sequences into
The title of this post is an off-hand reference to the rather awesome "Learn You a Haskell for Great Good", which is part of a genre of kooky (and fun) technical manuals, likely started by "Why's (Poignant) Guide to Ruby".