Get your protein

This weekend, I was doing a little work on one of our projects where we are using various cpDNA genes. I really needed to get a number of protein sequences from Genbank for the products of chloroplast genes. If I had a few extra hours on hand and felt like driving myself insane, I could have poked and prodded my way through the various portals to NCBI and used my browser to extract the necessary information from the depths of Genbank et al..
But, for the most part, I’ve got better things to do. And, the web-interface to Genbank can be maddeningly inconsistent across the various databases. So, a reasonable approach to the problem is to gather the data from NCBI programatically.

Here’s what I had to start:

A list of fasta headers from cpDNAs I extracted from Genbank, like so:

gi|91983971|ref|NC_007957.1| Vitis vinifera chloroplast, complete genome
gi|52220789|ref|NC_006290.1| Panax ginseng chloroplast, complete genome
gi|300388125|ref|NC_013991.2| Phoenix dactylifera chloroplast, complete genome
gi|94502469|ref|NC_007977.1| Helianthus annuus chloroplast, complete genome
gi|139390328|ref|NC_009265.1| Aethionema cordifolium chloroplast, complete genome
gi|189162250|ref|NC_010776.1| Fagopyrum esculentum subsp. ancestrale chloroplast, complete genome
gi|68164782|ref|NC_007144.1| Cucumis sativus chloroplast, complete genome
gi|149390334|ref|NC_009601.1| Dioscorea elephantipes chloroplast, complete genome
gi|17570783|ref|NC_003119.6| Medicago truncatula chloroplast, complete genome
gi|116617087|ref|NC_008535.1| Coffea arabica chloroplast, complete genome
gi|283794947|ref|NC_013707.1| Olea europaea chloroplast, complete genome
gi|32399358|emb|AJ428413.1| Calycanthus fertilis var. ferax complete chloroplast genome
gi|114107025|ref|NC_008326.1| Liriodendron tulipifera chloroplast, complete genome
gi|225544104|ref|NC_012224.1| Jatropha curcas chloroplast, complete genome
gi|313199681|ref|NC_014676.1| Theobroma cacao chloroplast, complete genome
gi|309322431|ref|NC_014570.1| Eucalyptus grandis chloroplast, complete genome
gi|115605001|ref|NC_008457.1| Piper cenocladum chloroplast, complete genome
gi|122893969|ref|NC_008796.1| Ranunculus macranthus chloroplast, complete genome
gi|114804244|ref|NC_008359.1| Morus indica chloroplast, complete genome
gi|114329635|ref|NC_008334.1| Citrus sinensis chloroplast, complete genome
gi|159161233|ref|NC_009963.1| Cuscuta exaltata chloroplast, complete genome

A list of the genes in which I was interested

psbA
ndhJ
matK

Now, here’s what I want to do… for each record in species.txt, I want to get the protein sequence of the particular genes in genes.txt. I chose to do this in python, because it is my language of choice (good introductions to the language are here and here, and additional sources can be found in this list). In addition to the standard python library, I will also use the biopython library, which provides functions to interact with NCBI. What follows assumes you have some familiarity with python (see above), although I’ve included lots of comments in the code.
Now, before we get started with the code, some words of caution: NCBI has strict policies regarding programmatic access to NCBI resources. Make SURE you follow these rules or your IP address will be banned, which you don’t want.

we need to import the modules we are going to use and set our email address to something useful
so NCBI can contact us if we are hitting their servers too hard:

import sys
import time
from Bio import Entrez
Entrez.email = "your.email@domain.tld"
if not Entrez.email:
    print "you must add your email address"
    sys.exit(2)

we also need to parse each species.txt file, to give us a list of the species in which we’re interested:

# create an empty list we will fill with the gene names
species = []
# open the file and begin to iterate over the lines
for line in open('species.txt', 'r'):
    # at each line, get the value, strip the line ending
    # (\n), split the string wherever there are '|' characters
    # and add the fourth element (the GI) of the split string
    # to our species list
    name = line.strip('\n').split('|')[3]
    species.append(name)
print "species: ", species

and, we need to parse the gene.txt file, above, to give us the list of genes we want:

# create an empty list we will fill with the gene names
genes = []
# open the file and begin to iterate over the lines
for line in open('genes.txt', 'r'):
    # at each line, get the value, strip the line ending
    # (\n) from the string, and add the gene name
    # to our list
    genes.append(line.strip('\n'))
print "genes: ", genes

Now, to the business end of things. For each gene, for each species, go to NCBI and get the data we’re after

# for each gene in the gene list
for gene in genes:
    # Print out the gene name
    print "gene: ", gene
    # Open a gene-specific output file, to which we will write our
    # results. The string substitution makes the output file name
    # start with the name of the each gene, as we iterate
    # `for gene in genes`
    output = open('{0}.fasta'.format(gene), 'w')
    # for each taxon in the species list
    for taxon in species:
        try:
            # get rid of the taxon version number
            taxon = taxon.split('.')[0]
            # print the taxon name.  The "\t" means print a tab
            #preceding the taxon name
            print "\ttaxon: ", taxon
            # We know the *name* of the gene that we want, and the
            # accession number of the species in which it is
            # found, but we do not know the GI number of the
            # gene (and protein product) we want. So we need to
            # get that first, since it is the best reference to
            # other items in NCBI, like the protein sequence.
            # So, we first build a search  string for an Entrez
            # search query. The result will contain the GI of
            # the given gene in the given species.
            #
            # Substitute the taxon and gene name into the string
            terms = "{0}[accn] AND {1}[Gene]".format(taxon, gene)
            # Use the biopython Entrez class and esearch method to
            # search the Gene db using the terms we've defined
            # above. Entrez Esearch's function is to return
            # primary identifier (GIs) of records. The results
            # are returned as XML.
            result = Entrez.esearch(db = 'Gene', term=terms)
            # Parse the XML using `read()` method of the Entrez
            # class
            record = Entrez.read(result)
            # Get the GI from the first (and only, hopefully)
            # record that was returned.
            gi = record["IdList"][0]
            # Now that we have the GI number, let's get the actual
            # record for the gene which is similar to a page like
            # this (but in XML versus HTML):
            #
            # http://www.ncbi.nlm.nih.gov/gene/4025118
            gene_record = Entrez.efetch(db="gene", id=gi,
                retmode="xml")
            # Parse the XML using `read()` method of the Entrez
            # class
            xml = Entrez.read(gene_record)
            # The `read()` method parses the returned XML to give
            # us many, many nested dictionaries.  This is slightly
            # confusing because the structure returned actually
            # contains lists nested within dictionaries (for a
            # number of reasons). Something like this:
            #
            # xml = {
            #            [
            #                {'record':
            #                    [
            #                        {'name1':value}
            #                    ]
            #                }
            #            ]
            #        }
            #
            # So, what we are doing below is getting the first
            # element of every list then using the "key" that we
            # know (e.g. "Entrezgene_locus") to return the value
            # it is associated with, which is usually another
            # list. Then, we get the first (0th) element of that
            # list, which is a dictionary. Then, we get use the
            # "key" ('Gene-commentary_products') to get the next
            # "value", and so forth.
            products = xml[0]['Entrezgene_locus'][0]\
                             ['Gene-commentary_products'][0]
            # Ensure we've grabbed the correct thing by making
            # sure that the "type" entry of the item we are on is
            # "peptide".
            assert products['Gene-commentary_type']\
                .attributes['value']== 'peptide'
            # Get the accession number for the protein
            prot_accession = products['Gene-commentary_accession']
            # Print the accession number, preceded by two tab
            # characters
            print "\t\taccession: ", prot_accession
            # Get the GI of the protein - this is in a different
            # place that the Accession number - so we traverse
            # the hierarchy again
            prot_gi = xml[0]['Entrezgene_locus'][0]\
                            ['Gene-commentary_products'][0]\
                            ['Gene-commentary_seqs'][0]\
                            ['Seq-loc_whole']['Seq-id']\
                            ['Seq-id_gi']
            print "\t\tprotein GI: ", prot_gi
            # Finally, use the `efetch()` method of Entrez to grab
            # the fasta sequence of the protein we're after.
            seq = Entrez.efetch(db="protein", id = prot_gi,
                retmod='text', rettype='fasta')
            # In one fell swoop, read in the result, and write it
            # out to our gene-specific file
            output.write(seq.read().replace('\n\n', '\n'))
            # Sleep for 0.3 seconds, to keep NCBI servers happy
            time.sleep(0.3)
        except IndexError:
            error = "Unable to locate record.\n"
            print error
            output.write(error)
            # Sleep for 0.3 seconds, to keep NCBI servers happy
            time.sleep(0.3)
    # Flush the buffer (push the results into the file and close
    # the file. This essentially "writes" the results to the file.
    output.close()

Putting it all together, we get the following, which is also available over at github (without the line breaks required here to have pretty formatting):

import sys
import time
from Bio import Entrez
Entrez.email = "your.email@domain.tld"
if not Entrez.email:
    print "you must add your email address"
    sys.exit(2)
# create an empty list we will fill with the gene names
genes = []
# open the file and begin to iterate over the lines
for line in open('genes.txt', 'r'):
    # at each line, get the value, strip the line ending
    # (\n) from the string, and add the gene name
    # to our list
    genes.append(line.strip('\n'))
print "genes: ", genes
# create an empty list we will fill with the gene names
species = []
# open the file and begin to iterate over the lines
for line in open('species.txt', 'r'):
    # at each line, get the value, strip the line ending
    # (\n), split the string wherever there are '|' characters
    # and add the fourth element (the GI) of the split string
    # to our species list
    name = line.strip('\n').split('|')[3]
    species.append(name)
print "species: ", species
# for each gene in the gene list
for gene in genes:
    # Print out the gene name
    print "gene: ", gene
    # Open a gene-specific output file, to which we will write our
    # results. The string substitution makes the output file name
    # start with the name of the each gene, as we iterate
    # `for gene in genes`
    output = open('{0}.fasta'.format(gene), 'w')
    # for each taxon in the species list
    for taxon in species:
        try:
            # get rid of the taxon version number
            taxon = taxon.split('.')[0]
            # print the taxon name.  The "\t" means print a tab
            #preceding the taxon name
            print "\ttaxon: ", taxon
            # We know the *name* of the gene that we want, and the
            # accession number of the species in which it is
            # found, but we do not know the GI number of the
            # gene (and protein product) we want. So we need to
            # get that first, since it is the best reference to
            # other items in NCBI, like the protein sequence.
            # So, we first build a search  string for an Entrez
            # search query. The result will contain the GI of
            # the given gene in the given species.
            #
            # Substitute the taxon and gene name into the string
            terms = "{0}[accn] AND {1}[Gene]".format(taxon, gene)
            # Use the biopython Entrez class and esearch method to
            # search the Gene db using the terms we've defined
            # above. Entrez Esearch's function is to return
            # primary identifier (GIs) of records. The results
            # are returned as XML.
            result = Entrez.esearch(db = 'Gene', term=terms)
            # Parse the XML using `read()` method of the Entrez
            # class
            record = Entrez.read(result)
            # Get the GI from the first (and only, hopefully)
            # record that was returned.
            gi = record["IdList"][0]
            # Now that we have the GI number, let's get the actual
            # record for the gene which is similar to a page like
            # this (but in XML versus HTML):
            #
            # http://www.ncbi.nlm.nih.gov/gene/4025118
            gene_record = Entrez.efetch(db="gene", id=gi,
                retmode="xml")
            # Parse the XML using `read()` method of the Entrez
            # class
            xml = Entrez.read(gene_record)
            # The `read()` method parses the returned XML to give
            # us many, many nested dictionaries.  This is slightly
            # confusing because the structure returned actually
            # contains lists nested within dictionaries (for a
            # number of reasons). Something like this:
            #
            # xml = {
            #            [
            #                {'record':
            #                    [
            #                        {'name1':value}
            #                    ]
            #                }
            #            ]
            #        }
            #
            # So, what we are doing below is getting the first
            # element of every list then using the "key" that we
            # know (e.g. "Entrezgene_locus") to return the value
            # it is associated with, which is usually another
            # list. Then, we get the first (0th) element of that
            # list, which is a dictionary. Then, we get use the
            # "key" ('Gene-commentary_products') to get the next
            # "value", and so forth.
            products = xml[0]['Entrezgene_locus'][0]\
                             ['Gene-commentary_products'][0]
            # Ensure we've grabbed the correct thing by making
            # sure that the "type" entry of the item we are on is
            # "peptide".
            assert products['Gene-commentary_type']\
                .attributes['value']== 'peptide'
            # Get the accession number for the protein
            prot_accession = products['Gene-commentary_accession']
            # Print the accession number, preceded by two tab
            # characters
            print "\t\taccession: ", prot_accession
            # Get the GI of the protein - this is in a different
            # place that the Accession number - so we traverse
            # the hierarchy again
            prot_gi = xml[0]['Entrezgene_locus'][0]\
                            ['Gene-commentary_products'][0]\
                            ['Gene-commentary_seqs'][0]\
                            ['Seq-loc_whole']['Seq-id']\
                            ['Seq-id_gi']
            print "\t\tprotein GI: ", prot_gi
            # Finally, use the `efetch()` method of Entrez to grab
            # the fasta sequence of the protein we're after.
            seq = Entrez.efetch(db="protein", id = prot_gi,
                retmod='text', rettype='fasta')
            # In one fell swoop, read in the result, and write it
            # out to our gene-specific file
            output.write(seq.read().replace('\n\n', '\n'))
            # Sleep for 0.3 seconds, to keep NCBI servers happy
            time.sleep(0.3)
        except IndexError:
            error = "Unable to locate record.\n"
            print error
            output.write(error)
            # Sleep for 0.3 seconds, to keep NCBI servers happy
            time.sleep(0.3)
    # Flush the buffer (push the results into the file and close
    # the file. This essentially "writes" the results to the file.
    output.close()

Get your protein

You may also be interested in

Subscribe by email

Meta

Get your protein

Share this:

You may also be interested in

Subscribe by email

Meta