Get your protein

This weekend, I was doing a little work on one of our projects where we are using various cpDNA genes. I really needed to get a number of protein sequences from Genbank for the products of chloroplast genes. If I had a few extra hours on hand and felt like driving myself insane, I could have poked and prodded my way through the various portals to NCBI and used my browser to extract the necessary information from the depths of Genbank et al..
But, for the most part, I’ve got better things to do. And, the web-interface to Genbank can be maddeningly inconsistent across the various databases. So, a reasonable approach to the problem is to gather the data from NCBI programatically.

Here’s what I had to start:

  1. A list of fasta headers from cpDNAs I extracted from Genbank, like so:
  2. gi|91983971|ref|NC_007957.1| Vitis vinifera chloroplast, complete genome
    gi|52220789|ref|NC_006290.1| Panax ginseng chloroplast, complete genome
    gi|300388125|ref|NC_013991.2| Phoenix dactylifera chloroplast, complete genome
    gi|94502469|ref|NC_007977.1| Helianthus annuus chloroplast, complete genome
    gi|139390328|ref|NC_009265.1| Aethionema cordifolium chloroplast, complete genome
    gi|189162250|ref|NC_010776.1| Fagopyrum esculentum subsp. ancestrale chloroplast, complete genome
    gi|68164782|ref|NC_007144.1| Cucumis sativus chloroplast, complete genome
    gi|149390334|ref|NC_009601.1| Dioscorea elephantipes chloroplast, complete genome
    gi|17570783|ref|NC_003119.6| Medicago truncatula chloroplast, complete genome
    gi|116617087|ref|NC_008535.1| Coffea arabica chloroplast, complete genome
    gi|283794947|ref|NC_013707.1| Olea europaea chloroplast, complete genome
    gi|32399358|emb|AJ428413.1| Calycanthus fertilis var. ferax complete chloroplast genome
    gi|114107025|ref|NC_008326.1| Liriodendron tulipifera chloroplast, complete genome
    gi|225544104|ref|NC_012224.1| Jatropha curcas chloroplast, complete genome
    gi|313199681|ref|NC_014676.1| Theobroma cacao chloroplast, complete genome
    gi|309322431|ref|NC_014570.1| Eucalyptus grandis chloroplast, complete genome
    gi|115605001|ref|NC_008457.1| Piper cenocladum chloroplast, complete genome
    gi|122893969|ref|NC_008796.1| Ranunculus macranthus chloroplast, complete genome
    gi|114804244|ref|NC_008359.1| Morus indica chloroplast, complete genome
    gi|114329635|ref|NC_008334.1| Citrus sinensis chloroplast, complete genome
    gi|159161233|ref|NC_009963.1| Cuscuta exaltata chloroplast, complete genome
    
  3. A list of the genes in which I was interested
  4. psbA
    ndhJ
    matK
    

Now, here’s what I want to do… for each record in species.txt, I want to get the protein sequence of the particular genes in genes.txt. I chose to do this in python, because it is my language of choice (good introductions to the language are here and here, and additional sources can be found in this list). In addition to the standard python library, I will also use the biopython library, which provides functions to interact with NCBI. What follows assumes you have some familiarity with python (see above), although I’ve included lots of comments in the code.
Now, before we get started with the code, some words of caution: NCBI has strict policies regarding programmatic access to NCBI resources. Make SURE you follow these rules or your IP address will be banned, which you don’t want.

  1. we need to import the modules we are going to use and set our email address to something useful
    so NCBI can contact us if we are hitting their servers too hard:

    import sys
    import time
    from Bio import Entrez
    Entrez.email = "your.email@domain.tld"
    if not Entrez.email:
        print "you must add your email address"
        sys.exit(2)
    
  2. we also need to parse each species.txt file, to give us a list of the species in which we’re interested:
    # create an empty list we will fill with the gene names
    species = []
    # open the file and begin to iterate over the lines
    for line in open('species.txt', 'r'):
        # at each line, get the value, strip the line ending
        # (\n), split the string wherever there are '|' characters
        # and add the fourth element (the GI) of the split string
        # to our species list
        name = line.strip('\n').split('|')[3]
        species.append(name)
    print "species: ", species
    
  3. and, we need to parse the gene.txt file, above, to give us the list of genes we want:
    # create an empty list we will fill with the gene names
    genes = []
    # open the file and begin to iterate over the lines
    for line in open('genes.txt', 'r'):
        # at each line, get the value, strip the line ending
        # (\n) from the string, and add the gene name
        # to our list
        genes.append(line.strip('\n'))
    print "genes: ", genes
    
  4. Now, to the business end of things. For each gene, for each species, go to NCBI and get the data we’re after
    # for each gene in the gene list
    for gene in genes:
        # Print out the gene name
        print "gene: ", gene
        # Open a gene-specific output file, to which we will write our
        # results. The string substitution makes the output file name
        # start with the name of the each gene, as we iterate
        # `for gene in genes`
        output = open('{0}.fasta'.format(gene), 'w')
        # for each taxon in the species list
        for taxon in species:
            try:
                # get rid of the taxon version number
                taxon = taxon.split('.')[0]
                # print the taxon name.  The "\t" means print a tab
                #preceding the taxon name
                print "\ttaxon: ", taxon
                # We know the *name* of the gene that we want, and the
                # accession number of the species in which it is
                # found, but we do not know the GI number of the
                # gene (and protein product) we want. So we need to
                # get that first, since it is the best reference to
                # other items in NCBI, like the protein sequence.
                # So, we first build a search  string for an Entrez
                # search query. The result will contain the GI of
                # the given gene in the given species.
                #
                # Substitute the taxon and gene name into the string
                terms = "{0}[accn] AND {1}[Gene]".format(taxon, gene)
                # Use the biopython Entrez class and esearch method to
                # search the Gene db using the terms we've defined
                # above. Entrez Esearch's function is to return
                # primary identifier (GIs) of records. The results
                # are returned as XML.
                result = Entrez.esearch(db = 'Gene', term=terms)
                # Parse the XML using `read()` method of the Entrez
                # class
                record = Entrez.read(result)
                # Get the GI from the first (and only, hopefully)
                # record that was returned.
                gi = record["IdList"][0]
                # Now that we have the GI number, let's get the actual
                # record for the gene which is similar to a page like
                # this (but in XML versus HTML):
                #
                # http://www.ncbi.nlm.nih.gov/gene/4025118
                gene_record = Entrez.efetch(db="gene", id=gi,
                    retmode="xml")
                # Parse the XML using `read()` method of the Entrez
                # class
                xml = Entrez.read(gene_record)
                # The `read()` method parses the returned XML to give
                # us many, many nested dictionaries.  This is slightly
                # confusing because the structure returned actually
                # contains lists nested within dictionaries (for a
                # number of reasons). Something like this:
                #
                # xml = {
                #            [
                #                {'record':
                #                    [
                #                        {'name1':value}
                #                    ]
                #                }
                #            ]
                #        }
                #
                # So, what we are doing below is getting the first
                # element of every list then using the "key" that we
                # know (e.g. "Entrezgene_locus") to return the value
                # it is associated with, which is usually another
                # list. Then, we get the first (0th) element of that
                # list, which is a dictionary. Then, we get use the
                # "key" ('Gene-commentary_products') to get the next
                # "value", and so forth.
                products = xml[0]['Entrezgene_locus'][0]\
                                 ['Gene-commentary_products'][0]
                # Ensure we've grabbed the correct thing by making
                # sure that the "type" entry of the item we are on is
                # "peptide".
                assert products['Gene-commentary_type']\
                    .attributes['value']== 'peptide'
                # Get the accession number for the protein
                prot_accession = products['Gene-commentary_accession']
                # Print the accession number, preceded by two tab
                # characters
                print "\t\taccession: ", prot_accession
                # Get the GI of the protein - this is in a different
                # place that the Accession number - so we traverse
                # the hierarchy again
                prot_gi = xml[0]['Entrezgene_locus'][0]\
                                ['Gene-commentary_products'][0]\
                                ['Gene-commentary_seqs'][0]\
                                ['Seq-loc_whole']['Seq-id']\
                                ['Seq-id_gi']
                print "\t\tprotein GI: ", prot_gi
                # Finally, use the `efetch()` method of Entrez to grab
                # the fasta sequence of the protein we're after.
                seq = Entrez.efetch(db="protein", id = prot_gi,
                    retmod='text', rettype='fasta')
                # In one fell swoop, read in the result, and write it
                # out to our gene-specific file
                output.write(seq.read().replace('\n\n', '\n'))
                # Sleep for 0.3 seconds, to keep NCBI servers happy
                time.sleep(0.3)
            except IndexError:
                error = "Unable to locate record.\n"
                print error
                output.write(error)
                # Sleep for 0.3 seconds, to keep NCBI servers happy
                time.sleep(0.3)
        # Flush the buffer (push the results into the file and close
        # the file. This essentially "writes" the results to the file.
        output.close()
    
  5. Putting it all together, we get the following, which is also available over at github (without the line breaks required here to have pretty formatting):
    import sys
    import time
    from Bio import Entrez
    Entrez.email = "your.email@domain.tld"
    if not Entrez.email:
        print "you must add your email address"
        sys.exit(2)
    # create an empty list we will fill with the gene names
    genes = []
    # open the file and begin to iterate over the lines
    for line in open('genes.txt', 'r'):
        # at each line, get the value, strip the line ending
        # (\n) from the string, and add the gene name
        # to our list
        genes.append(line.strip('\n'))
    print "genes: ", genes
    # create an empty list we will fill with the gene names
    species = []
    # open the file and begin to iterate over the lines
    for line in open('species.txt', 'r'):
        # at each line, get the value, strip the line ending
        # (\n), split the string wherever there are '|' characters
        # and add the fourth element (the GI) of the split string
        # to our species list
        name = line.strip('\n').split('|')[3]
        species.append(name)
    print "species: ", species
    # for each gene in the gene list
    for gene in genes:
        # Print out the gene name
        print "gene: ", gene
        # Open a gene-specific output file, to which we will write our
        # results. The string substitution makes the output file name
        # start with the name of the each gene, as we iterate
        # `for gene in genes`
        output = open('{0}.fasta'.format(gene), 'w')
        # for each taxon in the species list
        for taxon in species:
            try:
                # get rid of the taxon version number
                taxon = taxon.split('.')[0]
                # print the taxon name.  The "\t" means print a tab
                #preceding the taxon name
                print "\ttaxon: ", taxon
                # We know the *name* of the gene that we want, and the
                # accession number of the species in which it is
                # found, but we do not know the GI number of the
                # gene (and protein product) we want. So we need to
                # get that first, since it is the best reference to
                # other items in NCBI, like the protein sequence.
                # So, we first build a search  string for an Entrez
                # search query. The result will contain the GI of
                # the given gene in the given species.
                #
                # Substitute the taxon and gene name into the string
                terms = "{0}[accn] AND {1}[Gene]".format(taxon, gene)
                # Use the biopython Entrez class and esearch method to
                # search the Gene db using the terms we've defined
                # above. Entrez Esearch's function is to return
                # primary identifier (GIs) of records. The results
                # are returned as XML.
                result = Entrez.esearch(db = 'Gene', term=terms)
                # Parse the XML using `read()` method of the Entrez
                # class
                record = Entrez.read(result)
                # Get the GI from the first (and only, hopefully)
                # record that was returned.
                gi = record["IdList"][0]
                # Now that we have the GI number, let's get the actual
                # record for the gene which is similar to a page like
                # this (but in XML versus HTML):
                #
                # http://www.ncbi.nlm.nih.gov/gene/4025118
                gene_record = Entrez.efetch(db="gene", id=gi,
                    retmode="xml")
                # Parse the XML using `read()` method of the Entrez
                # class
                xml = Entrez.read(gene_record)
                # The `read()` method parses the returned XML to give
                # us many, many nested dictionaries.  This is slightly
                # confusing because the structure returned actually
                # contains lists nested within dictionaries (for a
                # number of reasons). Something like this:
                #
                # xml = {
                #            [
                #                {'record':
                #                    [
                #                        {'name1':value}
                #                    ]
                #                }
                #            ]
                #        }
                #
                # So, what we are doing below is getting the first
                # element of every list then using the "key" that we
                # know (e.g. "Entrezgene_locus") to return the value
                # it is associated with, which is usually another
                # list. Then, we get the first (0th) element of that
                # list, which is a dictionary. Then, we get use the
                # "key" ('Gene-commentary_products') to get the next
                # "value", and so forth.
                products = xml[0]['Entrezgene_locus'][0]\
                                 ['Gene-commentary_products'][0]
                # Ensure we've grabbed the correct thing by making
                # sure that the "type" entry of the item we are on is
                # "peptide".
                assert products['Gene-commentary_type']\
                    .attributes['value']== 'peptide'
                # Get the accession number for the protein
                prot_accession = products['Gene-commentary_accession']
                # Print the accession number, preceded by two tab
                # characters
                print "\t\taccession: ", prot_accession
                # Get the GI of the protein - this is in a different
                # place that the Accession number - so we traverse
                # the hierarchy again
                prot_gi = xml[0]['Entrezgene_locus'][0]\
                                ['Gene-commentary_products'][0]\
                                ['Gene-commentary_seqs'][0]\
                                ['Seq-loc_whole']['Seq-id']\
                                ['Seq-id_gi']
                print "\t\tprotein GI: ", prot_gi
                # Finally, use the `efetch()` method of Entrez to grab
                # the fasta sequence of the protein we're after.
                seq = Entrez.efetch(db="protein", id = prot_gi,
                    retmod='text', rettype='fasta')
                # In one fell swoop, read in the result, and write it
                # out to our gene-specific file
                output.write(seq.read().replace('\n\n', '\n'))
                # Sleep for 0.3 seconds, to keep NCBI servers happy
                time.sleep(0.3)
            except IndexError:
                error = "Unable to locate record.\n"
                print error
                output.write(error)
                # Sleep for 0.3 seconds, to keep NCBI servers happy
                time.sleep(0.3)
        # Flush the buffer (push the results into the file and close
        # the file. This essentially "writes" the results to the file.
        output.close()
    
This entry was posted in bioinformatics, howto, methods. Bookmark the permalink.