Interacting with PubMed. Part I

As a scientist, your life’s work is your publication list. I like to be intimate with mine. Sometimes I just stare at it. I’d buy it a glass of wine if I could. Maybe even caress it softly. Sure, she ain’t much to look at, but she’s mine, and I want to show her off. And if you want a job, you’re going to want to show yours off too. So I’m going to show you how to scrape your publication list from PubMed with Python.

When I was a youngster, I used to like writing JavaScript to try and scrape content from webpages in whatever weird way my uneducated little brain could manage. Now I’m a big boy, I prefer to do it nicely, with HTTP requests and XML parsing. But as you will see in part II of the post, I’m still not afraid to get a little rough, and go old school on pages that won’t be obedient. But here, we can be nice and romantic like.

PubMed has an API. By the looks of things it’s pretty old and doesn’t include many of their newer features, but for showing off your CV it’ll work perfectly.

The real meat of the script is in two functions. The first is search_pubmed. We load up a URL with a set of parameters, most importantly the term parameter, which is what we actually search for. Your name should suffice. Because usehistory is set to 'y', this returns some XML that contains the keys to the results: the “WebEnv” string and the “QueryKey”. We attach these to our params, call the efetch interface, and we get a lovely XML document containing the results.

import urllib  # Python 2: urlencode() and urlopen() live directly in urllib
import xml.etree.ElementTree as ET

def search_pubmed(term):
    # Parameters for esearch; 'usehistory' asks the server to keep the results
    params = {
        'db': 'pubmed',
        'tool': 'test',
        'email': 'test@test.com',
        'term': term,
        'usehistory': 'y',
        'retmax': 20
        }
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' + urllib.urlencode(params)

    tree = ET.fromstring(urllib.urlopen(url).read())

    # Carry the history keys over to efetch and ask for XML back
    params['query_key'] = tree.find("./QueryKey").text
    params['WebEnv'] = tree.find("./WebEnv").text
    params['retmode'] = 'xml'

    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + urllib.urlencode(params)
    data = urllib.urlopen(url).read()

    return data
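
If you are on Python 3, urlencode and urlopen have moved to urllib.parse and urllib.request. Here is a rough sketch of the same two steps that never touches the network, so you can see what gets built. The helper names build_url and read_history_keys, the search term, and the sample response are all made up for illustration and are not part of the original script. The only things taken from search_pubmed are the endpoint names, the parameters, and the QueryKey and WebEnv tags.

```python
import urllib.parse
import xml.etree.ElementTree as ET

BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_url(endpoint, params):
    """Return the full E-utilities URL for an endpoint such as 'esearch.fcgi'."""
    return BASE + endpoint + '?' + urllib.parse.urlencode(params)


def read_history_keys(esearch_xml):
    """Pull the QueryKey and WebEnv strings out of an esearch response."""
    tree = ET.fromstring(esearch_xml)
    return tree.find('./QueryKey').text, tree.find('./WebEnv').text


# Hypothetical, heavily trimmed esearch response, for illustration only.
SAMPLE = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</eSearchResult>"""

# Step 1: the search URL (the term here is a made-up example).
params = {'db': 'pubmed', 'term': 'Smith J[Author]', 'usehistory': 'y', 'retmax': 20}
search_url = build_url('esearch.fcgi', params)

# Step 2: reuse the history keys from the response to build the fetch URL.
query_key, web_env = read_history_keys(SAMPLE)
fetch_params = {'db': 'pubmed', 'query_key': query_key, 'WebEnv': web_env, 'retmode': 'xml'}
fetch_url = build_url('efetch.fcgi', fetch_params)
```

In a real run you would pass search_url to urllib.request.urlopen, feed the response to read_history_keys instead of SAMPLE, and then open fetch_url.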

The second meaty function, xml_to_papers(), chews up the XML, puts the appropriate bits into a dictionary, and appends each dictionary to a list, which it then returns. I use the xml.etree.ElementTree module. In my opinion, it is not half as nice as the BeautifulSoup module. However, for some reason I cannot get my server to run custom modules when it serves HTML. I used a dictionary instead of making a custom class for one simple reason: a dictionary would work. Why write more code than you have to?

What am I doing? Each returned article is contained between <PubmedArticle><MedlineCitation> tags. I break each of those up with a tree.findall() and then iterate over them. Often I pass these sections of XML to helper functions, because PubMed returns different XML tags depending on the data. For example, <MedlinePgn> is not always there, so I need to check that it exists before I try to get the text it contains.

def xml_to_papers(data):
    tree = ET.fromstring(data)
    # One MedlineCitation element per returned article
    articles = tree.findall("./PubmedArticle/MedlineCitation")

    papers = []
    for article in articles:
        paper = dict()
        paper["journal_name"] = article.find("./Article/Journal/ISOAbbreviation").text
        paper["title"] = article.find("./Article/ArticleTitle").text
        # digest_* and checkXML are helper functions from the full script
        paper["authors"] = digest_authors(article.findall("./Article/AuthorList/Author"))
        paper["issue"] = digest_issue(article.find("./Article/Journal/JournalIssue"))
        paper["year"] = digest_year(article)
        # These tags are not always present, so check before reading them
        paper["page_num"] = checkXML(article, "./Article/Pagination/MedlinePgn")
        paper["pmid"] = article.find("./PMID").text
        paper["doi"] = checkXML(article, "./Article/ELocationID")

        papers.append(paper)

    return papers
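
The checkXML helper lives in the full script and is not shown in this post, so its real body may well differ. Here is a minimal sketch of what such a helper might look like, written for Python 3 and assuming it hands back an empty string when the tag is missing. The sample XML is invented for the example. It also shows one optional refinement that is not in the original: an article can carry more than one ELocationID (the EIdType attribute is either "pii" or "doi"), and ElementTree's attribute predicate lets you ask for the DOI specifically.

```python
import xml.etree.ElementTree as ET


def checkXML(element, path):
    """Return the text at `path` under `element`, or '' if the tag is absent."""
    found = element.find(path)
    if found is None or found.text is None:
        return ""
    return found.text


# Hypothetical, trimmed MedlineCitation with no MedlinePgn tag.
SAMPLE = """<MedlineCitation>
  <PMID>12345678</PMID>
  <Article>
    <ArticleTitle>An example title.</ArticleTitle>
    <ELocationID EIdType="pii">S0000-0000(00)00000-0</ELocationID>
    <ELocationID EIdType="doi">10.1000/example</ELocationID>
  </Article>
</MedlineCitation>"""

article = ET.fromstring(SAMPLE)

# Missing tag: no exception, just an empty string.
page_num = checkXML(article, "./Article/Pagination/MedlinePgn")

# A plain path returns the first ELocationID, which here is the pii.
first_id = checkXML(article, "./Article/ELocationID")

# The attribute predicate picks out the DOI regardless of order.
doi = checkXML(article, "./Article/ELocationID[@EIdType='doi']")
```

Calling .text directly on the result of find(), as the title and journal lines in xml_to_papers do, raises an AttributeError when the tag is absent. That is why the optional fields go through the helper.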

Finally, I iterate over the papers list and print out appropriate HTML, a step you may or may not want. The output looks like this and the full script is available here.
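
That printing step is not shown in this post, so here is a rough Python 3 sketch of how it could go, not the code from the full script. The names to_html_text and paper_to_html, the layout of the HTML, and the sample paper are all assumptions. It also assumes paper["authors"] is a list of strings and that the year and issue are already strings. Non-ASCII characters are written as numeric entities so the page displays correctly whatever encoding the server declares.

```python
import html


def to_html_text(text):
    """Escape &, < and >, then write non-ASCII characters as numeric entities."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def paper_to_html(paper):
    """Format one paper dictionary as a short block of HTML."""
    authors = ", ".join(to_html_text(name) for name in paper["authors"])
    link = "http://www.ncbi.nlm.nih.gov/pubmed/" + paper["pmid"]
    citation = "{0} {1}, {2}:{3}.".format(
        paper["journal_name"], paper["year"], paper["issue"], paper["page_num"])
    return "\n".join([
        "<p>" + authors + ".<br>",
        '<a href="' + link + '">' + to_html_text(paper["title"]) + "</a><br>",
        to_html_text(citation) + "</p>",
    ])


# Hypothetical paper, shaped like the dictionaries xml_to_papers() builds.
sample = {
    "journal_name": "J. Example Res.",
    "title": "A study of β-blockers & friends.",
    "authors": ["J Müller", "A Smith"],
    "issue": "1(2)",
    "year": "2013",
    "page_num": "10-20",
    "pmid": "12345678",
    "doi": "",
}

html_block = paper_to_html(sample)
print(html_block)
```

If wherever you print the text can already handle unicode, you can drop the entity step and keep only html.escape.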

6 thoughts on “Interacting with PubMed. Part I”

  1. Thanks for sharing the code. I’m a python newbie and have slightly modified your code to take the contents of the clipboard (PMID or part of a title) and return the reference. It works in most cases using Python 2 in PyCharm. However, is there a way of getting the code to parse unicode characters that are sometimes found in the author list or the title of a paper?

    • Undoubtedly there is! Though I’m not 100% sure of what the problem you’re having is. Could you strip your code down to a simplified version? However, in general, if you…
      a = u'\ua000abcd\u07b4' # a is a unicode object
      print a
      >> ꀀabcd޴
      # What you -want- to do is say print a.encode('ascii'). But you literally can't encode some chars in ascii!
      # So what do you do? You can ignore non ascii chars
      print a.encode('ascii','ignore')
      >> abcd

      Hope that helps.

      • Thanks for the reply and the ascii unicode explanation. Here is the Gist of the code that I am using https://gist.github.com/ejmurray/0d5529175bc3b14e87b3. If you copy the PMID “19074528” to the clipboard and run the script it returns the correct reference. When you repeat it using a PMID where the authors’ names contain accents it doesn’t return the same name (PMID = 19238614). An error is returned if there is a non-ascii character in the title (PMID = 23043183). Last, but not least, I’m trying to get the doi which in the XML is in either the ELocationID tag or ArticleIDList. I’ve had a go at pulling it out with no success. You can tell that I’m still new to Python.

        • I’m on it. I don’t get “errors” per se with any of the PMIDs, but yes, I certainly see the issue with the non ascii 19238614. However, 23043183 seems to return perfectly for me:

          AW Harrell, SK Siederer, J Bal, NH Patel, GC Young, CC Felgate, SJ Pearce, AD Roberts, C Beaumont, AJ Emmons, AI Pereira, RD Kempsford.
          Metabolism and disposition of vilanterol, a long-acting β(2)-adrenoceptor agonist for inhalation use in humans.
          http://www.ncbi.nlm.nih.gov/pubmed/23043183
          Drug Metab. Dispos. 2013, 41(1):89-100.

          I’ve been very matlab-centric for the last few months, so it’ll take me a second to warm up my Python brain… but I should have a solution soon.

        • Right so the issue with 19238614 is all in the escapeToHTML function. It is specifically designed so that unicode text can be printed to HTML. Where do you want your text spit out? If that place can handle unicode without error, then simply remove all calls to escapeToHTML. i.e. Change line 125 to

          authorlist += paper["authors"][a]

          The DOI is

          paper["doi"] = article.find("./Article/ELocationID").text

          Hope that helps!

  2. Thanks for taking time to help me with fixing the problems. The DOI and author names are fine now. The final issue with PMID = 23043183 was that when the console window in PyCharm was trying to return the β symbol in the title an error would be displayed. This problem was fixed by updating the config file and getting the output to allow utf-8 encoding – https://goo.gl/TGKtPc.

    Again many thanks.
