{"id":44,"date":"2014-10-13T16:07:48","date_gmt":"2014-10-13T15:07:48","guid":{"rendered":"http:\/\/www.billconnelly.net\/?p=44"},"modified":"2025-07-05T04:15:34","modified_gmt":"2025-07-05T03:15:34","slug":"interacting-with-pubmed-part-i","status":"publish","type":"post","link":"https:\/\/www.billconnelly.net\/?p=44","title":{"rendered":"Interacting with PubMed. Part I"},"content":{"rendered":"<p>As a scientist, your life&#8217;s work is your publication list. I like to be intimate with mine. Sometimes I just stare at it. I&#8217;d buy it a glass of wine if I could. Maybe even caress it softly. Sure, she ain&#8217;t much to look at, but she&#8217;s mine, and I want to show her off. And if you want a job, you&#8217;re going to want to show yours off too. So I&#8217;m going to show you how to scrape your publication list from PubMed with Python.<!--more--><\/p>\n<p>When I was a youngster, I used to like writing JavaScript to try and scrape content from webpages in whatever weird way my uneducated little brain could manage. Now that I&#8217;m a big boy, I prefer to do it nicely, with HTTP requests and XML parsing. But as you will see in part II of this post, I&#8217;m still not afraid to get a little rough, and go old school on pages that won&#8217;t be obedient. But here, we can be nice and romantic like.<\/p>\n<p>PubMed has an <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/books\/NBK25501\/\" target=\"_blank\" rel=\"noopener\">API<\/a>. By the looks of things it&#8217;s pretty old, and it doesn&#8217;t include many of PubMed&#8217;s newer features, but for showing off your CV it&#8217;ll work perfectly.<\/p>\n<p>The real meat of the script is in two functions. The first is <code>search_pubmed<\/code>. We load up a URL with a set of parameters, most importantly the <code>term<\/code> parameter, which is what we actually search for. Your name should suffice. This returns some XML that contains the key to the results: the &#8220;WebEnv&#8221; string. 
We attach this to our params, use the efetch interface, and we get a lovely XML document containing the results.<\/p>\n<pre class=\"prettyprint linenums\">import urllib  # Python 2: provides urlopen and urlencode\r\nimport xml.etree.ElementTree as ET\r\n\r\ndef search_pubmed(term):\r\n    params = {\r\n        &#39;db&#39;: &#39;pubmed&#39;,\r\n        &#39;tool&#39;: &#39;test&#39;,\r\n        &#39;email&#39;: &#39;test@test.com&#39;,\r\n        &#39;term&#39;: term,\r\n        &#39;usehistory&#39;: &#39;y&#39;,\r\n        &#39;retmax&#39;: 20\r\n        }\r\n    url = &#39;http:&#47;&#47;eutils.ncbi.nlm.nih.gov&#47;entrez&#47;eutils&#47;esearch.fcgi?&#39; + urllib.urlencode(params)\r\n\r\n    tree = ET.fromstring(urllib.urlopen(url).read())\r\n\r\n    params&#91;&#39;query_key&#39;&#93; = tree.find(&#34;.&#47;QueryKey&#34;).text\r\n    params&#91;&#39;WebEnv&#39;&#93; = tree.find(&#34;.&#47;WebEnv&#34;).text\r\n    params&#91;&#39;retmode&#39;&#93; = &#39;xml&#39;\r\n\r\n    url = &#39;http:&#47;&#47;eutils.ncbi.nlm.nih.gov&#47;entrez&#47;eutils&#47;efetch.fcgi?&#39; + urllib.urlencode(params)\r\n    data = urllib.urlopen(url).read()\r\n\r\n    return data\r\n<\/pre>\n<p>The second meaty function, <code>xml_to_papers()<\/code>, chews up the XML, puts the appropriate bits into a dictionary, and appends each dictionary to a list, which it then returns. I use the <code>xml.etree.ElementTree<\/code> module. In my opinion, it is not half as nice as the <code>BeautifulSoup<\/code> module. However, for some reason I cannot get my server to run custom modules when it serves HTML. I used a dictionary instead of making a custom class, for one simple reason: a dictionary would work. Why write more code than you have to?<\/p>\n<p>What am I doing? Each returned article is contained within <code>&lt;PubmedArticle&gt;&lt;MedlineCitation&gt;<\/code> tags. I break each of those up with <code>tree.findall()<\/code> and then iterate over them. 
Often I pass these sections of XML to helper functions, because PubMed returns different XML tags depending on the data available. For example, <code>&lt;MedlinePgn&gt;<\/code> is not always there, so I need to check that it exists before I try to get the text it contains.<\/p>\n<pre class=\"prettyprint linenums\">def xml_to_papers(data):\r\n    tree = ET.fromstring(data)\r\n    articles = tree.findall(&#34;.&#47;PubmedArticle&#47;MedlineCitation&#34;)\r\n\r\n    papers = &#91;&#93;\r\n    for article in articles:\r\n        paper = dict()\r\n        paper&#91;&#34;journal_name&#34;&#93; = article.find(&#34;.&#47;Article&#47;Journal&#47;ISOAbbreviation&#34;).text\r\n        paper&#91;&#34;title&#34;&#93; = article.find(&#34;.&#47;Article&#47;ArticleTitle&#34;).text\r\n        paper&#91;&#34;authors&#34;&#93; = digest_authors(article.findall(&#34;.&#47;Article&#47;AuthorList&#47;Author&#34;))\r\n        paper&#91;&#34;issue&#34;&#93; = digest_issue(article.find(&#34;.&#47;Article&#47;Journal&#47;JournalIssue&#34;))\r\n        paper&#91;&#34;year&#34;&#93; = digest_year(article)\r\n        paper&#91;&#34;page_num&#34;&#93; = checkXML(article, &#34;.&#47;Article&#47;Pagination&#47;MedlinePgn&#34;)\r\n        paper&#91;&#34;pmid&#34;&#93; = article.find(&#34;.&#47;PMID&#34;).text\r\n        paper&#91;&#34;doi&#34;&#93; = checkXML(article, &#34;.&#47;Article&#47;ELocationID&#34;)\r\n\r\n        papers.append(paper)\r\n\r\n    return papers\r\n<\/pre>\n<p>Finally, I iterate over the papers list and print out the appropriate HTML, a step you may or may not want. The output looks like <a href=\"http:\/\/www.billconnelly.net\/?page_id=21\" target=\"_blank\" rel=\"noopener\">this<\/a> and the full script is available <a href=\"http:\/\/www.billconnelly.net\/scripts\/cv-gen.txt\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As a scientist, your life&#8217;s work is your publication list. I like to be intimate with mine. 
Sometimes I just stare at it. I&#8217;d buy it a glass of wine if I could. Maybe even caress it softly. Sure, she ain&#8217;t much to look at, but she&#8217;s mine, and I want to show her off.&hellip;<a href=\"https:\/\/www.billconnelly.net\/?p=44\">Read more <span class=\"screen-reader-text\">Interacting with PubMed. Part I<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[3,6],"tags":[],"_links":{"self":[{"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=\/wp\/v2\/posts\/44"}],"collection":[{"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=44"}],"version-history":[{"count":11,"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=\/wp\/v2\/posts\/44\/revisions"}],"predecessor-version":[{"id":816,"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=\/wp\/v2\/posts\/44\/revisions\/816"}],"wp:attachment":[{"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=44"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=44"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.billconnelly.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=44"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}