
PubMed parses

Syntactic parses and named entity recognition for PubMed abstracts and PubMed Central full documents

This page documents a release of syntactic parses and named entity recognition (NER) output for PubMed and PubMed Central Open Access documents. The processing pipeline is described in detail in the paper Syntactic analyses and named entity recognition for PubMed and PubMed Central — up-to-the-minute. Available analyses:

Sentence splitting
Tokenization
Part-of-speech tagging
Syntactic parsing (constituent and dependency)
Named entity recognition for genes and gene products (proteins), chemicals, organisms, cell lines and diseases.

News

13.3.2019 Server issues fixed, weekly increments are online again.

12.9.2016 Disconnection and data corruption issues with files hosted in the IDA service have been resolved.

Download

The data set consists of four baseline files and periodically added incremental updates. The PubMed abstracts and PMC full documents are distributed in separate files. All files are in .tar.gz format. For both PubMed and PMC, the larger baseline file contains all documents up to the end of 2014, whereas the smaller one contains publications from 2015.

The weekly updates contain documents indexed after the previous update but before the date given in the file name. Because of occasional errors in the processing pipeline or changes in the PubMed FTP services, some weeks also have *_extra* files containing documents that were missed during the original processing.

The baseline files are downloadable from the IDA service:

http://avaa.tdata.fi/openida/dl.jsp?pid=urn:nbn:fi:csc-ida-1x201606212015015490718s (PubMed, -2014, 117GB)
http://avaa.tdata.fi/openida/dl.jsp?pid=urn:nbn:fi:csc-ida-10x201606212015014996728s (PubMed, 2015, 7GB)
http://avaa.tdata.fi/openida/dl.jsp?pid=urn:nbn:fi:csc-ida-9x201609092015017434623s (PMC, -2014, 127GB)
http://avaa.tdata.fi/openida/dl.jsp?pid=urn:nbn:fi:csc-ida-9x201606222015015751883s (PMC, 2015, 62GB)

Weekly incremental updates are available at http://dl.turkunlp.org/pubmed_parses/
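Because every file is a .tar.gz archive and the baseline files exceed 100GB, it can be convenient to read an archive as a stream instead of unpacking it to disk. The following is a minimal sketch of that idea using Python's standard tarfile module. The function name iter_archive_members is hypothetical, and the sketch assumes only that the archive contains regular files; the internal layout of the archives is not described on this page, so adjust the handling of each member to what you actually find inside.

```python
import tarfile


def iter_archive_members(archive_path):
    """Yield (member_name, file_object) for each regular file in a .tar.gz archive.

    Hypothetical helper: the layout of the released archives is an assumption,
    so this only iterates over whatever regular files are present.
    """
    # "r|gz" opens the archive as a forward-only stream, so nothing is
    # written to disk and members are read one at a time.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            fileobj = archive.extractfile(member)
            if fileobj is not None:
                yield member.name, fileobj
```

In stream mode each file object has to be consumed before the loop advances to the next member, so read or parse it inside the loop body instead of collecting the file objects for later use.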

Data format

The data set is released in the XML format used by the Turku Event Extraction System (TEES, http://jbjorne.github.io/TEES/). The format is described in the TEES documentation at https://github.com/jbjorne/TEES/wiki/Interaction-XML . Each document element has an origId attribute, which holds the PubMed or PMC identifier of the document.

The named entities are not stored as entity elements, as shown in the TEES documentation, but as evex_entity elements. Each evex_entity element has an entity_type attribute, which gives the entity class. It should not be confused with the type attribute, which TEES requires for its internal functionality.
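As an illustration of the attributes described above, the following sketch collects the entity class of every evex_entity element together with the origId of the enclosing document. It uses only Python's standard xml.etree.ElementTree module. The function name iter_entities is hypothetical, and the sketch assumes that evex_entity elements carry a text attribute in the same way as the entity elements in the TEES documentation; if they do not, the third value is simply None.

```python
import xml.etree.ElementTree as ET


def iter_entities(xml_source):
    """Yield (origId, entity_type, text) for each evex_entity in an XML file.

    xml_source is a file name or a binary file object. The "text" attribute
    is an assumption based on the TEES entity element.
    """
    current_doc_id = None
    # iterparse processes the file incrementally, which matters for large inputs.
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start" and elem.tag == "document":
            # Attributes are already available when the start tag is seen.
            current_doc_id = elem.get("origId")
        elif event == "end" and elem.tag == "evex_entity":
            # Use entity_type (the entity class), not the TEES-internal type.
            yield current_doc_id, elem.get("entity_type"), elem.get("text")
        elif event == "end" and elem.tag == "document":
            elem.clear()  # release the finished document's children
```

The function accepts any binary file object, so it can be applied directly to XML content read from an archive as well as to a file on disk.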

Contact and citation information

Feel free to contact Kai Hakala or Suwisa Kaewphan (first.last@utu.fi) if you have any issues, feedback or requests.

If you use the data, please cite our paper: Syntactic analyses and named entity recognition for PubMed and PubMed Central — up-to-the-minute. Proceedings of the 2016 Workshop on Biomedical Natural Language Processing, 2016.