CORE Dataset

The data aggregated from repositories by the CORE system can be accessed in two ways: through the CORE API or by downloading the data to your computer. The API is the practical choice if you want to build a service on top of CORE, while downloading is what we recommend if you want to analyse the CORE dataset and possibly run computationally intensive batch processes over it.

Structure of the dumps

The CORE dataset provides access to both the enriched metadata and the full texts. The data dump consists of two files, the metadata file and the content file. Both files are tar archives compressed with gzip.
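Because both files are gzip-compressed tar archives, they can be read as a stream without unpacking them to disk first. The following is a minimal sketch using Python's standard tarfile module. The helper name iter_archive_members is hypothetical, and the layout of the files inside the archives is not described here, so the sketch simply yields every regular file it finds. The demo builds a tiny in-memory archive rather than assuming a real file name.

```python
import io
import tarfile


def iter_archive_members(fileobj):
    """Yield (member_name, binary_file_object) for each regular file
    in a gzip-compressed tar stream, reading sequentially."""
    # "r|gz" opens the archive in streaming mode: members are visited
    # in order and the archive is never fully loaded or extracted.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                extracted = archive.extractfile(member)
                if extracted is not None:
                    # Read this object before asking for the next member.
                    yield member.name, extracted


def build_demo_archive():
    """Create a small in-memory .tar.gz so the sketch can run anywhere."""
    payload = b'{"identifier": 1}\n'
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo("demo.json")  # hypothetical member name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


for name, handle in iter_archive_members(build_demo_archive()):
    print(name, handle.read())
```

With a downloaded dump, the same function could be given a file opened in binary mode, for example open(path, "rb"), where path is wherever you saved the archive.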

The structure of the metadata file is depicted in the diagram. An example of a metadata item in the dataset is as follows:

    {
        "identifier": 13291,
        "ep:Repository": 1,
        "dc:type": [],
        "bibo:shortTitle": "Evaluating stillbirths : improving stillbirth data could help make stillbirths a visible public health priority",
        "bibo:AuthorList": [
            "Population Reference Bureau"
        ],
        "dc:date": "2007-02",
        "bibo:cites": [
            {
                "rawReferenceText": "Cynthia Stanton. Stillbirth Rates: Delivering Estimates",
                "authors": [],
                "bibo:shortTitle": "Stillbirth Rates: Delivering Estimates",
                "doi": "10.1016/S0140-6736(06)68586-3"
            }
        ],
        "bibo:citedBy": [],
        "similarities": [
            {
                "identifier": 29886,
                "sim:weight": 0.333121,
                "sim:AssociationMethod": "similarity_cosine"
            },
            {
                "identifier": 33044,
                "sim:weight": 0.325861,
                "sim:AssociationMethod": "similarity_cosine"
            },
            {
                "identifier": 43755,
                "sim:weight": 0.173635,
                "sim:AssociationMethod": "similarity_cosine"
            }
        ]
    }
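As an illustration of how a metadata item like the one above might be used, the sketch below ranks the entries under "similarities" by their "sim:weight". It assumes that "similarities" is a list of objects, each carrying the three keys shown in the example; the function name top_similar and its parameter k are hypothetical and not part of the dataset.

```python
def top_similar(record, k=2):
    """Return the k most similar items as (identifier, weight) pairs,
    ordered from highest to lowest sim:weight."""
    # Assumption: "similarities" is a list of dicts; a missing key is
    # treated as "no similar items" rather than as an error.
    similarities = record.get("similarities", [])
    ranked = sorted(similarities, key=lambda s: s["sim:weight"], reverse=True)
    return [(s["identifier"], s["sim:weight"]) for s in ranked[:k]]


# Values copied from the example metadata item.
example_record = {
    "identifier": 13291,
    "similarities": [
        {"identifier": 43755, "sim:weight": 0.173635,
         "sim:AssociationMethod": "similarity_cosine"},
        {"identifier": 29886, "sim:weight": 0.333121,
         "sim:AssociationMethod": "similarity_cosine"},
        {"identifier": 33044, "sim:weight": 0.325861,
         "sim:AssociationMethod": "similarity_cosine"},
    ],
}

print(top_similar(example_record))
```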

The content file has the following structure:

{"identifier":612,"fullTextSource":"Here goes the fulltext ..."}
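A content record of this shape can be parsed with any JSON library. The sketch below assumes one JSON object per line, which the example suggests but does not guarantee, and the helper name iter_content_records is hypothetical. It accepts any iterable of lines, so it works equally on a list of strings or on a file object opened from the unpacked dump.

```python
import json


def iter_content_records(lines):
    """Yield (identifier, full_text) for each JSON record in `lines`."""
    for raw in lines:
        # File objects opened in binary mode yield bytes.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        yield record["identifier"], record.get("fullTextSource", "")


# The example record from the text above.
sample_lines = ['{"identifier":612,"fullTextSource":"Here goes the fulltext ..."}']

for identifier, full_text in iter_content_records(sample_lines):
    print(identifier, full_text)
```

Since the identifier matches the "identifier" field of a metadata item, the pairs produced here could be joined with the metadata, for example by collecting them into a dictionary keyed by identifier.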


This dataset has been created from information that was publicly available on the Internet. Every effort has been made to ensure that it contains only open access content. We have included only content from repositories and journals listed in registries whose condition for inclusion is the provision of content under an open access compatible license. However, because metadata are often inconsistent, license information is often not machine readable, and repositories occasionally leak content that is not open access, we cannot take any responsibility for the license of the content in the dataset. It is therefore up to the user of this dataset to ensure that their use of it does not breach copyright. The dataset is intended for machine processing only, not for reading the original publications.


Dump 2015-09
Metadata file (4.5 GB)
Content file (30.5 GB)

Dataset track of DL2014

Dump 2014-06-13 (used for dataset track of DL2014)
Metadata file (3.7 GB)
Content file (24 GB)

Older versions

Dump 2013-12-15
Metadata file (1.7 GB)

Dump 2013-04-12
Metadata file (181 MB)
Metadata file as RDF (835 MB)