CORE Dataset

How to download the Dataset

We recommend downloading the files with curl or wget:

wget -c (url of the data dump ending in .tar.gz)
curl -L -O -C - (url of the data dump ending in .tar.gz)

For bigger datasets, you might want to consider using dedicated tools such as aria2 https://aria2.github.io/.

In the curl command, note the lone hyphen after -C and before the URL: -C - tells curl to work out automatically where to resume the transfer.

If the download is halted, you can resume downloading by re-running the same command.
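For readers who script their downloads, the resume behaviour described above can be approximated in Python with an HTTP Range request. This is only a sketch: the function names are hypothetical, and it assumes the server hosting the dump honours Range headers, which this page does not state.

```python
import os
import shutil
import urllib.request


def build_resume_request(url: str, dest: str) -> urllib.request.Request:
    """Build a GET request that asks only for the bytes not yet in dest."""
    request = urllib.request.Request(url)
    already = os.path.getsize(dest) if os.path.exists(dest) else 0
    if already > 0:
        # Ask the server to start at the first byte we do not have yet.
        request.add_header("Range", f"bytes={already}-")
    return request


def download_with_resume(url: str, dest: str) -> None:
    """Download url to dest, appending when the server resumes the transfer.

    If dest is already complete, a server may answer 416 and urlopen
    raises urllib.error.HTTPError; that case is not handled here.
    """
    request = build_resume_request(url, dest)
    with urllib.request.urlopen(request) as response:
        # 206 means partial content (resume); anything else is a full body.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            shutil.copyfileobj(response, out, length=1024 * 1024)
```

Calling download_with_resume again with the same arguments after an interruption continues from the size of the partial file, which mirrors re-running the wget or curl command.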

Reminder:

CORE’s data changes all the time, and a Dataset download is complete only as of the moment it was generated. Each Dataset download is therefore a snapshot in time from the date it was created and may not contain the most recent data.

Latest datasets:

Dataset 2023

680 GB - compressed
~4.8 TB - extracted

Full dataset (Full text & metadata).

md5: 5f9688c52a92d0225cc064354d02ea04

Note: Please read the section "Structure of datasets" below for more information.
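The md5 value published for a dataset can be compared with the checksum of the downloaded file. The following is a minimal Python sketch; the function names and the file name in the usage note are hypothetical. The file is read in chunks because the archives are far too large to load into memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 with a published value, ignoring case and spaces."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, matches_published_md5("core_2023.tar.gz", "5f9688c52a92d0225cc064354d02ea04") would return True only if the local file (the name here is an assumption) matches the value listed above.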

Other datasets:

CORE-MAG mapping 2019-04-01

CORE Dataset to Microsoft Academic Graph mapping

173 MB - in total
80 MB - compressed
8.9M - matches

md5: 9215a3f6a91b54bc276b50605fae2ccf

License: Open Data Commons Attribution (ODC-By) license.

Note: Please read the section Structure of datasets below for more information.

Deduplication Dataset 2020

Created for "Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings" (LREC 2020)

204 MB - in total
62 MB - compressed

md5: 9215a3f6a91b54bc276b50605fae2ccf

License: Open Data Commons Attribution (ODC-By) license.

Note: Please read the section Structure of datasets below for more information.

Archived datasets:

Dataset 2022

Full dataset (Full text & metadata)

393 GB - compressed
3.5 TB - extracted

Please read the section Structure of datasets below for more information.

Dataset 2020-03-18

Full dataset (Full text & metadata)

400 GB - compressed
2.1 TB - extracted

md5: 9215a3f6a91b54bc276b50605fae2ccf

Note: Please read the section Structure of datasets below for more information.

Dataset 2018-03-01

Full text only

330 GB - in total

md5: 9215a3f6a91b54bc276b50605fae2ccf

License: Open Data Commons Attribution (ODC-By) license.

Note: Please read the section Structure of datasets below for more information.

Dataset 2018-03-01

Metadata only

127 GB - in total

md5: f70fa05dbb484bf85444e8bc2d5c4319

License: Open Data Commons Attribution (ODC-By) license.

Note: Please read the section Structure of datasets below for more information.

Dataset 2017-11-01

Full text dataset

157.38 GB - in total

md5: bbb9e83f80ceeaf44baf167b7928fe47

License: Open Data Commons Attribution (ODC-By) license.

Note: Please read the section Structure of datasets below for more information.

Dataset 2017-11-01

Metadata only dataset

22.65 GB - in total

md5: f70fa05dbb484bf85444e8bc2d5c4319

License: Open Data Commons Attribution (ODC-By) license.

Note: Please read the section Structure of datasets below for more information.

Structure of datasets

Structure of dataset 2020

The CORE dump implements the approach of the ResourceSync Framework Resource Dump standard.

Note that this is an extremely large file (∼395 GB), and appropriate tools are necessary for downloading it. Once extracted, it occupies about 2.1 TB of disk space.

Perform the extraction by running:

tar -xf resync_dump.tar.xz -C /target/directory

The previous step extracts the big archive into multiple smaller archives. Each archive contains all the resources for a specific CORE data provider; the full list of providers can be found on our data providers page.

The following script extracts every archive into the appropriate folder.

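#!/bin/bash

# Extract each per-provider archive into its own folder under output/.
for FILE in tmp/*.tar.xz
do
        # Strip the directory and the .tar.xz suffix to get the provider ID.
        PROVIDER="$(basename "$FILE" .tar.xz)"
        echo "$PROVIDER"
        echo "$FILE"
        mkdir -p "output/$PROVIDER"
        tar xf "$FILE" -C "output/$PROVIDER/"
done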

In the script, PROVIDER stands for the ID of each archive, derived from its file name.

The folder extracted for each provider has a two-level-deep file structure and includes a manifest file named manifest.xml in its root, which lists the available resources. Below is the format of a single entry in the manifest:

<url>
  <loc>https://core.ac.uk/api-v2/articles/get/132196135</loc>
  <rs:md
    hash="md5:39127e4b3b76fc5a66f3eabee28ab71f"
    length="3759"
    type="application/json"
    path="/182/a2/132196135.json"
  />
</url>

The URL inside the <loc></loc> tags is the ID of the file and can be used for tracking future updates to the resource. The path attribute gives the file's location in the folder structure. To validate the file, an MD5 checksum and the file size are also provided.
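The manifest can be used to validate extracted files. The sketch below reads the entries and checks each file's size and MD5. It assumes that the path attribute is relative to the folder containing manifest.xml and that the real manifest declares the rs: namespace prefix; element names are matched by local name so the exact namespace URIs do not matter. The function names are hypothetical.

```python
import hashlib
import os
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def read_manifest(manifest_path: str):
    """Yield (loc, path, md5, length) for every <url> entry in the manifest."""
    for element in ET.parse(manifest_path).iter():
        if local_name(element.tag) != "url":
            continue
        loc, meta = None, None
        for child in element:
            name = local_name(child.tag)
            if name == "loc":
                loc = (child.text or "").strip()
            elif name == "md":
                meta = child.attrib
        if meta is None:
            continue
        # The hash attribute looks like "md5:<hex digest>".
        md5 = meta.get("hash", "").split(":", 1)[-1]
        yield loc, meta.get("path", ""), md5, int(meta.get("length", "0"))


def verify_entry(root_dir: str, path: str, md5: str, length: int) -> bool:
    """Check that the file exists and matches the listed size and MD5."""
    full_path = os.path.join(root_dir, path.lstrip("/"))
    if not os.path.isfile(full_path) or os.path.getsize(full_path) != length:
        return False
    with open(full_path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest() == md5
```

Each record is read fully into memory for hashing, which is reasonable for individual JSON files of a few kilobytes such as the entry shown above.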

This is a sample data structure from the Dataset:

{
  "doi": DOI,
  "coreId": "228783",
  "oai": OAI_IDENTIFIER,
  "identifiers": [ADDITIONAL IDENTIFIERS],
  "title": "TITLE",
  "authors": ["AUTHOR1", "AUTHOR2"],
  "enrichments": {
    "references": [REFERENCES],
    "documentType": {
      "type": "RESEARCH|THESIS|PRESENTATION",
      "confidence": CONFIDENCE
    },
    "citationCount": COUNT
  },
  "contributors": [CONTRIBUTORS],
  "datePublished": "DATE OR YEAR",
  "abstract": "ABSTRACT",
  "downloadUrl": DOWNLOAD URL IF AVAILABLE,
  "fullTextIdentifier": FULL TEXT ID IF AVAILABLE,
  "pdfHashValue": HASH OF THE PDF IF AVAILABLE,
  "publisher": PUBLISHER,
  "rawRecordXml": "XML RECORD",
  "journals": [JOURNALS],
  "language": {
    "code": "COUNTRY CODE",
    "name": "LANGUAGE NAME",
    "id": ID
  },
  "relations": ["URLs WITH RELATIONS"],
  "year": PUBLICATION YEAR,
  "topics": ["TOPIC1","TOPIC2" ],
  "subjects": ["SUBJECT1", "SUBJECT2"],
  "issn": "ISSN-IDENTIFIER",
  "fullText": "FULL TEXT"
}

Fields description

Field name	Description
doi	[Digital Object Identifier](https://www.doi.org). A persistent and unique identifier for the document. This data is collected from the data provider or discovered by CORE's enrichment processes using [Crossref](https://crossref.org) and other DOI collections.
coreId	The persistent identifier of a document in the CORE infrastructure.
oai	The identifier of a resource harvested from a repository. It usually contains a static part identifying the data provider and a variable part identifying the single record. It is generated by the data provider using the OAI-PMH protocol; if the data provider does not use OAI-PMH, CORE generates one for the record.
identifiers	A list of identifiers for the document; it might contain URLs, PMC IDs, DOIs, etc. This information is collected from the data provider (`dc.identifier` tag) and enriched by CORE.
title	The title of the document
authors	An array containing the list of authors.
enrichments	This sub-object contains enrichments to the data harvested from the data provider.
enrichments.references	A list of references (other documents) discovered by CORE.
enrichments.documentType	The type of the document. We use a machine learning algorithm to discover the document type; each type also has an associated confidence.
enrichments.citationCount	The count of papers citing the paper. This information is extracted via [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/).
contributors	Matches the `dc.contributors` tag in the Dublin Core metadata format.
datePublished	The date on which the document was published. If the data is not available from the original data provider, CORE will try to discover it using other data sources.
abstract	The abstract of the document
downloadUrl	The url where the full text is available. If the full text is hosted in CORE this will be a CORE url, otherwise it will be a url to a different data source.
fullTextIdentifier	This url is the location where CORE managed to find the hosted full text.
pdfHashValue	A hash value of the PDF, used to validate the integrity of a document and to test for duplicates and changes.
publisher	Coming from `dc.publisher`
rawRecordXml	The raw XML record of the document.
journals	Sub-object containing metadata about the journal in which the record has been published.
language	Language of the record discovered by CORE.
relations	Coming from `dc.relations`
year	Based on the different dates available for the record, this field contains the year in which the document was published. Only the year is used because data quality is variable and many documents lack detailed date information.
topics	Coming from `dc.topic`.
subjects	Coming from `dc.subject`
issn	The ISSN of the journal in which the article was published. This information is extracted from the [Crossref](https://crossref.org) data.
fullText	The text extracted from the hosted full text.

Structure of dataset 2018 onwards

The downloadable tar file contains XZ-compressed files of article metadata. Each XZ-compressed file is named [repositoryID].json.xz. Once decompressed, each line in the text file contains the metadata for one article in JSON.

We chose the xz format because it achieves a better compression ratio than bzip2 or gzip. The downside is that the tools are not always installed by default.
Most Linux distributions have xz available for installation in the default package manager. Mac users can install xz via Brew or MacPorts and there are many other free alternatives. Windows users can use 7-zip. If you have any trouble extracting the files, please contact us.

Please note that each file as a whole is not valid JSON; however, each individual line is. Lines are delimited by a Windows-style newline (\r\n).
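Because each line is a separate JSON document, the files can be streamed without decompressing them to disk first. A minimal Python sketch follows; the function name is hypothetical and UTF-8 encoding is an assumption not stated on this page.

```python
import json
import lzma


def iter_records(xz_path: str):
    """Yield one dict per non-empty line of an XZ-compressed metadata file."""
    # Text mode translates the \r\n line endings; UTF-8 is assumed here.
    with lzma.open(xz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # Parse line by line: the file as a whole is not valid JSON.
                yield json.loads(line)
```

Iterating lazily keeps memory use bounded by the size of a single record, which matters for repositories with many full-text articles.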

The dump structure has changed to the following format:

{
  "doi": str|None,
  "coreId": str|None,
  "oai": str|None,
  "identifiers": [str],
  "title": str|None,
  "authors": [str],
  "enrichments": {
    "references": [str],
    "documentType": {
       "type": str|None,
       "confidence": str|None
    }
  },
  "contributors": [str],
  "datePublished": str|None,
  "abstract": str|None,
  "downloadUrl": str|None,
  "fullTextIdentifier": str|None,
  "pdfHashValue": str|None,
  "publisher": str|None,
  "rawRecordXml": str|None,
  "journals":[str],
  "language": str|None,
  "relations": [str],
  "year": int|None,
  "topics": [str],
  "subjects": [str],
  "fullText": str|None
}
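The type annotations in the format above can be turned into a light sanity check for parsed records. The sketch below is illustrative only: the helper name is hypothetical, missing fields are treated as None or an empty list (an assumption), and the nested enrichments object is left out for brevity.

```python
# Top-level fields listed as str|None in the format above.
OPTIONAL_STR = (
    "doi", "coreId", "oai", "title", "datePublished", "abstract",
    "downloadUrl", "fullTextIdentifier", "pdfHashValue", "publisher",
    "rawRecordXml", "language", "fullText",
)
# Top-level fields listed as [str].
STR_LISTS = (
    "identifiers", "authors", "contributors", "journals",
    "relations", "topics", "subjects",
)


def type_problems(record: dict) -> list:
    """Return the names of top-level fields whose type differs from the format."""
    problems = []
    for field in OPTIONAL_STR:
        value = record.get(field)
        if value is not None and not isinstance(value, str):
            problems.append(field)
    for field in STR_LISTS:
        value = record.get(field, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(field)
    year = record.get("year")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        problems.append("year")
    return problems
```

An empty list means the record's top-level fields agree with the listed types; any returned names point at fields worth inspecting before further processing.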

Structure of dataset 2017

An example of a metadata item in the data set is as follows. The full record has more fields available, and all fields appear in their entirety. New lines and truncated values are used only for this example.

{
  "id": "28929927",
  "authors": [
    "Knoth, Petr",
    "Anastasiou, Lucas",
    "Pearce, Samuel"
  ],
  "datePublished": "2014",
  "deleted": "ALLOWED",
  "description": "Usage statistics are frequently used by repositories [Description field truncated for example]",
  "fullText": "Open Research Online\nThe Open University’s repository of research publications\nand other research outputs\nMy repository is being aggregated: a blessing or a\ncurse?\nConference Item\nHow to [full text field truncated for example]",
  "fullTextIdentifier": "http://oro.open.ac.uk/41678/1/OpenRepositories2014_v2.pdf",
  "identifiers": [
    "oai:oro.open.ac.uk:41678",
    null
  ],
  "rawRecordXml": "\n    \n    \n      oai:oro.open.ac.uk:41678\n      20[rawRecordXml truncated for example]",
  "repositories": [{
    "id": "86",
    "openDoarId": 0,
    "name": "Open Research Online",
      ...
  }],
  "repositoryDocument": {
    "pdfStatus": 1,
    "textStatus": 1,
    "metadataUpdated": 1498862655000,
    "timestamp": 1479481001000,
    "indexed": 1,
    "deletedStatus": "0",
    "pdfSize": 364107,
    "tdmOnly": false
  },
  "title": "My repository is being aggregated: a blessing or a curse?",
  "downloadUrl": "https://core.ac.uk/download/pdf/28929927.pdf",
  ...
}

Structure of dataset pre-2017

The CORE dataset provides access to both the enriched metadata and the full texts. The data dump consists of two files, the metadata file and the content file. Both files are compressed using tar and gzip.

An example of a metadata item in the data set is as follows:

{
  "identifier": 13291,
  "ep:Repository": 1,
  "dc:type": [
    "Report"
  ],
  "bibo:shortTitle": "Evaluating stillbirths : improving stillbirth data could help make stillbirths a visible public health priority",
  "bibo:AuthorList": [
    "IMMPACT",
    "Population Reference Bureau"
  ],
  "dc:date": "2007-02",
  "bibo:cites": [
    {
      "rawReferenceText": "Cynthia Stanton. Stillbirth Rates: Delivering Estimates",
      "authors": [

      ],
      "bibo:shortTitle": "Stillbirth Rates: Delivering Estimates",
      "doi": "10.1016/S0140-6736(06)68586-3"
    }
  ],
  "bibo:citedBy": [

  ],
  "similarities": [
    {
      "identifier": 29886,
      "sim:weight": 0.333121,
      "sim:AssociationMethod": "similarity_cosine"
    },
    {
      "identifier": 33044,
      "sim:weight": 0.325861,
      "sim:AssociationMethod": "similarity_cosine"
    },
    ...,
    {
      "identifier": 43755,
      "sim:weight": 0.173635,
      "sim:AssociationMethod": "similarity_cosine"
    }
  ]
}
