CORE harvests, maintains, enriches and makes available metadata and full-text content (typically a PDF) from many Open Access journals and repositories. This makes it a useful access point for those who would like to develop applications making use of this content. To support these activities, CORE is providing a free API.

If you use CORE in your work, we kindly request that you cite one of our publications.


The documentation, along with live examples, can be found here.

You can also view some practical examples using the CORE API in this TDM course.

Expected use

We expect the API to be used, for example, to:

  • Perform text mining to enrich the metadata of Open Access publications, or even to perform different kinds of semantic analysis of publications.
  • Semantically annotate publications (by means of crowdsourcing, collaborative sharing or natural language processing) to drive the emergence of nano-publications in certain research fields.
  • Link publications to research data.
  • Carry out impact and citation analysis in the Open Access domain.
  • Power many other services that need quick and easy access to the content of research publications.

Where to start

Please register here to receive an API key and start testing the live examples.

A good starting point for coding with our API is the iPython notebook available on GitHub.

In collaboration with rOpenSci, an R client for the CORE API is available here.
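For a minimal, dependency-free start, the single-request search method can also be called directly over HTTP. The base URL and the parameter names (`page`, `pageSize`, `apiKey`) in the sketch below are assumptions drawn from the API v2 conventions; confirm them against the live documentation before use:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://core.ac.uk/api-v2"  # assumed v2 base URL; check the live docs


def build_search_url(query: str, api_key: str, page: int = 1, page_size: int = 10) -> str:
    """Build the URL for the single-request /articles/search/{query} method."""
    encoded = urllib.parse.quote(query, safe="")
    params = urllib.parse.urlencode({
        "page": page,
        "pageSize": page_size,
        "apiKey": api_key,
    })
    return f"{API_BASE}/articles/search/{encoded}?{params}"


def search_articles(query: str, api_key: str) -> dict:
    """Perform a search and return the decoded JSON response."""
    with urllib.request.urlopen(build_search_url(query, api_key)) as resp:
        return json.load(resp)
```

Registering for an API key (see above) is required before any request will succeed.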


We apply quotas to the API to ensure fair access and fast response times for all users. Please get in touch if you need to access our API at a higher rate.
The quotas for each method are listed in the following tables:

Global methods

Method                                  Request type    Limit
/search                                 batch           1 request per 10 seconds
/search/{query}                         single          5 requests per 10 seconds

Article methods

Method                                  Request type    Limit
/articles/get                           batch           1 request per 10 seconds
/articles/get/{coreId}                  single          10 requests per 10 seconds
/articles/get/{coreId}/download/pdf     single          10 requests per 10 seconds
/articles/get/{coreId}/history          single          10 requests per 10 seconds
/articles/search                        batch           1 request per 10 seconds
/articles/search/{query}                single          10 requests per 10 seconds
/articles/similar                       single          10 requests per 10 seconds
/articles/similarsingle10 requests per 10 seconds

Journal methods

Method                                  Request type    Limit
/journals/get                           batch           1 request per 10 seconds
/journals/get/{issn}                    single          10 requests per 10 seconds
/journals/search                        batch           2 requests per 10 seconds
/journals/search/{query}                single          5 requests per 10 seconds

Repository methods

Method                                  Request type    Limit
/repositories/get                       batch           1 request per 10 seconds
/repositories/get/{repositoryId}        single          10 requests per 10 seconds
/repositories/search                    batch           2 requests per 10 seconds
/repositories/search/{query}            single          5 requests per 10 seconds

In case you require different limits please contact us.
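To stay within these quotas, a client can throttle itself before each request. The sketch below is a generic sliding-window rate limiter, not part of the CORE API; configure it with the per-method limits from the tables above (for example, 10 calls per 10 seconds for most single-request article methods):

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` within any `period`-second window."""

    def __init__(self, max_calls: int, period: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable for testing
        self.sleep = sleep
        self.calls = deque()    # timestamps of recent calls

    def acquire(self) -> None:
        """Block until another request is allowed under the quota."""
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Wait until the oldest call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
        self.calls.append(now)


# Usage: one limiter per method, e.g. for /articles/get/{coreId}:
# limiter = RateLimiter(max_calls=10, period=10.0)
# limiter.acquire()  # call before every request
```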

CORE data as Linked Open Data (LOD)

In addition to the CORE API, CORE also provides its data as LOD for enthusiasts. The documentation is available at the datahub. Please note that this data is not synced regularly; we encourage all developers to use the CORE API v2 instead.

CORE Dataset

The data aggregated from repositories by the CORE system can be accessed in two ways, through the CORE API or by downloading the data to your computer. The former option is practical if you want to build a service on top of CORE while the latter is something we recommend to those who would like to analyse the CORE dataset and/or apply some computationally intensive batch processes.

If you use CORE in your work, we kindly request you to cite one of our publications.

Available datasets:

2018 onwards:

Microsoft Academic Graph - CORE mapping This dataset includes the mapping between the CORE data and the Microsoft Academic Graph dataset, based on DOIs. The compressed version of the dataset is 71 MB and contains 7,879,310 rows in CSV format.

Register for Access

Dataset 2018-03-01
Register for Access
Metadata only dataset (beta) (127 GB) - 123M metadata items, 85.6M items with abstract
With full text dataset (beta) (330 GB) - 123M metadata items, 85.6M items with abstract, 9.8M items with full text

For extracting all repositories, you will need up to 509 GB (metadata) or 1.3 TB (full text) of free space on disk. It is possible to extract each repository individually. Please read the section "Structure of the dumps" below for more information.
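When disk space is tight, one option is to pull a single repository's [repositoryID].json.xz out of the dump tar without unpacking everything. A sketch using only Python's standard library (the file name and repository ID below are hypothetical placeholders):

```python
import tarfile


def extract_one_repository(dump_path: str, repository_id: str, dest: str = ".") -> str:
    """Extract only [repositoryID].json.xz from the dump tar.

    Returns the member path inside the archive that was extracted.
    """
    member_name = f"{repository_id}.json.xz"
    with tarfile.open(dump_path) as tar:
        for member in tar.getmembers():
            # Member paths may include a leading directory,
            # so match on the basename suffix.
            if member.name.endswith(member_name):
                tar.extract(member, path=dest)
                return member.name
    raise FileNotFoundError(f"{member_name} not found in {dump_path}")


# Hypothetical usage:
# extract_one_repository("core_2018-03-01_metadata.tar", "86", dest="data/")
```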


Dataset 2017-11-01
Register for Access
Metadata only dataset (beta) (22.65 GB) - 64 million items
With full text dataset (beta) (157.38 GB) - 8 million items

Beta notice: while in beta, we cannot guarantee the completeness or integrity of the dataset, and we appreciate any feedback on its format or data. We are aware of an issue where the statistics in the repository object are incorrect or incomplete; we do not use these fields internally and plan to remove them in future releases.

Older versions (pre-2017):

These datasets conform to the dataset structure described below as 'Pre-2017'

  • Dataset 2016-10
    Register for Access
    Metadata dataset (9.0 GB) - (23.9 million items)
    Content dataset (102 GB) - (4 million items)
  • Dataset 2015-09
    Register for Access
    Metadata dataset (4.5 GB)
    Content dataset (30.5 GB)
  • Dataset 2014-06-13 (used for dataset track of DL2014)
    Register for Access
    Metadata dataset (3.7 GB)
    Content dataset (24 GB)

Structure of the dumps

2018 onwards

The downloadable tar file contains XZ-compressed files of article metadata, one per repository, named [repositoryID].json.xz. Once decompressed, each line of the text file contains the metadata for one article in JSON.

We chose the xz format because it offers a better compression ratio than bzip2 or gzip; the downside is that the tools are not always installed by default.
Most Linux distributions have xz available in the default package manager. Mac users can install xz via Homebrew or MacPorts, and there are many other free alternatives. Windows users can use 7-Zip. If you have any trouble extracting the files, please contact us.

Please note that each decompressed file is not valid JSON as a whole; however, each line is. Lines are delimited with Windows-style newlines (\r\n).
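Because each line is a standalone JSON document, a dump file can be streamed record by record without loading the whole file into memory. A sketch using only the standard library:

```python
import json
import lzma


def iter_articles(path: str):
    """Stream article metadata from a [repositoryID].json.xz dump file.

    The file as a whole is not valid JSON; each line is one JSON
    document, so it must be decoded line by line.
    """
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as fh:
        for line in fh:
            line = line.strip()  # drops the \r\n line delimiter
            if line:
                yield json.loads(line)


# Example: count items in one repository that have full text attached.
# with_fulltext = sum(1 for a in iter_articles("86.json.xz") if a.get("fullText"))
```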

The dump structure has changed to the following format:

    {
        "doi": str|None,
        "coreId": str|None,
        "oai": str|None,
        "identifiers": [str],
        "title": str|None,
        "authors": [str],
        "enrichments": {
            "references": [str],
            "documentType": {
                "type": str|None,
                "confidence": str|None
            }
        },
        "contributors": [str],
        "datePublished": str|None,
        "abstract": str|None,
        "downloadUrl": str|None,
        "fullTextIdentifier": str|None,
        "pdfHashValue": str|None,
        "publisher": str|None,
        "rawRecordXml": str|None,
        "language": str|None,
        "relations": [str],
        "year": int|None,
        "topics": [str],
        "subjects": [str],
        "fullText": str|None
    }


An example of a metadata item in the data set is as follows. A full record contains more fields, each in its entirety; the line breaks and truncated values below are only for this example.

    {
        "id": "28929927",
        "authors": [
            "Knoth, Petr",
            "Anastasiou, Lucas",
            "Pearce, Samuel"
        ],
        "datePublished": "2014",
        "deleted": "ALLOWED",
        "description": "Usage statistics are frequently used by repositories [Description field truncated for example]",
        "fullText": "Open Research Online\nThe Open University’s repository of research publications\nand other research outputs\nMy repository is being aggregated: a blessing or a\ncurse?\nConference Item\nHow to [full text field truncated for example]",
        "fullTextIdentifier": "",
        "identifiers": [ ... ],
        "rawRecordXml": "[rawRecordXml truncated for example]",
        "repositories": [{
            "id": "86",
            "openDoarId": 0,
            "name": "Open Research Online",
            ...
        }],
        "repositoryDocument": {
            "pdfStatus": 1,
            "textStatus": 1,
            "metadataUpdated": 1498862655000,
            "timestamp": 1479481001000,
            "indexed": 1,
            "deletedStatus": "0",
            "pdfSize": 364107,
            "tdmOnly": false
        },
        "title": "My repository is being aggregated: a blessing or a curse?",
        "downloadUrl": "",
        ...
    }


Pre-2017

The CORE dataset provides access to both the enriched metadata and the full texts. The data dump consists of two files, the metadata file and the content file; both are compressed using tar and gzip.

The structure of the metadata file is depicted in the diagram and an example of a metadata item in the data set is as follows:

    {
        "identifier": 13291,
        "ep:Repository": 1,
        "dc:type": [ ... ],
        "bibo:shortTitle": "Evaluating stillbirths : improving stillbirth data could help make stillbirths a visible public health priority",
        "bibo:AuthorList": [
            "Population Reference Bureau"
        ],
        "dc:date": "2007-02",
        "bibo:cites": [
            {
                "rawReferenceText": "Cynthia Stanton. Stillbirth Rates: Delivering Estimates",
                "authors": [ ... ],
                "bibo:shortTitle": "Stillbirth Rates: Delivering Estimates",
                "doi": "10.1016/S0140-6736(06)68586-3"
            }
        ],
        "bibo:citedBy": [ ... ],
        "similarities": [
            {
                "identifier": 29886,
                "sim:weight": 0.333121,
                "sim:AssociationMethod": "similarity_cosine"
            },
            {
                "identifier": 33044,
                "sim:weight": 0.325861,
                "sim:AssociationMethod": "similarity_cosine"
            },
            {
                "identifier": 43755,
                "sim:weight": 0.173635,
                "sim:AssociationMethod": "similarity_cosine"
            }
        ]
    }
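The similarities array makes it straightforward to pick out the closest related records, for example by sorting on sim:weight. A small sketch (the helper name is ours, not part of the dataset):

```python
def top_similar(record: dict, n: int = 2) -> list:
    """Return identifiers of the n most similar records, strongest first."""
    sims = sorted(record.get("similarities", []),
                  key=lambda s: s["sim:weight"], reverse=True)
    return [s["identifier"] for s in sims[:n]]


# Using the weights from the example record above:
record = {
    "identifier": 13291,
    "similarities": [
        {"identifier": 29886, "sim:weight": 0.333121},
        {"identifier": 33044, "sim:weight": 0.325861},
        {"identifier": 43755, "sim:weight": 0.173635},
    ],
}
# top_similar(record) -> [29886, 33044]
```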


This dataset has been created from information that was publicly available on the Internet. Every effort has been made to ensure this dataset contains open access content only. We have included content only from repositories and journals that are listed in registries where the condition for inclusion is the provision of content under an open access compatible license. However, as metadata are often inconsistent, license information is often not machine readable and, from time to time, repositories leak information that is not open access, we cannot take any responsibility for the license of the content in the dataset. It is therefore up to the user of this dataset to ensure that the way in which they use the dataset does not breach copyright. The dataset is in no way intended for the purposes of reading the original publications, but for machine processing only.


In an effort to improve the quality and transparency of the harvesting process of the open access content and create a two way collaboration between the CORE project and the providers of this content, CORE is introducing the Repositories Dashboard.

Access the dashboard now.

To register, please send us an email.

The aim of the Dashboard is to provide an online interface for repository providers and, through it, offer content providers valuable information about:

  • The content harvested from the repository, enabling its management, for example by requesting metadata updates or handling take-down requests.
  • The times and frequency of content harvesting, including all detected technical issues and suggestions for improving the efficiency of harvesting and the quality of metadata, including compliance with existing metadata guidelines.
  • Statistics on the repository content, such as the distribution of content by subject field and type of research output, and a comparison of these with the national average.

Existing users can invite other users to use the CORE dashboard, see our simple guide.

CORE Recommender

The new version of the CORE recommender has now been released.

The recommender is a plugin that can be installed in repositories and journal systems to suggest similar articles. Its purpose is to support users in finding articles relevant to what they read.

The current version of the plugin recommends full-text items in Open Access repositories that are related to:

  • a metadata record
  • a full-text item in PDF
  • any piece of text
  • any combination of the above

The CORE Recommender is deployed in various locations, such as on the CORE Portal and in institutional repositories and journals.

Uniqueness of the CORE Recommender:

  • Our methods rely on the availability of full-texts.
  • We don’t base our recommendations solely on abstracts or metadata.
  • We ensure that the recommended articles are available open access.
  • We provide our recommendation service for free.
  • We provide it using a machine accessible interface (API).
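That machine-accessible interface plausibly corresponds to the /articles/similar method listed in the quota tables above. The request shape below (a JSON body with a text field) is an assumption, not something documented here; check the live API documentation before relying on it:

```python
import json
import urllib.request

API_BASE = "https://core.ac.uk/api-v2"  # assumed base URL; see the live docs


def build_similar_request(text: str, api_key: str) -> urllib.request.Request:
    """Build a request for /articles/similar (quota: 10 per 10 seconds).

    The payload shape ({"text": ...}) is an assumption based on the
    recommender accepting "any piece of text" as input.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/articles/similar?apiKey={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request (network access and a valid key required):
# with urllib.request.urlopen(build_similar_request("some abstract", "KEY")) as r:
#     recommendations = json.load(r)
```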

Find out more about the CORE Recommender here. To install the recommender visit our registration page.

For those with access to the CORE Repositories Dashboard: the Recommender installation guidelines and an installation key can be found in the Dashboard. Log into the Dashboard and then choose the tab "Get the recommender".

Publisher connector

The CORE Publisher Connector is software that provides seamless access to Gold and Hybrid Gold Open Access articles aggregated from the non-standard systems of major publishers. CORE now harvests from several publishers using the Publisher Connector engine, which offers a unique way of accessing scientific content from scholarly publishers. The data is exposed via the ResourceSync protocol.

ResourceSync is a protocol that overcomes the limitations of OAI-PMH: it goes beyond plain metadata exchange, enables the sharing of any kind of resource, and offers advanced synchronization mechanisms over the web.

CORE is one of the first to deploy ResourceSync for distributing large amounts of scholarly literature that scales to millions of items and is capable of real-time updates. In our deployment of ResourceSync, we utilise the generic notion of a resource in the protocol and share more than one representation, i.e. each record contains both metadata and full text.

You can access the content offered by the Publisher Connector at CORE's ResourceSync endpoint:
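ResourceSync documents build on the Sitemap XML format, so a resource list fetched from the endpoint can be parsed with a standard XML parser. A minimal sketch (the document layout should be confirmed against the endpoint itself):

```python
import xml.etree.ElementTree as ET

# ResourceSync reuses the Sitemap namespace for its list documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_resource_list(xml_text: str) -> list:
    """Extract resource URLs from a ResourceSync resource list.

    Each resource appears as a <url><loc>...</loc></url> entry,
    exactly as in a plain sitemap.
    """
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]


# Usage: fetch the endpoint's resource list with any HTTP client,
# then pass the response body to parse_resource_list().
```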