Download millions of research outputs for text and data analysis

  • Download all CORE data for big data processing
  • Prototype, analyse and mine your data in your infrastructure
  • World's largest full text collection of scientific papers for machine processing

CORE data can be downloaded as a bulk dataset, allowing you to process it on your own computer or within your own infrastructure. The dataset provides a harmonised and enriched data format for accessing content from across our data providers. This makes it well suited to prototyping new methods, especially when intensive data processing needs to be run. It is also a good choice for data analysis and text mining.

If you use CORE in your work, we kindly ask you to cite one of our publications.

Dataset 2018-03-01

Metadata only dataset (beta) (127 GB) - 123M metadata items, 85.6M items with abstract

With full text dataset (beta) (330 GB) - 123M metadata items, 85.6M items with abstract, 9.8M items with full text

Documentation and access to previous datasets.

What’s included

The dataset is free to use for non-commercial purposes. If you want to use the dataset for commercial purposes, including consultancies and commercial analysis, please contact us. The dataset provides you with:

  • The entire CORE corpus of both metadata and full texts in a machine-processable format.
  • Mappings of the CORE articles to entities in the Microsoft Academic Graph (MAG), enabling you to access CORE full texts and use additional entities from MAG where available.
  • Detailed documentation on how to download the CORE dataset and how the data is organised.
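
As an illustration of processing the corpus in your own infrastructure, the following is a minimal Python sketch that streams records one at a time and counts how many carry an abstract or a full text. It rests on assumptions not stated on this page: that each record is a JSON object stored one per line, and that the fields are named `abstract` and `fullText`. Both are hypothetical here, so check the dataset documentation for the actual layout and field names before adapting it.

```python
import io
import json
from typing import Dict, Iterable


def count_coverage(lines: Iterable[str]) -> Dict[str, int]:
    """Stream records line by line and count which ones carry text fields."""
    counts = {"items": 0, "with_abstract": 0, "with_full_text": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts["items"] += 1
        # ASSUMPTION: "abstract" and "fullText" are hypothetical field names.
        if record.get("abstract"):
            counts["with_abstract"] += 1
        if record.get("fullText"):
            counts["with_full_text"] += 1
    return counts


# Tiny in-memory sample standing in for one (hypothetical) dataset file.
sample = io.StringIO(
    '{"title": "A", "abstract": "x", "fullText": "body"}\n'
    '{"title": "B", "abstract": "y", "fullText": null}\n'
    '{"title": "C", "abstract": null}\n'
)
coverage = count_coverage(sample)
print(coverage)
```

Because the function only iterates over lines, it can be given an open file handle instead of the in-memory sample, so memory use stays flat regardless of file size.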

Disclaimer

This dataset has been created from information that was publicly available on the Internet. Every effort has been made to ensure this dataset contains open access content only. We have included content only from repositories and journals that are listed in registries where the condition for inclusion is the provision of content under an open access compatible license. However, as metadata are often inconsistent, license information is often not machine readable and, from time to time, repositories leak information that is not open access, we cannot take any responsibility for the license of the content in the dataset. It is therefore up to the user of this dataset to ensure that the way in which they use the dataset does not breach copyright. The dataset is in no way intended for the purposes of reading the original publications, but for machine processing only.