
CORE FastSync

Resource Dump

The CORE data dump is produced following the standard ResourceSync Framework Resource Dump approach. You can download the dump at the following URL:

Please note that at the time of writing the dump is 186 GB, so use a download tool that is suited to files of this size.

To validate the download you can compare the MD5 Checksum by running:

            md5sum resync_dump.tar.xz
Then check that the output hash matches the published MD5 checksum.
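The same check can be scripted. The following is a minimal Python sketch, not part of the CORE tooling: it reads the archive in chunks so that a file of this size never has to fit in memory. The function names and the chunk size are illustrative choices, and the expected hash must be supplied by you from the published checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the computed digest with the expected one, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("resync_dump.tar.xz", expected)` returns True only when the download is intact, where `expected` is the hash string copied from the download page.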
To perform the extraction run:
            tar -xf resync_dump.tar.xz -C /target/directory
Replace /target/directory with the directory of your choice. The dump contains smaller archives, one per CORE data provider; the full and up-to-date list of providers can be found here. The next command extracts the documents from one of these smaller archives:
    mkdir -p PROVIDER && tar xf PROVIDER.tar.xz -C PROVIDER/ && cat PROVIDER/manifest_.xml && sed -e 's#/data/core-remote/scripts/resync_dump/tmp-new/PROVIDER/##g' PROVIDER/manifest_.xml > PROVIDER/manifest.xml
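Replace PROVIDER with the name of each archive in turn. The command is more complex than it should be because of a minor error in this version of the dump: the manifest contains absolute paths from the build machine, so an additional step is required to strip them and create a compliant dataset.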
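For readers who prefer to script the manifest clean-up, here is a rough Python equivalent of the sed step in the command above. It is only a sketch: the prefix is copied from that command, while the function names are made up for illustration. Unlike sed, it treats the prefix as a literal string rather than a regular expression.

```python
from pathlib import Path

# Build prefix taken from the sed expression above; the provider name is filled in per archive.
BUILD_PREFIX_TEMPLATE = "/data/core-remote/scripts/resync_dump/tmp-new/{provider}/"


def strip_build_prefix(manifest_text, provider):
    """Remove every occurrence of the provider's absolute build path."""
    return manifest_text.replace(BUILD_PREFIX_TEMPLATE.format(provider=provider), "")


def write_clean_manifest(provider_dir, provider):
    """Read manifest_.xml from provider_dir and write the cleaned manifest.xml next to it."""
    source = Path(provider_dir) / "manifest_.xml"
    target = Path(provider_dir) / "manifest.xml"
    cleaned = strip_build_prefix(source.read_text(encoding="utf-8"), provider)
    target.write_text(cleaned, encoding="utf-8")
    return target
```

Calling `write_clean_manifest("PROVIDER", "PROVIDER")` after the archive has been extracted leaves manifest_.xml untouched and writes the corrected manifest.xml beside it.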

The extracted folder contains a two-level-deep directory structure and a manifest.xml file in the root that lists the available resources. This is the format of a single entry in the manifest:

The link inside the tag is the ID of the file, which can be used for further updates. The path is where the file can be found within the dump and, in order to validate the file, an MD5 hash and the file size are also provided.
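The manifest entries described above can be checked programmatically. The sketch below assumes the manifest follows the generic ResourceSync layout: sitemap `url` elements, each with a `loc` link and an `rs:md` element carrying `hash="md5:..."`, `length` and `path` attributes. That layout comes from the ResourceSync specification rather than from this page, so compare it with a real manifest.xml before relying on it. The function names are illustrative.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespaces defined by the Sitemap protocol and the ResourceSync specification.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def read_manifest(manifest_path):
    """Yield (loc, relative_path, md5_hex, length) for every entry in the manifest."""
    root = ET.parse(str(manifest_path)).getroot()
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        md = url.find("rs:md", NS)
        if md is None:
            continue
        # The hash attribute may hold several space-separated "algo:value" pairs.
        md5_hex = None
        for item in md.get("hash", "").split():
            algo, _, value = item.partition(":")
            if algo.lower() == "md5":
                md5_hex = value.lower()
        length = int(md.get("length")) if md.get("length") else None
        yield loc, md.get("path", "").lstrip("/"), md5_hex, length


def verify_entry(root_dir, relative_path, md5_hex, length):
    """Return True when the file exists and matches the recorded size and MD5 hash."""
    file_path = Path(root_dir) / relative_path
    if not file_path.is_file():
        return False
    if length is not None and file_path.stat().st_size != length:
        return False
    if md5_hex is not None:
        # Reads the whole file at once; acceptable for a sketch with small records.
        return hashlib.md5(file_path.read_bytes()).hexdigest() == md5_hex
    return True
```

Checking the size before hashing is a deliberate ordering: a size mismatch is cheap to detect and makes the more expensive MD5 computation unnecessary.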

Data structure

This is a sample data structure from the Resource Dump:

{
	"doi": DOI,
	"coreId": "228783",
	"title": "TITLE",
	"authors": ["AUTHOR1", "AUTHOR2"],
	"enrichments": {
		"references": [REFERENCES],
		"documentType": {
			"confidence": CONFIDENCE
		}
	},
	"contributors": [CONTRIBUTORS],
	"datePublished": "DATE OR YEAR",
	"abstract": "ABSTRACT",
	"fullTextIdentifier": FULL TEXT ID IF AVAILABLE,
	"publisher": PUBLISHER,
	"rawRecordXml": "XML RECORD",
	"journals": [JOURNALS],
	"language": {
		"code": "COUNTRY CODE",
		"name": "LANGUAGE NAME",
		"id": ID
	},
	"relations": ["URLs WITH RELATIONS"],
	"topics": ["TOPIC1", "TOPIC2"],
	"subjects": ["SUBJECT1", "SUBJECT2"],
	"fullText": "FULL TEXT"
}
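As an illustration of how a record with the fields listed above might be consumed, the following Python sketch pulls a few of them out of a parsed record. It assumes the record has already been loaded as a JSON object and that any field may be missing or null. The helper name and the sample values are placeholders, not real CORE data.

```python
import json


def summarise_record(record):
    """Pick a few fields from a parsed record, tolerating missing or null values."""
    language = record.get("language") or {}
    enrichments = record.get("enrichments") or {}
    return {
        "coreId": record.get("coreId"),
        "title": record.get("title"),
        "authors": record.get("authors") or [],
        "language": language.get("name"),
        "reference_count": len(enrichments.get("references") or []),
        "has_full_text": bool(record.get("fullText")),
    }


# Placeholder record that mirrors the field names of the sample structure.
sample = json.loads("""
{
    "doi": null,
    "coreId": "228783",
    "title": "TITLE",
    "authors": ["AUTHOR1", "AUTHOR2"],
    "enrichments": {"references": [], "documentType": {"confidence": null}},
    "language": {"code": "CODE", "name": "LANGUAGE NAME", "id": null},
    "fullText": "FULL TEXT"
}
""")
summary = summarise_record(sample)
```

Using `.get` with a fallback instead of direct indexing keeps the code working for records that lack, for example, a full text or a detected language.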


Keeping the Resource dump up to date

Clone the rs-aggregator project from GitHub:

            git clone
Once cloned, run mvn install from the root of the project. The changelist download URL ends in the following form: [unix_timestamp_in_ms]/changelist_index.xml
The timestamp should match the last time a download was performed. Once the URL is generated, place it in cfg/uri-list.txt as the first URL and run the application with:
    java -cp target/rs-aggregator-jar-with-dependencies.jar
This should start the synchronisation process. The updated files will be in the folder below, with the same structure as described in the dump.
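The URL preparation described above can be sketched in Python as follows. The base URL of the changelist service is not given on this page, so it is passed in as a parameter, and joining it to the timestamp with a slash is an assumption. The helper names are illustrative, and the file is expected to hold one URL per line.

```python
from datetime import datetime, timezone
from pathlib import Path


def to_unix_ms(moment):
    """Convert a timezone-aware datetime to a Unix timestamp in milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time ambiguity")
    return int(moment.timestamp() * 1000)


def changelist_url(base_url, last_download):
    """Build <base_url>/<unix_timestamp_in_ms>/changelist_index.xml."""
    return f"{base_url.rstrip('/')}/{to_unix_ms(last_download)}/changelist_index.xml"


def put_first_in_uri_list(uri_list_path, url):
    """Write url as the first line of the list, keeping the other entries after it."""
    path = Path(uri_list_path)
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    lines = [url] + [line for line in existing if line.strip() and line.strip() != url]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

A typical call would be `put_first_in_uri_list("cfg/uri-list.txt", changelist_url(base, last_sync))`, where `base` is the service URL and `last_sync` is the time of your previous download, for example `datetime(2020, 1, 1, tzinfo=timezone.utc)`.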