The CORE dump follows the Resource Dump approach defined by the
ResourceSync Framework standard.
Note that this is an extremely large file (~395 GB), so appropriate tools are necessary for downloading it. Once extracted, it will use about 2.1 TB of disk space.
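Given the sizes quoted above, it may be worth confirming that the target filesystem has enough free space before extracting. The following is a minimal Python sketch of such a check; the function names are hypothetical, and the 2.1 TB figure is the approximate value from the text, interpreted here as decimal terabytes (an assumption).

```python
import shutil

# Approximate size of the dump once extracted, as quoted in the text.
# Interpreting "TB" as 10**12 bytes is an assumption.
EXTRACTED_SIZE_TB = 2.1


def required_bytes_for_extraction(extracted_tb: float = EXTRACTED_SIZE_TB) -> int:
    """Convert the approximate extracted size from terabytes to bytes."""
    return round(extracted_tb * 10**12)


def has_enough_space(target_dir: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding target_dir has enough free bytes."""
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= required_bytes
```

For example, has_enough_space("/target/directory", required_bytes_for_extraction()) returns False when the target cannot hold the extracted dump. The compressed archive needs additional space if it is stored on the same filesystem.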
Perform the extraction by running:
tar -xf resync_dump.tar.xz -C /target/directory
The previous step extracts the large archive into multiple smaller archives.
Each archive contains all the resources for a specific CORE data provider;
the full list of providers is available on our data providers
page.
The following script extracts each archive into its own
folder.
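#!/bin/bash
# Extract each per-provider archive found in tmp/ into output/<provider ID>/
for FILE in tmp/*.tar.xz
do
    # Strip the directory and the .tar.xz suffix to obtain the provider ID
    PROVIDER="$(basename "$FILE" .tar.xz)"
    echo "$PROVIDER"
    echo "$FILE"
    mkdir -p "output/$PROVIDER"
    tar -xf "$FILE" -C "output/$PROVIDER/"
done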
In the script, PROVIDER is derived from each archive's file name, so each archive is extracted into its own folder under output/.
Each folder extracted in the previous step has a two-level-deep
file structure and includes a manifest file named manifest.xml
in its root, which lists the available resources. Below is the format of
a single entry in the manifest:
<url>
<loc>https://core.ac.uk/api-v2/articles/get/132196135</loc>
<rs:md
hash="md5:39127e4b3b76fc5a66f3eabee28ab71f"
length="3759"
type="application/json"
path="/182/a2/132196135.json"
/>
</url>
The URL inside the <loc></loc>
tags is the ID of the file, which can be used
for tracking future updates to the resource. The path attribute is where the
file can be found in the folder structure. To validate the file,
an MD5 checksum (hash) and the file size in bytes (length) are also provided.
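The validation described above can be sketched in Python as follows. This is a minimal illustration, not the official tooling: the function names are hypothetical, and it assumes that manifest.xml declares the rs: namespace prefix, that each hash attribute holds a single md5:<hex> value as in the sample entry, and that the path attribute is relative to the folder containing the manifest.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag: str) -> str:
    """Strip an XML namespace, e.g. '{http://...}md' -> 'md'."""
    return tag.rsplit("}", 1)[-1]


def read_manifest(manifest_path):
    """Yield (loc, md5_hex, length, relative_path) for each <url> entry."""
    # iterparse streams the file, which matters for very large manifests.
    for _, elem in ET.iterparse(manifest_path, events=("end",)):
        if local_name(elem.tag) != "url":
            continue
        loc, md = None, None
        for child in elem:
            name = local_name(child.tag)
            if name == "loc":
                loc = (child.text or "").strip()
            elif name == "md":
                md = dict(child.attrib)
        if md is not None:
            # Assumes the sample format: hash="md5:<hex digest>"
            md5_hex = md["hash"].split(":", 1)[1]
            yield loc, md5_hex, int(md["length"]), md["path"].lstrip("/")
        elem.clear()  # free memory held by the processed entry


def validate_entry(root_dir, md5_hex, length, relative_path) -> bool:
    """Check a file's size and MD5 checksum against the manifest values."""
    file_path = Path(root_dir) / relative_path
    if not file_path.is_file() or file_path.stat().st_size != length:
        return False
    digest = hashlib.md5()
    with open(file_path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5_hex.lower()
```

A typical use would be to loop over read_manifest(folder + "/manifest.xml") and report every entry for which validate_entry(folder, md5_hex, length, relative_path) returns False.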
This is the data structure of a sample record from the dataset:
{
"doi": DOI,
"coreId": "228783",
"oai": OAI_IDENTIFIER,
"identifiers": [ADDITIONAL IDENTIFIERS],
"title": "TITLE",
"authors": ["AUTHOR1", "AUTHOR2"],
"enrichments": {
"references": [REFERENCES],
"documentType": {
"type": "RESEARCH|THESIS|PRESENTATION",
"confidence": CONFIDENCE
},
"citationCount": COUNT
},
"contributors": [CONTRIBUTORS],
"datePublished": "DATE OR YEAR",
"abstract": "ABSTRACT",
"downloadUrl": DOWNLOAD URL IF AVAILABLE,
"fullTextIdentifier": FULL TEXT ID IF AVAILABLE,
"pdfHashValue": HASH OF THE PDF IF AVAILABLE,
"publisher": PUBLISHER,
"rawRecordXml": "XML RECORD",
"journals": [JOURNALS],
"language": {
"code": "COUNTRY CODE",
"name": "LANGUAGE NAME",
"id": ID
},
"relations": ["URLs WITH RELATIONS"],
"year": PUBLICATION YEAR,
"topics": ["TOPIC1", "TOPIC2"],
"subjects": ["SUBJECT1", "SUBJECT2"],
"issn": "ISSN-IDENTIFIER",
"fullText": "FULL TEXT"
}
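As an illustration of working with this structure, the following Python sketch loads a single record and picks out a few fields. The function names are hypothetical. It assumes the files are UTF-8 encoded JSON and that optional fields (for example those marked "IF AVAILABLE" above) may be missing or null, so every lookup falls back to a default.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_record(root_dir: str, manifest_path: str) -> Dict[str, Any]:
    """Load one JSON record given the extracted folder and a manifest 'path' value."""
    file_path = Path(root_dir) / manifest_path.lstrip("/")
    with open(file_path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a few fields, tolerating keys that are missing or null."""
    enrichments = record.get("enrichments") or {}
    doc_type = enrichments.get("documentType") or {}
    language = record.get("language") or {}
    return {
        "coreId": record.get("coreId"),
        "title": record.get("title"),
        "authors": record.get("authors") or [],
        "year": record.get("year"),
        "documentType": doc_type.get("type"),
        "citationCount": enrichments.get("citationCount"),
        "language": language.get("name"),
        "hasFullText": bool(record.get("fullText")),
    }
```

For the manifest entry shown earlier, the record would be read with load_record(folder, "/182/a2/132196135.json"), where folder is the extracted provider folder.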