Data quality, transparency and reproducibility in large bibliographic datasets

Abstract

Increasingly, large bibliographic databases are hosted by dedicated teams that commit to database quality, curation, and sharing, thereby providing excellent sources of data. Some databases, such as PubMed or the HathiTrust Digital Library, offer APIs and document the steps to retrieve or process their data. Others of comparable size and importance to bibliographic scholarship, such as the ACM Digital Library, still forbid data mining. The additional cleaning and expansion steps required to overcome barriers to data acquisition must be reproducible and incorporated into the curation pipeline, or the use of large bibliographic databases for analysis will remain costly, time-consuming, and inconsistent. In this presentation, we will describe our efforts to create reproducible workflows to generate datasets from three large bibliographic databases: PubMed, DBLP (as a proxy for the ACM Digital Library), and HathiTrust. We will compare these sources of bibliographic data and address the following: initial download and setup, gap analysis, and supplemental sources for data retrieval and integration. By sharing our workflows and discussing both automated and manual steps of data enhancement, we hope to encourage researchers and data providers to think about sharing the responsibility of openness, transparency, and reproducibility in re-using large bibliographic databases.
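As a point of reference for the API-based retrieval mentioned above, the following is a minimal sketch of fetching PubMed records through the public NCBI E-utilities endpoints. The query string, result limit, and helper names are illustrative assumptions, not the workflow described in this abstract.

```python
# Minimal sketch: retrieving PubMed bibliographic records via NCBI E-utilities.
# The query and retmax values below are placeholders for illustration only.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def search_pubmed(query: str, retmax: int = 20) -> list[str]:
    """Return a list of PubMed IDs (PMIDs) matching the query."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]


def fetch_records(pmids: list[str]) -> str:
    """Fetch the full bibliographic records (PubMed XML) for the given PMIDs."""
    resp = requests.get(
        f"{EUTILS}/efetch.fcgi",
        params={"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # Hypothetical example query; any PubMed search expression works here.
    pmids = search_pubmed("bibliometrics[Title]", retmax=5)
    print(fetch_records(pmids)[:500])
```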
