A large dataset of software mentions in the biomedical literature

Abstract

We describe CZ Software Mentions, a new dataset of software mentions in biomedical papers. Plain-text mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique software mention strings from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset, and 934k unique mentions from 3 million papers in the Publishers' collection. We propose a clustering-based disambiguation algorithm that maps plain-text software mentions to distinct software entities, and we apply it to the NIH PubMed Central Commercial collection. With this method, we disambiguate the 1.12 million unique strings into 97,600 unique software entities, covering 78% of all software-paper links. We link 185,000 of the mentions to repositories, covering about 55% of all software-paper links. We describe in detail the process of building the datasets and of disambiguating and linking the software mentions. We make all data and code publicly available to help assess the impact of software (in particular scientific open source projects) on science.

    Last time updated on 02/05/2024