6 research outputs found

    Automating Software Citation using GitCite

    The ability to cite software and give credit to its authors and contributors is increasingly important. While the number of online open-source software repositories has grown rapidly over the past few years, few are properly cited when used, owing to the difficulty of creating appropriate citations and the lack of automated techniques. This paper presents GitCite, a model for software citation with version control that enables citations to be inferred for any project component from a small number of explicit citations attached to subdirectories and files, together with an implementation that integrates with Git and GitHub. The implementation includes a browser extension and a local executable tool, which allow citations to be added to, modified in, or deleted from software project repositories and managed through functions such as fork, merge, and copy.
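    The inference model described in the abstract can be pictured as a nearest-ancestor lookup: a file inherits the citation attached to the closest enclosing directory that carries one. The sketch below is a hypothetical illustration of that idea only; the function name, data layout, and citation strings are assumptions, not GitCite's actual API.

    ```python
    # Hypothetical sketch of GitCite-style citation inference: a path
    # inherits the citation of its nearest ancestor (or itself) that
    # carries an explicit citation. Names and formats are illustrative.
    from pathlib import PurePosixPath

    def infer_citation(path, explicit):
        """Return the citation inherited by `path` from the closest
        path prefix with an explicit citation, or None."""
        p = PurePosixPath(path)
        for candidate in [p, *p.parents]:
            if str(candidate) in explicit:
                return explicit[str(candidate)]
        return None

    # One explicit citation on a subdirectory covers everything below it.
    explicit = {"src/solver": "Doe et al., Solver Library v2.1"}
    print(infer_citation("src/solver/core/sparse.py", explicit))
    # → Doe et al., Solver Library v2.1
    ```

    A small number of explicit entries can thus cover an entire repository, which is what makes the approach practical for large projects.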

    Data Citation: A New Provenance Challenge


    Data citation and the citation graph

    The citation graph is a computational artifact widely used to represent the domain of published literature. It represents connections between published works, such as citations and authorship, and, among other things, supports the computation of bibliometric measures such as h-indexes and impact factors. There is now increasing demand that we treat the publication of data in the same way that we treat conventional publications; in particular, we should cite data for the same reasons that we cite other publications. In this paper we discuss what is needed for the citation graph to represent data citation. We identify two challenges: to model the evolution of credit appropriately (through references) over time, and to model citation not only of a data set treated as a single object but also of its parts. We describe an extension of the current citation graph model that addresses these challenges, built on two central concepts: citable units and reference subsumption. We discuss how this extension would enable data citation to be represented within the citation graph and how it allows for improvements in current practices for bibliometric computations, both for scientific publications and for data.
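    The two concepts named in the abstract can be illustrated with a toy graph in which a citable unit may contain sub-units, and a reference to a sub-unit is also counted toward (subsumed by) its parent. This is a minimal sketch of the intuition only, not the paper's formal model; the class and names are invented for illustration.

    ```python
    # Toy illustration of "citable units" with reference subsumption:
    # citing a part of a dataset also credits the whole dataset.
    class CitableUnit:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.refs = name, parent, 0

        def cite(self):
            # Credit propagates from the cited unit up to every ancestor.
            unit = self
            while unit is not None:
                unit.refs += 1
                unit = unit.parent

    dataset = CitableUnit("GenomeDB v3")
    subset = CitableUnit("GenomeDB v3 / chr21 table", parent=dataset)

    subset.cite()   # citing a part also credits the whole dataset
    dataset.cite()  # citing the whole credits only the whole
    print(dataset.refs, subset.refs)  # → 2 1
    ```

    Under this accounting, bibliometric measures computed over the graph can credit a data set for citations to any of its parts, which is the behavior the extended model aims to support.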

    Towards The Efficient Use Of Fine-Grained Provenance In Data Science Applications

    Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this issue, my goal is to expedite the use of fine-grained provenance in applications within both the database and machine learning domains, which are ubiquitous in contemporary data science pipelines. In applications from the database domain, I focus on the problem of data citation and provide two types of solutions, rewriting-based and provenance-based, to generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information. In applications from the ML domain, I first consider the problem of incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, which incrementally update ML models constructed by SGD/GD methods using provenance information collected during the training phase on the full dataset, before the deletion requests. The second application from the ML domain that I focus on is cleaning label uncertainties in the ML training dataset more efficiently and cheaply. To address this problem, I propose a solution, CHEF, which reduces the cost and overhead at each phase of the label-cleaning pipeline while maintaining overall model performance. I also propose initial ideas for removing some of the assumptions used in these solutions so as to extend them to more general scenarios.
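    The core idea behind provenance-assisted deletion can be shown in a much-simplified form: if each training sample's gradient contribution is cached during training ("provenance"), a gradient-descent step can be corrected after a deletion request without recomputing over the raw data. This sketch is an illustration of that general idea for a single GD step, not the actual PrIU or DeltaGrad algorithms; all names and values are invented.

    ```python
    # Simplified illustration of provenance-assisted deletion (NOT the
    # PrIU/DeltaGrad algorithms): cache per-sample gradients during
    # training, then correct a cached GD step after deleting sample 0.
    import numpy as np

    rng = np.random.default_rng(0)
    lr, w0 = 0.1, np.zeros(3)
    grads = rng.normal(size=(5, 3))   # cached per-sample gradients ("provenance")

    n = grads.shape[0]
    g0 = grads[0]                     # gradient of the deleted sample
    mean_all = grads.mean(axis=0)
    # Mean gradient of the remaining samples, derived from cache alone.
    mean_rest = (n * mean_all - g0) / (n - 1)

    w_full = w0 - lr * mean_all                    # step trained on all samples
    w_incr = w_full + lr * (mean_all - mean_rest)  # correct cached step
    w_retrain = w0 - lr * grads[1:].mean(axis=0)   # step retrained from scratch

    print(np.allclose(w_incr, w_retrain))  # → True
    ```

    The incremental correction reuses only the cached gradients, which is what makes this style of update much cheaper than retraining when deletion requests touch a small subset of the data.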

    Automating data citation in CiteDB
