International Journal of Digital Curation
    533 research outputs found

    Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse

    Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make the application of FAIR principles to data cleaning recipes challenging. We then demonstrate how the transparency and reusability of recipes can be improved by analyzing dataflow dependencies within recipes. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which can then be reused individually as data cleaning modules. We have prototypically implemented this approach as part of an ongoing project to develop open-source companion tools for OpenRefine.
    Keywords: Data Cleaning, Provenance, Workflow Analysis
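
    The module-detection idea can be illustrated with a short sketch: model each recipe step by the columns it reads and writes, connect steps that share a dataflow dependency, and report each connected component as an independently reusable module. The step names and column sets below are invented for illustration; this is not the OpenRefine companion tools' implementation.

```python
# Hypothetical sketch: detect independent subworkflows in a cleaning
# recipe from column-level dependencies. Steps touching disjoint
# column sets form separate, individually reusable modules.
from collections import defaultdict

# Each step is (name, columns_read, columns_written); illustrative only.
recipe = [
    ("trim_names",      {"name"}, {"name"}),
    ("split_name",      {"name"}, {"first", "last"}),
    ("parse_dates",     {"date"}, {"date"}),
    ("normalize_dates", {"date"}, {"date_iso"}),
]

def detect_modules(steps):
    # Union-find over steps: two steps belong to the same module if
    # one reads a column the other wrote (a dataflow dependency).
    parent = list(range(len(steps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    writers = defaultdict(list)  # column -> indices of steps writing it
    for i, (_, _, writes) in enumerate(steps):
        for col in writes:
            writers[col].append(i)
    for i, (_, reads, _) in enumerate(steps):
        for col in reads:
            for j in writers[col]:
                if j != i:
                    union(i, j)

    modules = defaultdict(list)
    for i, (name, _, _) in enumerate(steps):
        modules[find(i)].append(name)
    return list(modules.values())

print(detect_modules(recipe))
# [['trim_names', 'split_name'], ['parse_dates', 'normalize_dates']]
```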

    Research Data Management Practices at the University of Namibia: Moving Towards Adoption

    The management of research data in academic institutions is increasing across most disciplines. In Namibia, the requirement to manage research data, making it available for sharing, preservation, and the support of research findings, has not yet been mandated. At the University of Namibia (UNAM) there is no institutional research data management (RDM) culture, yet RDM may nevertheless be practiced among its researchers. The extent to which these practices have been adopted is, however, not known. This study investigated the extent of RDM adoption by researchers at UNAM. It identifies current or potential challenges in managing research data, and proposes solutions to some of these challenges that could aid the university as it attempts to encourage the adoption of RDM practices. The investigation used Rogers’ Diffusion of Innovations (DOI) theory, with a focus on the innovation-decision process, as a means to establish where UNAM researchers are in the process of adopting RDM practices. The population under study was the UNAM faculty members who conduct research as part of their academic duties. Questionnaires were used to gather quantitative data. The study found that some researchers practice RDM to some extent of their own free will, but there are many challenges that hinder these practices. Overall, though, there is a lack of interest in RDM, as knowledge of the concept among researchers is relatively low. The study found that most researchers were at the knowledge stage of the innovation-decision process and recommended, among other things, that the university put effort into creating RDM awareness and encouraging data sharing, and that it move forward with infrastructure and policy development so that RDM can be fully adopted by the researchers of the institution.

    OpenStack Swift: An Ideal Bit-Level Object Storage System for Digital Preservation

    A bit-level object storage system is a foundational building block of long-term digital preservation (LTDP). To achieve the purposes of LTDP, the system must be able to: preserve the authenticity and integrity of the original digital objects; scale up with dramatically increasing demands for preservation storage; mitigate the impact of hardware obsolescence and software ephemerality; replicate digital objects among distributed data centers at different geographical locations; and constantly audit and automatically recover from compromised states. The challenge in satisfying these requirements is not only to overcome technological difficulties but also to maintain economic sustainability by implementing and continuously operating such systems in a cost-effective way. In this paper, we present OpenStack Swift, an open-source, mature and widely accepted cloud platform, as a practical and proven solution, with a case study at the University of Alberta Library. We emphasize the implementation, application, cost analysis and maintenance of the system, with the purpose of contributing to the community an exceedingly robust, highly scalable, self-healing and comparatively cost-effective bit-level object storage system for long-term digital preservation.
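
    As a rough illustration of the integrity guarantees described above, the sketch below uses the python-swiftclient library to upload an object with a client-computed MD5 checksum, which Swift verifies server-side before accepting the object. The endpoint, credentials, container, and file names are placeholders, not the University of Alberta Library configuration.

```python
# Sketch of a fixity-checked upload to OpenStack Swift using the
# python-swiftclient library; endpoint and credentials are placeholders.
import hashlib
from swiftclient.client import Connection

conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # placeholder endpoint
    user="preservation:ingest",
    key="secret",
)

def preserve_object(container, name, path):
    with open(path, "rb") as f:
        data = f.read()
    md5 = hashlib.md5(data).hexdigest()
    # Passing the checksum as the etag makes Swift verify the object
    # server-side and reject the upload if the bits were corrupted in
    # transit; replication and background auditing then run automatically.
    conn.put_object(container, name, contents=data, etag=md5)
    return md5

fixity = preserve_object("master-copies", "book-001.tiff", "book-001.tiff")
print("stored with MD5", fixity)
```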

    Automation is Documentation: Functional Documentation of Human-Machine Interaction for Future Software Reuse

    Preserving software and providing access to obsolete software is necessary and will become even more important for work with any kind of born-digital artifacts. While the usability and availability of emulation in digital curation and preservation workflows have improved significantly, productive (re)use of preserved obsolete software is a growing concern due to a lack of (future) operational knowledge. In this article we describe solutions to automate and document software usage in such a way that the result is not only instructive but also productive.
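
    One way to read “automation is documentation” is that a single declarative interaction script can both drive preserved software and be rendered as human-readable usage instructions. The sketch below illustrates that idea only; the Emulator interface and the WordPerfect actions are hypothetical and do not reflect the article's actual tooling.

```python
# Illustrative sketch only: a declarative interaction script that can be
# replayed against an emulated machine (productive) or rendered as
# step-by-step instructions (instructive). The emulator interface is
# hypothetical; the article's actual tooling may differ.
ACTIONS = [
    {"do": "type", "text": "WP.EXE", "note": "start WordPerfect"},
    {"do": "key",  "key": "enter",   "note": "confirm"},
    {"do": "key",  "key": "F7",      "note": "open the exit/save dialog"},
]

def render_docs(actions):
    """Turn the script into human-readable usage documentation."""
    return "\n".join(
        f"{i + 1}. {a['note']} ({a['do']}: {a.get('text') or a.get('key')})"
        for i, a in enumerate(actions)
    )

def replay(actions, emulator):
    """Drive the preserved software with the very same script."""
    for a in actions:
        if a["do"] == "type":
            emulator.type_text(a["text"])   # hypothetical emulator call
        elif a["do"] == "key":
            emulator.press_key(a["key"])    # hypothetical emulator call

print(render_docs(ACTIONS))
```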

    Building LABDRIVE, a Petabyte Scale, OAIS/ISO 16363 Conformant, Environmentally Sustainable Archive, Tested by Large Scientific Organisations to Preserve their Raw and Processed Data, Software and Documents

    Vast amounts of scientific, cultural, social, business, government, and other information are being created every day. There are billions of objects, in a multitude of formats, with varied semantics and associated software. Much, perhaps the majority, of this information is transitory, but there is still an immense amount which should be preserved for the medium and long term, perhaps even indefinitely. Preservation requires that the information continues to be usable, not simply that it can be printed or displayed. Of course, the digital objects (the bits) must be preserved, as must the “metadata” that enables the bits to be understood, which includes the software. Before LABDRIVE no system could adequately preserve such information, especially in such gigantic volume and variety. In this paper we describe the development of LABDRIVE and its ability to preserve tens or hundreds of petabytes in a way which is conformant to the OAIS Reference Model and capable of being ISO 16363 certified.

    Synchronic Curation for Assessing Reuse and Integration Fitness of Multiple Data Collections

    Data-driven applications often require data integrated from different, large, and continuously updated collections. Each of these collections may present gaps, overlapping data, or conflicting information, or the collections may complement each other. Thus, a curation need is to continuously assess whether data from multiple collections are fit for integration and reuse. To assess different large data collections at the same time, we present the Synchronic Curation (SC) framework. SC involves processing steps to map the different collections to a unifying data model that represents research problems in a scientific area. The data model, which includes the collections' provenance and a data dictionary, is implemented in a graph database where collections are continuously ingested and can be queried. SC has a collection analysis and comparison module to track updates, and to identify gaps, changes, and irregularities within and across collections. Assessment results can be explored through a web-based interactive graph. In this paper we introduce SC as an interdisciplinary enterprise, and illustrate its capabilities through its implementation in ASTRIAGraph, a space sustainability knowledge system.
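
    The ingestion-and-assessment idea can be sketched with the neo4j Python driver: map records from each collection onto a unifying graph model with provenance edges, then query across collections for gaps. The node labels, properties, and endpoint below are illustrative assumptions, not ASTRIAGraph's actual schema.

```python
# Hedged sketch: ingest records from different collections into one
# unifying graph model and query across them. Uses the real neo4j
# Python driver; labels, properties, and endpoint are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def ingest(record, source):
    with driver.session() as session:
        # MERGE keeps one node per object ID so overlapping collections
        # converge; the provenance edge records which collection
        # reported the object and when it was last seen.
        session.run(
            """
            MERGE (o:Object {object_id: $oid})
            SET o += $props
            MERGE (c:Collection {name: $source})
            MERGE (o)-[r:REPORTED_BY]->(c)
            SET r.last_seen = datetime()
            """,
            oid=record["id"], props=record["props"], source=source,
        )

def objects_missing_from(source):
    # Gap check: objects known to some collection but absent from `source`.
    with driver.session() as session:
        result = session.run(
            "MATCH (o:Object) WHERE NOT (o)-[:REPORTED_BY]->"
            "(:Collection {name: $source}) RETURN o.object_id",
            source=source,
        )
        return [r["o.object_id"] for r in result]
```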

    An Approach for Curating Collections of Historical Documents with the Use of Topic Detection Technologies

    Digital curation of materials available in large online repositories is required to enable the reuse of Cultural Heritage resources in specific activities like education or scientific research. The digitization of such valuable objects is an important task for making them accessible through digital platforms such as Europeana; ensuring the success of transcription campaigns via the Transcribathon platform is therefore highly important for this goal. Based on impact assessment results, people are more engaged in the transcription process if the content is oriented to specific themes, such as the First World War. Currently, efforts to group related documents into thematic collections are generally hand-crafted and, due to the large ingestion of new material, difficult to maintain and update. Current solutions based on text retrieval are not able to support the discovery of related content, since the existing collections are multi-lingual and contain heterogeneous items like postcards, letters, journals, photographs, etc. Technological advances in natural language understanding and in data management have led to the automation of document categorization via automatic topic detection. To use existing topic detection technologies on Europeana collections, several challenges must be addressed: (1) ensuring representative and high-quality training data, (2) ensuring the quality of the learned topics, and (3) providing efficient and scalable solutions for searching related content based on the automatically detected topics and for suggesting the most relevant topics for new items. This paper describes each such challenge and the proposed solutions in more detail, thus offering a novel perspective on how digital curation practices can be enhanced with the help of machine learning technologies.
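
    As a stand-in for the (unspecified) topic detection technologies discussed above, the sketch below runs scikit-learn's Latent Dirichlet Allocation over a few toy documents and prints the top words per learned topic, the raw material for building thematic collections. A real pipeline over multilingual Europeana transcriptions would need language-aware preprocessing first.

```python
# Minimal topic-detection sketch with scikit-learn's LDA; the training
# texts are placeholders, not Europeana data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "letter from the front trenches winter 1916",
    "postcard home from a soldier on leave",
    "hospital diary of a field nurse",
    "nurse notes on wounded soldiers in the ward",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # rows: documents, cols: topic weights

# Top words per learned topic, usable for suggesting thematic collections.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}:", ", ".join(top))
```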

    On the Reusability of Data Cleaning Workflows

    The goal of data cleaning is to make data fit for purpose, i.e., to improve data quality through updates and data transformations, such that downstream analyses can be conducted and lead to trustworthy results. A transparent and reusable data cleaning workflow can save time and effort through automation, and make subsequent data cleaning on new data less error-prone. However, the reusability of data cleaning workflows has received little to no attention in the research community. We identify some challenges and opportunities for reusing data cleaning workflows. We present a high-level conceptual model to clarify what we mean by reusability and propose ways to improve reusability along different dimensions. We use the opportunity of presenting at IDCC to invite the community to share their use cases, experiences, and desiderata for the reuse of data cleaning workflows and recipes in order to foster new collaborations and guide future work.
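
    One dimension along which reusability could plausibly be improved is parameterization: a recipe hard-wired to specific column names is difficult to transfer to new data, while factoring those names out yields a reusable module. The pandas sketch below illustrates that general point; it is not the conceptual model from the paper.

```python
# Sketch of one plausible reusability dimension: parameterization.
# Column names and cleaning rules here are illustrative only.
import pandas as pd

def normalize_text_column(df, column):
    """Reusable step: trim whitespace and unify case in one column."""
    out = df.copy()
    out[column] = out[column].str.strip().str.lower()
    return out

def make_recipe(*steps):
    """Compose parameterized steps into a reusable recipe."""
    def recipe(df):
        for step in steps:
            df = step(df)
        return df
    return recipe

# The same step is reused for different columns, and the composed
# recipe can be applied to any new dataset with those columns.
clean = make_recipe(
    lambda df: normalize_text_column(df, "author"),
    lambda df: normalize_text_column(df, "title"),
)

df = pd.DataFrame({"author": ["  Smith ", "DOE"], "title": [" A ", "b"]})
print(clean(df))
```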

    Uncommon Commons? Creative Commons Licencing in Horizon 2020 Data Management Plans

    As policies, good practices and mandates on research data management evolve, more emphasis has been put on the licencing of data, which allows potential re-users to quickly identify what they can do with the data in question. In this paper I analyse a pre-existing collection of 840 Horizon 2020 public data management plans (DMPs) to determine which ones mention Creative Commons licences and, among those that do, which licences are being used. I find that 36% of DMPs mention Creative Commons, and among those a number of different approaches towards licencing exist (an overall policy per project, licencing decisions per dataset, per partner, per data format, or per perceived stakeholder interest), often clad in rather vague language, with CC licences being “recommended” or “suggested”. Some DMPs also “kick the can further down the road” by stating that “a” CC licence will be used, but not which one. However, among those DMPs that do mention specific CC licences, a clear favourite emerges: the CC-BY licence, which accounts for half of all mentions of a specific licence. The fact that 64% of DMPs did not mention Creative Commons at all indicates the need for further training and awareness raising on data management in general, and licencing in particular, in Horizon Europe. Of those DMPs that do mention specific licences, 60% would be compliant with Horizon Europe requirements (CC-BY or CC0). However, it should be carefully monitored whether content similar to the 40% currently licenced under licences that are not Horizon Europe compliant will in future move to CC-BY or CC0, or whether such content will simply be kept fully closed by projects (by invoking the “as open as possible, as closed as necessary” principle), which would be an unintended and potentially damaging consequence of the policy.
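
    A scan of this kind can be approximated with a simple pattern match over DMP texts, counting which specific CC licences are mentioned while ignoring unspecific “a CC licence” statements. The sketch below is illustrative only and is not the method used in the paper.

```python
# Hedged sketch: count specific Creative Commons licence mentions in a
# set of DMP texts. The regex and sample texts are illustrative.
import re
from collections import Counter

CC_PATTERN = re.compile(
    r"\bCC[- ]?(BY(?:[- ]?(?:NC[- ]?SA|NC[- ]?ND|SA|NC|ND))?|0)\b",
    re.IGNORECASE,
)

def licences_mentioned(dmp_text):
    # Normalize matches like "cc by-nc" to "CC-BY-NC".
    return {
        m.group(0).upper().replace(" ", "-")
        for m in CC_PATTERN.finditer(dmp_text)
    }

dmps = [
    "Datasets will be released under a CC-BY licence where possible.",
    "A CC licence is recommended for all deliverables.",  # unspecific: no match
    "Partner data: CC0; survey data: CC BY-NC.",
]

counts = Counter(lic for text in dmps for lic in licences_mentioned(text))
print(counts)  # e.g. Counter({'CC-BY': 1, 'CC0': 1, 'CC-BY-NC': 1})
```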

    The Role of Data in an Emerging Research Community:

    Open science data benefit society by facilitating convergence across domains that are examining the same scientific problem. While cross-disciplinary data sharing and reuse is essential to the research done by convergent communities, so far little is known about the role data play in how these communities interact. An understanding of the role of data in these collaborations can help us identify and meet the needs of emerging research communities which may predict the next challenges faced by science. This paper represents an exploratory study of one emerging community, the environmental health community, examining how environmental health research groups form, collaborate, and share data. Five key insights about the role of data in emerging research communities are identified and suggestions are made for further research
