
    Digital provenance - models, systems, and applications

    Data provenance refers to the history of creation and manipulation of a data object and is widely used in application domains including scientific experiments, grid computing, file and storage systems, and streaming data. However, existing provenance systems operate at a single layer of abstraction (workflow/process/OS) at which they record and store provenance, whereas provenance captured from different layers provides the greatest benefit when integrated through a unified provenance framework. The first step in building such a framework is a comprehensive provenance model able to represent the provenance of data objects with various semantics and granularity. In this thesis, we propose such a comprehensive provenance model and present an abstract schema of the model. We further explore secure provenance solutions for distributed systems, namely streaming data, wireless sensor networks (WSNs), and virtualized environments. We design a customizable file provenance system with an application to the provenance infrastructure for virtualized environments. The system supports automatic collection and management of file provenance metadata, characterized by our provenance model. Based on the proposed provenance framework, we devise a mechanism for detecting data exfiltration attacks in a file system. We then turn to secure provenance communication in streaming environments and propose two secure provenance schemes focusing on WSNs. The basic provenance scheme is extended to detect packet-dropping adversaries on the data flow path over a period of time. We also consider the issue of attack recovery and present an extensive incident response and prevention system specifically designed for WSNs.
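
    The layered provenance model this abstract describes lends itself to a small illustration. Below is a minimal Python sketch of what a cross-layer provenance record might look like; all names and fields are illustrative assumptions, not the thesis's actual schema.

        # Minimal sketch of a multi-layer provenance record; field names are
        # illustrative, not the thesis's actual schema.
        from dataclasses import dataclass, field
        from datetime import datetime
        from typing import List

        @dataclass
        class ProvenanceRecord:
            object_id: str    # the data object this record describes
            layer: str        # capture layer: "workflow", "process", or "os"
            operation: str    # e.g. "create", "read", "write"
            agent: str        # user or process responsible
            timestamp: datetime
            inputs: List[str] = field(default_factory=list)  # source object ids

        def unified_history(records: List[ProvenanceRecord], object_id: str):
            # Integrating layers then reduces to merging records by object id
            # and ordering them in time, whichever layer captured them.
            return sorted((r for r in records if r.object_id == object_id),
                          key=lambda r: r.timestamp)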

    A provenance metadata model integrating ISO geospatial lineage and the OGC WPS : conceptual model and implementation

    Nowadays, there are still gaps in the description of provenance metadata. These gaps prevent the capture of comprehensive provenance that is useful for reuse and reproducibility. In addition, the lack of automated tools for capturing provenance hinders the broad generation and compilation of provenance information. This work presents a provenance engine (PE) that captures and represents provenance information using a combination of the Web Processing Service (WPS) standard and the ISO 19115 geospatial lineage model. The PE, developed within the MiraMon GIS & RS software, automatically records detailed information about sources and processes. The PE also includes a metadata editor that shows a graphical representation of the provenance and allows users to complement the provenance information by adding missing processes or deleting redundant process steps or sources, thus building a consistent geospatial workflow. One use case is presented to demonstrate the usefulness and effectiveness of the PE: the generation of a radiometric pseudo-invariant areas benchmark for the Iberian Peninsula. This remote-sensing use case shows how provenance can be captured automatically, even in a complex non-sequential flow, and the essential role it plays in automating and replicating tasks that involve very large amounts of geospatial data.
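
    To make the lineage structure concrete, here is a hedged Python sketch in the spirit of the ISO 19115 lineage model (sources feeding process steps); the field names are simplified stand-ins, not the PE's actual representation.

        # Simplified stand-in for an ISO 19115-style lineage record; not the
        # provenance engine's actual data model.
        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class Source:
            description: str

        @dataclass
        class ProcessStep:
            description: str                 # e.g. the WPS process that ran
            sources: List[Source] = field(default_factory=list)
            outputs: List[str] = field(default_factory=list)

        @dataclass
        class Lineage:
            statement: str
            process_steps: List[ProcessStep] = field(default_factory=list)

        lineage = Lineage(
            statement="Radiometric pseudo-invariant areas, Iberian Peninsula",
            process_steps=[ProcessStep(
                description="Radiometric normalization (one WPS execution)",
                sources=[Source("Input satellite scene")],
                outputs=["pseudo_invariant_areas.tif"])])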

    Modeling the Dashboard Provenance

    Organizations of all kinds, whether public or private, profit-driven or non-profit, and across various industries and sectors, rely on dashboards for effective data visualization. However, the reliability and efficacy of these dashboards depend on the quality of the visuals and data they present. Studies show that less than a quarter of dashboards provide information about their sources, which is just one of the metadata elements expected when provenance is taken seriously. Provenance is a record that describes the people, organizations, entities, and activities that had a role in the production, influence, or delivery of a piece of data or an object. This paper provides a provenance representation model, designed specifically for dashboards and their visual and data components, that enables standardization, modeling, generation, capture, and visualization. The proposed model offers a comprehensive set of essential provenance metadata that enables users to evaluate the quality, consistency, and reliability of the information presented on dashboards, allowing a clear and precise understanding of the context in which a specific dashboard was developed and ultimately leading to better decision-making.
    Comment: 8 pages, 4 figures, one table; to be published in VIS 2023 (Vis + Prov) x Domain
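
    A widely used vocabulary for records of exactly this kind (entities, activities, agents) is the W3C PROV model; the abstract does not say whether the paper builds on it, so treat the following sketch with the Python prov package purely as context. The dashboard names are invented, and this is not the paper's actual model.

        # Toy PROV record for a dashboard; identifiers are invented and the
        # Python "prov" package is an assumed dependency (pip install prov).
        from prov.model import ProvDocument

        doc = ProvDocument()
        doc.add_namespace("ex", "http://example.org/")

        doc.entity("ex:sales-dashboard")            # the dashboard itself
        doc.entity("ex:sales-db-extract")           # one of its data sources
        doc.activity("ex:build-dashboard")
        doc.agent("ex:analyst")

        doc.used("ex:build-dashboard", "ex:sales-db-extract")
        doc.wasGeneratedBy("ex:sales-dashboard", "ex:build-dashboard")
        doc.wasAssociatedWith("ex:build-dashboard", "ex:analyst")

        print(doc.get_provn())                      # human-readable PROV-N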

    Data Provenance Inference in Machine Learning

    Unintended memorization at various granularities of information has garnered academic attention in recent years, e.g. membership inference and property inference. How to use this privacy leakage in reverse to facilitate real-world applications is a growing direction; current efforts include dataset ownership inference and user auditing. Standing on the data lifecycle and ML model production, we propose an inference process named Data Provenance Inference, which infers the generation, collection, or processing properties of the ML training data, to assist ML developers in locating training data gaps without maintaining strenuous metadata. We formally define data provenance and the data provenance inference task in ML training. We then propose a novel inference strategy combining embedded-space multiple instance classification and shadow learning. Comprehensive evaluations cover language, visual, and structured data in black-box and white-box settings, with diverse kinds of data provenance (i.e. business, county, movie, user). Our best inference accuracy reaches 98.96% on the white-box text model when "author" is the data provenance. The experimental results indicate that, in general, inference performance is positively correlated with the amount of reference data available for inference and with the depth and parameter count of the accessed layer. Furthermore, we give a post-hoc statistical analysis of the data provenance definition to explain when our proposed method works well.
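
    The shadow learning mentioned above has a well-known general shape: train auxiliary models on data with known properties, then learn a meta-classifier over their behavior. The sketch below shows that shape with an averaged multiple-instance (bag) embedding; the names and the use of scikit-learn are assumptions, and the authors' embedded-space classifier is more involved.

        # Generic shadow-learning skeleton, not the paper's exact method.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def bag_embedding(instance_embeddings: np.ndarray) -> np.ndarray:
            # Multiple-instance step: collapse a bag of per-sample embeddings
            # (one bag per candidate provenance value) into one vector.
            return instance_embeddings.mean(axis=0)

        def train_inferrer(shadow_bags, shadow_labels):
            # shadow_bags: (n_i, d) arrays taken from shadow models whose
            # training-data provenance is known; shadow_labels: 1 if that
            # provenance value was present in the training data, else 0.
            X = np.stack([bag_embedding(b) for b in shadow_bags])
            return LogisticRegression(max_iter=1000).fit(X, shadow_labels)

        # At inference time: embed reference data through the target model,
        # aggregate the same way, and query the meta-classifier.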

    A Linked Data Approach to Sharing Workflows and Workflow Results

    A bioinformatics analysis pipeline is often highly elaborate, owing to the inherent complexity of biological systems and the variety and size of datasets. A digital equivalent of the ‘Materials and Methods’ section of wet-laboratory publications would be highly beneficial to bioinformatics, for evaluating evidence and examining data across related experiments, while introducing the potential to find associated resources and integrate them as data and services. We present initial steps towards preserving bioinformatics ‘materials and methods’ by exploiting the workflow paradigm to capture the design of a data analysis pipeline, and RDF to link the workflow, its component services, run-time provenance, and a personalized biological interpretation of the results. An example shows the reproduction of the unique graph of an analysis procedure, its results, provenance, and personal interpretation of a text mining experiment. It links data from Taverna, myExperiment.org, BioCatalogue.org, and ConceptWiki.org. The approach is relatively ‘light-weight’ and unobtrusive for bioinformatics users.
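
    The RDF linking described here can be pictured with a few triples. The sketch below uses rdflib with an invented vocabulary; the paper's actual predicates and URIs (drawn from Taverna, myExperiment.org, BioCatalogue.org, and ConceptWiki.org) differ.

        # Illustrative linked-data graph; EX and its predicates are invented.
        from rdflib import Graph, Literal, Namespace

        EX = Namespace("http://example.org/")
        g = Graph()
        g.bind("ex", EX)

        g.add((EX.run1, EX.executedWorkflow, EX.workflow1))   # run-time provenance
        g.add((EX.result1, EX.generatedBy, EX.run1))          # result <- run
        g.add((EX.note1, EX.interprets, EX.result1))          # biologist's reading
        g.add((EX.note1, EX.comment, Literal("Personal interpretation of result")))

        print(g.serialize(format="turtle"))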

    Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv

    Background: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). However, several challenges remain for the effective sharing, publication and reproducibility of such workflows, due to the incomplete capture of provenance and the lack of interoperability between different technical (software) platforms. Results: Based on best-practice recommendations identified from the literature on workflow design, sharing and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and to support comprehensive, fully re-executable workflows equipped with domain-specific information. To realise this framework, we present CWLProv, a standards-based format for representing any workflow-based computational analysis so that the workflow output artefacts satisfy the various levels of provenance. We utilise open-source, community-driven standards: interoperable workflow definitions in the Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric Research Objects (ROs) generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and an evaluation using real-life genomic workflows developed by independent groups. Conclusions: The underlying principles of the standards utilised by CWLProv enable semantically rich and executable Research Objects that capture computational workflows with retrospective provenance, such that any platform supporting CWL will be able to understand the analysis, re-use the methods for partial re-runs, or reproduce the analysis to validate the published findings.
    Submitted to GigaScience (GIGA-D-18-00483)
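
    As a rough picture of what a workflow-centric Research Object aggregates, the sketch below walks a directory laid out along the lines CWLProv describes (workflow definition, provenance, data); treat the exact folder names as assumptions rather than the normative layout.

        # Hypothetical summary of a CWLProv-style Research Object folder;
        # the directory names are assumptions, not the specified format.
        from pathlib import Path

        def summarize_research_object(ro_root: str) -> dict:
            root = Path(ro_root)
            return {
                "workflow definitions": sorted(p.name for p in root.glob("workflow/*.cwl")),
                "provenance records":   sorted(p.name for p in root.glob("metadata/provenance/*")),
                "data artefacts":       sorted(p.name for p in root.glob("data/*")),
            }

        # Example: summarize_research_object("my_analysis_ro/")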

    ENA as an Information Hub

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/) is a comprehensive repository for public nucleotide sequence data from nearly four hundred thousand taxonomic nodes. Together with our partners in the International Nucleotide Sequence Database Collaboration (INSDC; EBI, NCBI and DDBJ), we provide a broad spectrum of sequences, from raw reads (Sequence Read Archive data class) and assembled contigs (Whole Genome Shotgun data class), through assemblies of EST transcripts (Transcriptome Shotgun Assembly data class), to partial or complete assembled nucleic acid molecules with functional annotation derived from direct and third-party experimental evidence (Standard and TPA data classes, respectively). Resources beyond ENA, such as RNA and protein databases, genome collections and model organism services, use data stored and presented at ENA both as a source and as underlying supporting evidence for their records. Integration of the growing wealth of molecular information is a great challenge that brings opportunities for ENA to serve as a bioinformatics information hub: through its provision of permanent identifiers for sequence and project records, it offers community-recognized identifiers for navigation across databases.

    As a comprehensive repository of directly sequenced nucleic acid molecules, we have the unique opportunity to obtain exact provenance information directly from the submitting researchers. Our pre-publication biocuration efforts focus on obtaining rich and accurate information about the sample that has been sequenced and about the methodology surrounding its preparation for sequencing. We present here an insight into data flow in the archive and a straightforward, biologist-orientated submission system with a rule-based validator for smaller sets of sequences.
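
    As an illustration of what a rule-based validator for sequence submissions might look like, here is a small Python sketch; the rules and field names are invented examples, not ENA's actual checks.

        # Toy rule-based validator; the rules are illustrative, not ENA's.
        import re

        RULES = [
            ("sequence uses only IUPAC nucleotide codes",
             lambda rec: bool(re.fullmatch(r"[ACGTRYSWKMBDHVN]+",
                                           rec.get("sequence", ""), re.I))),
            ("a taxon is given for the sequenced sample",
             lambda rec: bool(rec.get("taxon"))),
        ]

        def validate(record: dict) -> list:
            """Return descriptions of the rules a submission record violates."""
            return [rule for rule, ok in RULES if not ok(record)]

        print(validate({"sequence": "ACGTN"}))
        # -> ["a taxon is given for the sequenced sample"]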

    Sciunits: Reusable Research Objects

    Science is conducted collaboratively, often requiring knowledge sharing about computational experiments. When experiments include only datasets, they can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom includes only datasets; more often it also includes software, its past executions, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregating and identifying the diverse elements of computational experiments. While necessary, mere aggregation is not sufficient for sharing computational experiments: other users must be able to easily recompute on these shared research objects. In this paper, we present the sciunit, a reusable research object in which the aggregated content is recomputable. We describe a Git-like client that efficiently creates, stores, and repeats sciunits. We show through analysis that sciunits repeat computational experiments with minimal storage and processing overhead. Finally, we provide an overview of sharing and reproducible cyberinfrastructure based on sciunits that is gaining adoption in the domain of geosciences.
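
    The "minimal storage" claim rests on content addressing, the same idea Git uses. Here is a toy Python sketch of that mechanism; it illustrates the principle only and is not the sciunit client's actual code.

        # Toy content-addressed store: identical captured files are kept once.
        import hashlib
        import os

        def store(store_dir: str, blob: bytes) -> str:
            digest = hashlib.sha256(blob).hexdigest()
            path = os.path.join(store_dir, digest)
            if not os.path.exists(path):       # deduplication = minimal storage
                with open(path, "wb") as f:
                    f.write(blob)
            return digest                      # id later used to repeat the run

        os.makedirs("/tmp/sciunit-store", exist_ok=True)
        print(store("/tmp/sciunit-store", b"captured input file"))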