26,092 research outputs found

    UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems

    A crucial challenge for scientific workflow management systems is to support the efficient and scalable storage and querying of large provenance datasets that record the history of in silico experiments. As new provenance management systems are developed, it is important to have benchmarks that can evaluate these systems and provide an unbiased comparison. In this paper, based on the requirements for scientific workflow provenance systems, we design an extensible benchmark that features a collection of techniques and tools for workload generation, query selection, performance measurement, and experimental result interpretation.
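    The abstract does not spell out UTPB's tooling, but the overall shape of such a driver can be sketched: generate a synthetic provenance workload, select a representative lineage query, and time it. A minimal Python sketch, in which the workload model, function names, and sizes are illustrative assumptions rather than anything from the paper:

        import random
        import time
        from collections import defaultdict

        def generate_workload(n_items: int, max_parents: int = 3, seed: int = 42):
            """Synthetic provenance DAG: each item derives from earlier items."""
            rng = random.Random(seed)
            parents = defaultdict(list)
            for item in range(1, n_items):
                for _ in range(rng.randint(0, max_parents)):
                    parents[item].append(rng.randrange(item))
            return parents

        def ancestors(parents, item):
            """Transitive lineage of `item` -- a common provenance query shape."""
            seen, stack = set(), [item]
            while stack:
                node = stack.pop()
                for p in parents[node]:
                    if p not in seen:
                        seen.add(p)
                        stack.append(p)
            return seen

        parents = generate_workload(100_000)
        start = time.perf_counter()
        lineage = ancestors(parents, 99_999)
        print(f"{len(lineage)} ancestors in {time.perf_counter() - start:.3f}s")

    A real benchmark would sweep dataset sizes and query classes and report distributions rather than a single timing, but the driver skeleton is the same.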

    Provenance-enabled Packet Path Tracing in the RPL-based Internet of Things

    The interconnection of resource-constrained and globally accessible things with the untrusted and unreliable Internet makes them vulnerable to attacks, including data forging, false data injection, and packet drops, that affect applications with critical decision-making processes. For data trustworthiness, reliance on provenance is considered an effective mechanism that tracks both data acquisition and data transmission. However, provenance management for sensor networks introduces several challenges, such as low energy, bandwidth consumption, and efficient storage. This paper attempts to identify packet drops (whether malicious or due to network disruptions) and detect faulty or misbehaving nodes in the Routing Protocol for Low-Power and Lossy Networks (RPL) through a bi-fold provenance-enabled packet path tracing (PPPT) approach. First, system-level ordered provenance information encapsulates the data-generating and forwarding nodes in the data packet. Second, to closely monitor dropped packets, node-level provenance in the form of the packet sequence number is enclosed as a routing entry in the routing table of each participating node. Lossless in nature, both approaches conserve provenance size, satisfying the processing and storage requirements of IoT devices. Finally, we evaluate the efficacy of the proposed scheme with respect to provenance size, provenance generation time, and energy consumption.
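    As a rough illustration of the two provenance layers described above, the sketch below keeps an ordered node path inside each packet (system-level) and a per-source sequence counter at each node (node-level), flagging a sequence gap as a possible drop. All field names and structures are assumptions for illustration, not the paper's actual compressed encoding:

        from dataclasses import dataclass, field

        @dataclass
        class Packet:
            seq: int
            payload: bytes
            path: list[int] = field(default_factory=list)  # ordered provenance

        class Node:
            def __init__(self, node_id: int):
                self.node_id = node_id
                self.last_seen: dict[int, int] = {}  # node-level: src -> last seq

            def forward(self, pkt: Packet, src: int) -> Packet:
                # A gap in sequence numbers suggests a drop upstream of us.
                expected = self.last_seen.get(src, pkt.seq - 1) + 1
                if pkt.seq != expected:
                    print(f"node {self.node_id}: {pkt.seq - expected} packet(s) "
                          f"missing from {src}")
                self.last_seen[src] = pkt.seq
                pkt.path.append(self.node_id)  # append self to the ordered path
                return pkt

        n = Node(7)
        n.forward(Packet(seq=1, payload=b"t=21C", path=[5]), src=5)
        n.forward(Packet(seq=3, payload=b"t=22C", path=[5]), src=5)  # seq 2 dropped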

    A Forensic Enabled Data Provenance Model for Public Cloud

    Cloud computing is a newly emerging technology where storage, computation, and services are extensively shared among a large number of users through virtualization and distributed computing. This technology makes the process of detecting the physical location or ownership of a particular piece of data even more complicated. As a result, improvements in data provenance techniques became necessary. Provenance refers to the record describing the origin and other historical information about a piece of data. An advanced data provenance system will give forensic investigators a transparent idea about the data's lineage and help to resolve disputes over controversial pieces of data by providing digital evidence. In this paper, the challenges of cloud architecture are identified, how these affect existing forensic analysis and provenance techniques is discussed, and a model for efficient provenance collection and forensic analysis is proposed.
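    The abstract proposes a model rather than a concrete mechanism, but one common way to make provenance records forensically useful is to hash-chain them so tampering is detectable. A speculative minimal sketch under that assumption; the fields and the chaining scheme are illustrative, not taken from the paper:

        import hashlib
        import json
        import time

        def make_record(prev_hash: str, actor: str, action: str, obj: str) -> dict:
            record = {
                "ts": time.time(),
                "actor": actor,      # who touched the data
                "action": action,    # e.g. "create", "read", "modify"
                "object": obj,       # which cloud object
                "prev": prev_hash,   # hash-chain link to the previous record
            }
            record["hash"] = hashlib.sha256(
                json.dumps(record, sort_keys=True).encode()
            ).hexdigest()
            return record

        genesis = make_record("0" * 64, "alice", "create", "s3://bucket/report.csv")
        update = make_record(genesis["hash"], "bob", "modify", "s3://bucket/report.csv")

    Any edit to an earlier record changes its hash and breaks every later link, which is what lets such records serve as digital evidence.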

    Towards Exascale Scientific Metadata Management

    Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually in an ad hoc manner by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet a ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset to permeate metadata-oblivious processes and to query metadata through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling, and neuroscience, and then discuss research challenges and possible solutions.
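    One concrete way to realize point (2), storing metadata within each dataset and querying it through established interfaces, is a self-describing format such as HDF5. A small sketch using h5py; the attribute names are illustrative, not a proposed standard:

        import h5py
        import numpy as np

        # Embed provenance-style metadata inside the dataset itself, so any
        # HDF5-aware tool can read it through the standard access interface.
        with h5py.File("simulation_output.h5", "w") as f:
            dset = f.create_dataset("temperature", data=np.random.rand(1024, 1024))
            dset.attrs["producer"] = "plasma_sim v2.1"
            dset.attrs["input_deck"] = "run_0042.toml"
            dset.attrs["created_by_rank"] = 0

        # Query the embedded metadata later, with no side-channel catalog needed.
        with h5py.File("simulation_output.h5", "r") as f:
            for key, value in f["temperature"].attrs.items():
                print(key, "=", value)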

    bdbms -- A Database Management System for Biological Data

    Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) annotation and provenance management, including storage, indexing, manipulation, and querying of annotation and provenance as first-class objects in bdbms; (2) local dependency tracking, to track the dependencies and derivations among data items; (3) update authorization, to support data curation via content-based authorization, in contrast to identity-based authorization; and (4) new access methods and their supporting operators that support pattern matching on various types of compressed biological data. This paper presents the design of bdbms along with the techniques proposed to support these functionalities, including an extension to SQL. We also outline some open issues in building bdbms.
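    bdbms's actual SQL extension is not reproduced here, but the idea behind point (1), annotations traveling with data as first-class objects through query operators, can be illustrated with a toy relational operator in Python (all names hypothetical):

        rows = [
            {"gene": "BRCA1", "expr": 7.2, "_ann": ["curated by lab A"]},
            {"gene": "TP53",  "expr": 3.1, "_ann": ["imported from GEO"]},
        ]

        def select(rows, pred, note):
            """Selection that carries annotations along and records the step."""
            out = []
            for r in rows:
                if pred(r):
                    r2 = dict(r)
                    r2["_ann"] = r["_ann"] + [note]  # provenance rides with the row
                    out.append(r2)
            return out

        high = select(rows, lambda r: r["expr"] > 5.0, "filtered: expr > 5.0")
        print(high)

    The point of making annotations first-class is exactly this propagation: every operator's output carries both the data and the history of how it got there.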

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference, and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
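    DataHub's internals are not shown in the abstract; as a minimal sketch of the dataset version control idea, with commits and branches in the spirit of git (merge and diff omitted for brevity, and all names hypothetical):

        import copy

        class VersionedDataset:
            def __init__(self, data):
                self.commits = [copy.deepcopy(data)]  # commit 0
                self.branches = {"main": 0}           # branch name -> commit index

            def commit(self, branch, data):
                self.commits.append(copy.deepcopy(data))
                self.branches[branch] = len(self.commits) - 1

            def branch(self, new, frm="main"):
                self.branches[new] = self.branches[frm]  # cheap: share history

            def checkout(self, branch):
                return copy.deepcopy(self.commits[self.branches[branch]])

        ds = VersionedDataset([{"id": 1, "label": "cat"}])
        ds.branch("cleanup")
        rows = ds.checkout("cleanup")
        rows[0]["label"] = "dog"
        ds.commit("cleanup", rows)  # "main" still sees the original rows

    The scale challenge the paper highlights starts exactly where this toy stops: storing many divergent versions without duplicating whole datasets per commit.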

    A unified framework for managing provenance information in translational research

    Background: A critical aspect of the NIH Translational Research roadmap, which seeks to accelerate the delivery of "bench-side" discoveries to the patient's "bedside," is the management of the provenance metadata that keeps track of the origin and history of data resources as they traverse the path from the bench to the bedside and back. A comprehensive provenance framework is essential for researchers to verify the quality of data, reproduce scientific results published in peer-reviewed literature, validate the scientific process, and associate trust values with data and results. Traditional approaches to provenance management have focused on only partial sections of the translational research life cycle, and they do not incorporate "domain semantics," which is essential to support domain-specific querying and analysis by scientists.
    Results: We identify a common set of challenges in managing provenance information across the pre-publication and post-publication phases of data in the translational research lifecycle. We define the semantic provenance framework (SPF), underpinned by the Provenir upper-level provenance ontology, to address these challenges in the four stages of provenance metadata:
    (a) Provenance collection, during data generation;
    (b) Provenance representation, to support interoperability and reasoning and to incorporate domain semantics;
    (c) Provenance storage and propagation, to allow efficient storage and seamless propagation of provenance as the data is transferred across applications;
    (d) Provenance query, to support queries of increasing complexity over large data sizes and to support knowledge discovery applications.
    We apply the SPF to two exemplar translational research projects, namely the Semantic Problem Solving Environment for Trypanosoma cruzi (T. cruzi SPSE) and the Biomedical Knowledge Repository (BKR) project, to demonstrate its effectiveness.
    Conclusions: The SPF provides a unified framework to effectively manage provenance of translational research data during the pre- and post-publication phases. This framework is underpinned by an upper-level provenance ontology called Provenir that is extended to create domain-specific provenance ontologies to facilitate provenance interoperability, seamless propagation of provenance, automated querying, and analysis.
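    Stage (b), provenance representation with domain semantics, is typically realized as RDF triples under an upper-level ontology extended with domain terms. A small rdflib sketch; the Provenir and T. cruzi IRIs and property names below are placeholders, not the ontology's actual terms:

        from rdflib import Graph, Namespace, Literal, RDF

        PROVENIR = Namespace("http://example.org/provenir#")  # placeholder IRI
        TCRUZI = Namespace("http://example.org/tcruzi#")      # placeholder IRI

        g = Graph()
        sample = TCRUZI["sample_42"]
        assay = TCRUZI["microarray_run_7"]

        # Domain entities typed against upper-level provenance classes.
        g.add((sample, RDF.type, PROVENIR["data"]))
        g.add((assay, RDF.type, PROVENIR["process"]))
        g.add((sample, PROVENIR["derived_from"], assay))
        g.add((assay, PROVENIR["performed_by"], Literal("bench scientist")))

        # Provenance query: which process did this sample come from?
        for s, p, o in g.triples((sample, PROVENIR["derived_from"], None)):
            print(f"{s} derived from {o}")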

    ForensiBlock: A Provenance-Driven Blockchain Framework for Data Forensics and Auditability

    Maintaining accurate provenance records is paramount in digital forensics, as they underpin evidence credibility and integrity, addressing essential aspects like accountability and reproducibility. Blockchains have several properties that can address these requirements. Previous systems utilized public blockchains, i.e., treated the blockchain as a black box and benefited from its immutability property. However, the blockchain was accessible to everyone, giving rise to security concerns; moreover, efficient extraction of provenance faces challenges due to the enormous scale and complexity of digital data. This necessitates a blockchain design tailored to digital forensics. Our solution, ForensiBlock, has a novel design that automates investigation steps, ensures secure data access, traces data origins, preserves records, and expedites provenance extraction. ForensiBlock incorporates Role-Based Access Control with Staged Authorization (RBAC-SA) and a distributed Merkle root for case tracking. These features support authorized resource access with efficient retrieval of provenance records. In particular, comparing two methods for extracting provenance records (off-chain storage retrieval with Merkle root verification versus a brute-force search), the off-chain method is significantly better, especially as the blockchain size and the number of cases increase. We also found that our distributed Merkle root creation slightly increases smart contract processing time but significantly improves history access. Overall, we show that ForensiBlock offers secure, efficient, and reliable handling of digital forensic data.
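    To illustrate the Merkle root idea behind the case tracking described above, the sketch below commits a set of case records to a single root, so that membership of any record can later be verified without scanning the chain. The construction details are generic assumptions, not ForensiBlock's actual scheme:

        import hashlib

        def sha256(b: bytes) -> bytes:
            return hashlib.sha256(b).digest()

        def merkle_root(leaves: list[bytes]) -> bytes:
            """Pairwise-hash leaves upward until a single root remains."""
            level = [sha256(leaf) for leaf in leaves]
            while len(level) > 1:
                if len(level) % 2:            # duplicate last node on odd levels
                    level.append(level[-1])
                level = [sha256(level[i] + level[i + 1])
                         for i in range(0, len(level), 2)]
            return level[0]

        records = [b"case42:evidence-ingested", b"case42:hash-verified",
                   b"case42:report-filed"]
        print(merkle_root(records).hex())

    Verifying one record against the root then needs only a logarithmic-size path of sibling hashes, which is why the off-chain retrieval with root verification scales better than a brute-force search of the chain.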