UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems
A crucial challenge for scientific workflow management systems is to support the efficient and scalable storage and querying of large provenance datasets that record the history of in silico experiments. As new provenance management systems are developed, it is important to have benchmarks that can evaluate these systems and provide an unbiased comparison. In this paper, based on the requirements for scientific workflow provenance systems, we design an extensible benchmark that features a collection of techniques and tools for workload generation, query selection, performance measurement, and experimental result interpretation.
Provenance-enabled Packet Path Tracing in the RPL-based Internet of Things
The interconnection of resource-constrained and globally accessible things
with an untrusted and unreliable Internet makes them vulnerable to attacks,
including data forging, false data injection, and packet drops, that affect
applications with critical decision-making processes. For data trustworthiness,
reliance on provenance is considered an effective mechanism that tracks
both data acquisition and data transmission. However, provenance management for
sensor networks introduces several challenges, such as limited energy, bandwidth
consumption, and efficient storage. This paper attempts to identify packet drops
(whether malicious or due to network disruptions) and detect faulty or
misbehaving nodes in the Routing Protocol for Low-Power and Lossy Networks
(RPL) by following a bi-fold provenance-enabled packet path tracing (PPPT)
approach. First, system-level ordered-provenance information encapsulates
the data-generating nodes and the forwarding nodes in the data packet.
Second, to closely monitor the dropped packets, node-level provenance in
the form of the packet sequence number is enclosed as a routing entry in the
routing table of each participating node. Lossless in nature, both approaches
conserve the provenance size, satisfying the processing and storage requirements of
IoT devices. Finally, we evaluate the efficacy of the proposed scheme with
respect to provenance size, provenance generation time, and energy consumption.
Comment: 14 pages, 18 figures
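The node-level half of the scheme can be pictured with a small sketch: each node records the last-seen packet sequence number per upstream source in its routing table, so a gap between consecutive sequence numbers reveals a suspected drop. This is an illustrative reconstruction only; the class and method names are invented, not the paper's actual implementation.

```python
# Hypothetical sketch of node-level provenance: the routing table keeps the
# last forwarded sequence number per source, and a gap flags dropped packets.

class RplNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.routing_table = {}  # source_id -> last forwarded sequence number

    def forward(self, source_id, seq):
        """Record provenance for a packet and report any sequence gap."""
        last = self.routing_table.get(source_id)
        dropped = [] if last is None else list(range(last + 1, seq))
        self.routing_table[source_id] = seq
        return dropped  # sequence numbers that never arrived at this node

node = RplNode("n1")
node.forward("sensor-7", 1)
node.forward("sensor-7", 2)
missing = node.forward("sensor-7", 5)  # packets 3 and 4 were never seen
```

Because only one integer per source is stored, the provenance footprint stays constant regardless of traffic volume, which matches the abstract's claim about conserving provenance size on constrained devices.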
A Forensic Enabled Data Provenance Model for Public Cloud
Cloud computing is a newly emerging technology where storage, computation and services are extensively shared among a large number of users through virtualization and distributed computing. This technology makes the process of detecting the physical location or ownership of a particular piece of data even more complicated. As a result, improvements in data provenance techniques became necessary. Provenance refers to the record describing the origin and other historical information about a piece of data. An advanced data provenance system will give forensic investigators a transparent idea about the data's lineage, and help to resolve disputes over controversial pieces of data by providing digital evidence. In this paper, the challenges of cloud architecture are identified, how this affects the existing forensic analysis and provenance techniques is discussed, and a model for efficient provenance collection and forensic analysis is proposed.
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still need
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions.
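The second part of the vision, storing metadata within each dataset so it survives metadata-oblivious processing steps, can be sketched in miniature. The wrapper below is an illustrative toy, not the authors' system: it carries rich metadata alongside the values and records lineage as transformations are applied.

```python
# Illustrative sketch (not the paper's infrastructure): a dataset wrapper that
# carries metadata with the data and propagates it through transformations, so
# downstream steps that know nothing about metadata never strip it away.

class AnnotatedDataset:
    def __init__(self, values, metadata):
        self.values = list(values)
        self.metadata = dict(metadata)  # e.g. provenance, units, instrument

    def map(self, fn, step_name):
        """Apply a transformation and record it in the metadata lineage."""
        out = AnnotatedDataset([fn(v) for v in self.values], self.metadata)
        lineage = list(self.metadata.get("lineage", [])) + [step_name]
        out.metadata["lineage"] = lineage
        return out

raw = AnnotatedDataset([1.0, 2.0], {"source": "plasma_sim_run_42", "units": "keV"})
calibrated = raw.map(lambda v: v * 1.5, "calibrate")
# "source" and "units" survive the transformation; "lineage" now records it
```

In practice the same idea is realized through established data access interfaces (for example, attributes embedded in self-describing file formats) so that standard tooling can query the metadata.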
bdbms -- A Database Management System for Biological Data
Biologists are increasingly using databases for storing and managing their
data. Biological databases typically consist of a mixture of raw data,
metadata, sequences, annotations, and related data obtained from various
sources. Current database technology lacks several functionalities that are
needed by biological databases. In this paper, we introduce bdbms, an
extensible prototype database management system for supporting biological data.
bdbms extends the functionalities of current DBMSs to include: (1) Annotation
and provenance management including storage, indexing, manipulation, and
querying of annotation and provenance as first class objects in bdbms, (2)
Local dependency tracking to track the dependencies and derivations among data
items, (3) Update authorization to support data curation via content-based
authorization, in contrast to identity-based authorization, and (4) New access
methods and their supporting operators that support pattern matching on various
types of compressed biological data types. This paper presents the design of
bdbms along with the techniques proposed to support these functionalities
including an extension to SQL. We also outline some open issues in building
bdbms.
Comment: This article is published under a Creative Commons License Agreement
(http://creativecommons.org/licenses/by/2.5/). You may copy, distribute,
display, and perform the work, make derivative works and make commercial use
of the work, but you must attribute the work to the author and CIDR 2007,
3rd Biennial Conference on Innovative Data Systems Research (CIDR), January
7-10, 2007, Asilomar, California, USA.
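The idea of treating annotations and provenance as first-class, queryable objects can be illustrated with a toy relational schema. This is a hypothetical sketch, not bdbms's actual design or SQL extension: annotations live in their own table, attached to individual data cells, so they can be joined and queried like ordinary data.

```python
import sqlite3

# Hypothetical illustration of annotation management as first-class data:
# each annotation row points at a specific cell (row + column) and carries
# its own provenance. This is not bdbms's actual schema.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT, length INTEGER);
    CREATE TABLE annotations (
        gene_id INTEGER REFERENCES genes(id),
        column_name TEXT,   -- which cell the note is attached to
        note TEXT,
        source TEXT         -- provenance of the annotation itself
    );
""")
conn.execute("INSERT INTO genes VALUES (1, 'BRCA1', 81189)")
conn.execute("INSERT INTO annotations VALUES "
             "(1, 'length', 'updated from build 38', 'curator:alice')")

# Retrieve a value together with every annotation attached to that cell.
row = conn.execute("""
    SELECT g.name, g.length, a.note, a.source
    FROM genes g JOIN annotations a
      ON a.gene_id = g.id AND a.column_name = 'length'
""").fetchone()
```

An SQL extension of the kind the paper describes would make this attachment implicit rather than requiring an explicit join in every query.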
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale.
Comment: 7 pages
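The git-inspired model of dataset versioning can be sketched with a minimal commit graph: each commit snapshots records keyed by id, parent links enable branching, and differencing compares two snapshots. The class and its structure are illustrative assumptions, not DataHub's implementation.

```python
# A minimal sketch of git-style dataset versioning: commits store snapshots
# keyed by record id, with parent links enabling branches and diffs.
# Names and storage layout are illustrative only.

class DatasetRepo:
    def __init__(self):
        self.commits = {}   # commit_id -> (parent_id, {record_id: value})
        self.branches = {}  # branch name -> head commit_id
        self._next = 0

    def commit(self, branch, records):
        parent = self.branches.get(branch)
        cid = self._next
        self._next += 1
        self.commits[cid] = (parent, dict(records))
        self.branches[branch] = cid
        return cid

    def diff(self, a, b):
        """Record ids whose values differ between two commits."""
        da, db = self.commits[a][1], self.commits[b][1]
        return {k for k in set(da) | set(db) if da.get(k) != db.get(k)}

repo = DatasetRepo()
v1 = repo.commit("main", {"r1": 10, "r2": 20})
repo.branches["exp"] = v1                       # branch off main
v2 = repo.commit("exp", {"r1": 10, "r2": 25, "r3": 30})
changed = repo.diff(v1, v2)                     # r2 modified, r3 added
```

Storing full snapshots per commit is the naive choice here; at the scale the abstract targets, a real system would deduplicate shared records across versions.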
A unified framework for managing provenance information in translational research
Background: A critical aspect of the NIH Translational Research roadmap, which seeks to accelerate the delivery of "bench-side" discoveries to the patient's "bedside," is the management of the provenance metadata that keeps track of the origin and history of data resources as they traverse the path from the bench to the bedside and back. A comprehensive provenance framework is essential for researchers to verify the quality of data, reproduce scientific results published in peer-reviewed literature, validate the scientific process, and associate trust values with data and results. Traditional approaches to provenance management have focused on only partial sections of the translational research life cycle, and they do not incorporate "domain semantics", which is essential to support domain-specific querying and analysis by scientists.
Results: We identify a common set of challenges in managing provenance information across the pre-publication and post-publication phases of data in the translational research lifecycle. We define the semantic provenance framework (SPF), underpinned by the Provenir upper-level provenance ontology, to address these challenges in the four stages of provenance metadata:
(a) Provenance collection - during data generation
(b) Provenance representation - to support interoperability and reasoning, and to incorporate domain semantics
(c) Provenance storage and propagation - to allow efficient storage and seamless propagation of provenance as the data is transferred across applications
(d) Provenance query - to support queries of increasing complexity over large data sizes and also support knowledge-discovery applications
We apply the SPF to two exemplar translational research projects, namely the Semantic Problem Solving Environment for Trypanosoma cruzi (T. cruzi SPSE) and the Biomedical Knowledge Repository (BKR) project, to demonstrate its effectiveness.
Conclusions: The SPF provides a unified framework to effectively manage provenance of translational research data during the pre- and post-publication phases. This framework is underpinned by an upper-level provenance ontology called Provenir, which is extended to create domain-specific provenance ontologies to facilitate provenance interoperability, seamless propagation of provenance, automated querying, and analysis.
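The role of domain semantics in provenance querying can be illustrated with a toy triple store: a generic provenance predicate is accompanied by a domain-specific one, so queries can ask domain-level questions. The predicates and identifiers below are invented for illustration; they are not taken from the Provenir ontology.

```python
# A toy illustration (not the Provenir ontology): provenance stored as
# subject-predicate-object triples, where a domain-specific predicate
# refines a generic "derivedFrom" relation.

triples = [
    ("sample_12", "rdf:type", "tc:ParasiteSample"),       # domain class
    ("knockout_7", "tc:derivedBySelection", "sample_12"),  # domain predicate
    ("knockout_7", "prov:derivedFrom", "sample_12"),       # generic provenance
]

def query(pattern):
    """Match triples against a (subject, predicate, object) pattern;
    None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Domain-specific provenance query: what was knockout_7 derived from
# specifically by selection (as opposed to any generic derivation)?
hits = query(("knockout_7", "tc:derivedBySelection", None))
```

The point of extending an upper-level ontology, as the abstract describes, is exactly this: generic provenance queries keep working while domain scientists gain predicates that match their own vocabulary.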
ForensiBlock: A Provenance-Driven Blockchain Framework for Data Forensics and Auditability
Maintaining accurate provenance records is paramount in digital forensics, as
they underpin evidence credibility and integrity, addressing essential aspects
like accountability and reproducibility. Blockchains have several properties
that can address these requirements. Previous systems utilized public
blockchains, i.e., treated the blockchain as a black box and benefited from its
immutability property. However, the blockchain was accessible to everyone,
giving rise to security concerns; moreover, efficient extraction of provenance
faces challenges due to the enormous scale and complexity of digital data. This
necessitates a tailored blockchain design for digital forensics. Our solution,
ForensiBlock, has a novel design that automates investigation steps, ensures
secure data access, traces data origins, preserves records, and expedites
provenance extraction. ForensiBlock incorporates Role-Based Access Control with
Staged Authorization (RBAC-SA) and a distributed Merkle root for case tracking.
These features support authorized resource access with efficient retrieval of
provenance records. In particular, comparing two methods for extracting
provenance records (off-chain storage retrieval with Merkle root verification
versus a brute-force search), the off-chain method is significantly better,
especially as the blockchain size and number of cases increase. We also found
that our distributed Merkle root creation slightly increases smart contract
processing time but significantly improves history access. Overall, we show
that ForensiBlock offers secure, efficient, and reliable handling of digital
forensic data.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
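The advantage of Merkle-root verification over a brute-force search can be shown concretely: once a root committing to all provenance records is known, a single record is verified with a logarithmic-size proof instead of rescanning every record. This is a generic Merkle-tree sketch under invented record names, not ForensiBlock's actual on-chain implementation.

```python
import hashlib

# Generic Merkle-tree sketch: off-chain provenance records are committed to
# by a single root; one record is later verified with only sibling hashes.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes (with side flags) needed to recompute the root."""
    level, proof = [h(x) for x in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = index ^ 1
        proof.append((level[sib], sib > index))  # True if sibling is right
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root

records = [b"case42:acquired", b"case42:hashed", b"case42:analyzed"]
root = merkle_root(records)          # the only value kept on chain
proof = merkle_proof(records, 1)
ok = verify(b"case42:hashed", proof, root)  # True
```

A tampered record fails verification against the same root, which is what lets the on-chain commitment vouch for off-chain provenance storage.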