616 research outputs found
The lifecycle of provenance metadata and its associated challenges and opportunities
This chapter outlines some of the challenges and opportunities associated
with adopting provenance principles and standards in a variety of disciplines,
including data publication and reuse, and information sciences
Labeling Workflow Views with Fine-Grained Dependencies
This paper considers the problem of efficiently answering reachability
queries over views of provenance graphs, derived from executions of workflows
that may include recursion. Such views include composite modules and model
fine-grained dependencies between module inputs and outputs. A novel
view-adaptive dynamic labeling scheme is developed for efficient query
evaluation, in which view specifications are labeled statically (i.e. as they
are created) and data items are labeled dynamically as they are produced during
a workflow execution. Although the combination of fine-grained dependencies and
recursive workflows entail, in general, long (linear-size) data labels, we show
that for a large natural class of workflows and views, labels are compact
(logarithmic-size) and reachability queries can be evaluated in constant time.
Experimental results demonstrate the benefit of this approach over the
state-of-the-art technique when applied for labeling multiple views.Comment: VLDB201
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still need
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions
Scalable And Secure Provenance Querying For Scientific Workflows And Its Application In Autism Study
In the era of big data, scientific workflows have become essential to automate scientific experiments and guarantee repeatability. As both data and workflow increase in their scale, requirements for having a data lineage management system commensurate with the complexity of the workflow also become necessary, calling for new scalable storage, query, and analytics infrastructure. This system that manages and preserves the derivation history and morphosis of data, known as provenance system, is essential for maintaining quality and trustworthiness of data products and ensuring reproducibility of scientific discoveries. With a flurry of research and increased adoption of scientific workflows in processing sensitive data, i.e., health and medication domain, securing information flow and instrumenting access privileges in the system have become a fundamental precursor to deploying large-scale scientific workflows. That has become more important now since today team of scientists around the world can collaborate on experiments using globally distributed sensitive data sources. Hence, it has become imperative to augment scientific workflow systems as well as the underlying provenance management systems with data security protocols. Provenance systems, void of data security protocol, are susceptible to vulnerability. In this dissertation research, we delineate how scientific workflows can improve therapeutic practices in autism spectrum disorders. The data-intensive computation inherent in these workflows and sensitive nature of the data, necessitate support for scalable, parallel and robust provenance queries and secured view of data. With that in perspective, we propose , a parallel, robust, reliable and scalable provenance query language and introduce the concept of access privilege inheritance in the provenance systems. We characterize desirable properties of role-based access control protocol in scientific workflows and demonstrate how the qualities are integrated into the workflow provenance systems as well. Finally, we describe how these concepts fit within the DATAVIEW workflow management system
Recommended from our members
A primer on provenance
Better understanding data requires tracking its history and context.</jats:p
Metadata and provenance management
Scientists today collect, analyze, and generate TeraBytes and PetaBytes of
data. These data are often shared and further processed and analyzed among
collaborators. In order to facilitate sharing and data interpretations, data
need to carry with it metadata about how the data was collected or generated,
and provenance information about how the data was processed. This chapter
describes metadata and provenance in the context of the data lifecycle. It also
gives an overview of the approaches to metadata and provenance management,
followed by examples of how applications use metadata and provenance in their
scientific processes
Querying and managing opm-compliant scientific workflow provenance
Provenance, the metadata that records the derivation history of scientific results, is important in scientific workflows to interpret, validate, and analyze the result of scientific computing. Recently,
to promote and facilitate interoperability among heterogeneous provenance systems, the Open Provenance Model (OPM) has been proposed and has played an important role in the community.
In this dissertation, to efficiently query and manage OPM-compliant provenance, we first propose a provenance collection framework that collects both prospective provenance, which captures
an abstract workflow specification as a recipe for future data derivation and retrospective provenance, which captures past workflow execution and data derivation information. We then
propose a relational database-based provenance system, called OPMPROV that stores, reasons, and queries prospective and retrospective provenance, which is OPM-compliant provenance. We finally propose OPQL, an OPM-level provenance query language, that is directly defined over the OPM model. An OPQL query takes an
OPM graph as input and produces an OPM graph as output; therefore, OPQL queries are not tightly coupled to the underlying provenance storage strategies. Our provenance store, provenance collection framework, and provenance query language feature the native support of the OPM model
- …