Provenance in scientific workflow systems
The automated tracking and storage of provenance information promises to be a major advantage of scientific workflow systems. We discuss issues related to data and workflow provenance, and present techniques for focusing user attention on meaningful provenance through "user views," for managing the provenance of nested scientific data, and for using information about the evolution of a workflow specification to understand the difference in the provenance of similar data products.
Automatic vs Manual Provenance Abstractions: Mind the Gap
In recent years the need to simplify or to hide sensitive information in
provenance has given way to research on provenance abstraction. In the context
of scientific workflows, existing research provides techniques to
semi-automatically create abstractions of a given workflow description, which are in
turn used as filters over the workflow's provenance traces. An alternative
approach that is commonly adopted by scientists is to build workflows with
abstractions embedded into the workflow's design, such as using sub-workflows.
This paper reports on the comparison of manual versus semi-automated approaches
in a context where result abstractions are used to filter report-worthy results
of computational scientific analyses. Specifically, we take a real-world
workflow containing user-created design abstractions and compare these with
abstractions created by ZOOM UserViews and Workflow Summaries systems. Our
comparison shows that semi-automatic and manual approaches largely overlap from
a process perspective; meanwhile, there is a dramatic mismatch in terms of the data
artefacts retained in an abstracted account of derivation. We discuss reasons
and suggest future research directions.
Comment: Preprint accepted to the 2016 workshop on the Theory and Applications
of Provenance, TAPP 201
Data Workflow - A Workflow Model for Continuous Data Processing
Online or streaming data are becoming increasingly important for enterprise information systems, e.g., through the integration of sensor data and workflows. The continuous flow of data provided, e.g., by sensors requires new workflow models that address the data perspective of these applications, since continuous data is potentially infinite while business process instances are always finite.
In this paper, a formal workflow model is proposed that uses data-driven coordination and makes the properties of continuous data processing explicit. These properties can be used to optimize data workflows, i.e., to reduce the computational power needed to process the workflows in an engine by reusing intermediate processing results across several workflows.
Designing Traceability into Big Data Systems
Providing an appropriate level of accessibility and traceability to data or
process elements (so-called Items) in large volumes of data, often
Cloud-resident, is an essential requirement in the Big Data era.
Enterprise-wide data systems need to be designed from the outset to support
usage of such Items across the spectrum of business use rather than from any
specific application view. The design philosophy advocated in this paper is to
drive the design process using a so-called description-driven approach which
enriches models with meta-data and description and focuses the design process
on Item re-use, thereby promoting traceability. Details are given of the
description-driven design of big data systems at CERN, in health informatics
and in business process management. Evidence is presented that the approach
leads to design simplicity and consequent ease of management thanks to loose
typing and the adoption of a unified approach to Item management and usage.
Comment: 10 pages; 6 figures in Proceedings of the 5th Annual International
Conference on ICT: Big Data, Cloud and Security (ICT-BDCS 2015), Singapore
July 2015. arXiv admin note: text overlap with arXiv:1402.5764,
arXiv:1402.575
Interactive Visual Analysis of Networked Systems: Workflows for Two Industrial Domains
We report on a first study of interactive visual analysis of networked systems. Working with ABB Corporate Research and Ericsson Research, we have created workflows which demonstrate the potential of visualization in the domains of industrial automation and telecommunications. By a workflow in this context, we mean a sequence of visualizations and the actions for generating them. Visualizations can be any images that represent properties of the data sets analyzed, and actions typically either change the selection of data visualized or change the visualization by choice of technique or change of parameters.
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still needs
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions
- …