On the Limitations of Provenance for Queries With Difference
Annotating the results of database transformations has proven effective for a
variety of applications. Until recently, most work in this context focused on
positive query languages. Provenance semirings are a particular approach that
has proven effective for these languages: when provenance is propagated through
semirings, the expected equivalence axioms of the corresponding query languages
are satisfied. There have been several attempts to extend the framework to
account for relational algebra queries with difference. We show here that
these suggestions fail to satisfy
some expected equivalence axioms (that in particular hold for queries on
"standard" set and bag databases). Interestingly, we show that this is not a
shortcoming of these particular attempts: rather, every such attempt is bound
to fail in satisfying these axioms for some semirings. Finally, we show
particular semirings for which an extension for supporting difference is
(im)possible.

Comment: TAPP 201
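To make the semiring framework concrete, here is a minimal sketch (not tied to any particular paper's formalism, and covering only two positive operations) of how annotations propagate: union combines annotations of shared tuples with the semiring's "plus", and cross product combines annotations of paired tuples with its "times". Instantiating with the counting semiring (natural numbers with ordinary + and ×) recovers bag semantics.

```python
def union(r, s, plus):
    # Union of annotated relations: annotations of tuples present in
    # both inputs are combined with the semiring's "plus".
    out = dict(r)
    for t, a in s.items():
        out[t] = plus(out[t], a) if t in out else a
    return out

def product(r, s, times):
    # Cross product: annotations of paired tuples are combined with
    # the semiring's "times".
    return {t1 + t2: times(a1, a2)
            for t1, a1 in r.items()
            for t2, a2 in s.items()}

# Counting semiring (N, +, *): annotations are tuple multiplicities,
# so the annotated semantics coincides with bag semantics.
r = {("a",): 2, ("b",): 1}   # "a" occurs twice, "b" once
s = {("a",): 3}
print(union(r, s, lambda x, y: x + y))    # {('a',): 5, ('b',): 1}
print(product(r, s, lambda x, y: x * y))  # {('a', 'a'): 6, ('b', 'a'): 3}
```

Difference has no such uniform treatment, which is exactly the gap the abstract addresses: a "minus" operation must interact correctly with "plus" and "times", and the paper shows this cannot be achieved for every semiring.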
A posteriori metadata from automated provenance tracking: Integration of AiiDA and TCOD
In order to make results of computational scientific research findable,
accessible, interoperable and re-usable, it is necessary to decorate them with
standardised metadata. However, there are a number of technical and practical
challenges that make this process difficult to achieve in practice. Here we
present the implementation of a protocol to tag crystal structures with their
computed properties, without the need for human intervention to curate the data.
This protocol leverages the capabilities of AiiDA, an open-source platform to
manage and automate scientific computational workflows, and TCOD, an
open-access database storing computed materials properties using a well-defined
and exhaustive ontology. Based on these, the complete procedure to deposit
computed data in the TCOD database is automated. All relevant metadata are
extracted from the full provenance information that AiiDA tracks and stores
automatically while managing the calculations. Such a protocol also enables
reproducibility of scientific data in the field of computational materials
science. As a proof of concept, the AiiDA-TCOD interface is used to deposit 170
theoretical structures together with their computed properties and their full
provenance graphs, consisting of over 4600 AiiDA nodes.
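The metadata extraction described above amounts to walking a provenance graph from a result node back through the calculations and inputs that produced it. The following toy sketch illustrates that traversal; the node names and graph layout are hypothetical, and this is not AiiDA's actual API.

```python
# Toy provenance-graph traversal (hypothetical node names and layout;
# not AiiDA's actual API). The graph maps each node to the nodes it
# was derived from; we collect the full set of ancestors.
def collect_provenance(node, graph, seen=None):
    if seen is None:
        seen = set()
    for parent in graph.get(node, ()):
        if parent not in seen:
            seen.add(parent)
            collect_provenance(parent, graph, seen)
    return seen

# A structure produced by a calculation that consumed an input and a code.
graph = {"structure": ["calculation"],
         "calculation": ["input_structure", "code"]}
print(collect_provenance("structure", graph))
# -> {'calculation', 'input_structure', 'code'} (in some order)
```

In a real deposition workflow, each ancestor node would contribute metadata (code version, input parameters, intermediate results) to the record sent to the database.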
Provenance Threat Modeling
Provenance systems are used to capture history metadata; applications include
ownership attribution and determining the quality of a particular data set.
Provenance systems are also used for debugging, process improvement,
understanding data, proof of ownership, certification of validity, etc. The
provenance of data includes information about the processes and source data
that lead to its current representation. In this paper we study the security
risks to which provenance systems might be exposed and recommend security
solutions to better protect the provenance information.

Comment: 4 pages, 1 figure, conference
An Architecture for Provenance Systems
This document covers the logical and process architectures of provenance systems. The logical architecture identifies key roles and their interactions, whereas the process architecture discusses distribution and security. A fundamental aspect of our presentation is its technology-independent nature, which makes it reusable: the principles exposed in this document may be applied to different technologies.
Approximation with Error Bounds in Spark
We introduce a sampling framework to support approximate computing with
estimated error bounds in Spark. Our framework allows sampling to be performed
at the beginning of a sequence of multiple transformations ending in an
aggregation operation. The framework constructs a data provenance tree as the
computation proceeds, then combines the tree with multi-stage sampling and
population estimation theories to compute error bounds for the aggregation.
When information about output keys is available early, the framework can also
use adaptive stratified reservoir sampling to avoid (or reduce) key losses in
the final output and to achieve more consistent error bounds across popular and
rare keys. Finally, the framework includes an algorithm to dynamically choose
sampling rates to meet user-specified constraints on the CDF of error bounds in
the outputs. We have implemented a prototype of our framework called
ApproxSpark, and used it to implement five approximate applications from
different domains. Evaluation results show that ApproxSpark can (a)
significantly reduce execution time if users can tolerate small amounts of
uncertainty and, in many cases, the loss of rare keys, and (b) automatically
find sampling rates that meet user-specified constraints on error bounds. We
also extensively explore and discuss the trade-offs between sampling rate,
execution time, accuracy, and key loss.
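The reservoir-sampling building block mentioned above can be sketched as follows. This is the classic single-pass "Algorithm R", not ApproxSpark's adaptive stratified variant; it keeps a uniform random sample of k items from a stream of unknown length.

```python
import random

def reservoir_sample(stream, k, seed=0):
    # Classic reservoir sampling ("Algorithm R"): maintain a uniform
    # random sample of k items over a stream of unknown length,
    # using a single pass and O(k) memory.
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform over all i+1 items seen
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample

print(reservoir_sample(range(100_000), 10))
```

A stratified variant keeps one such reservoir per output key (stratum), which is what lets a system avoid losing rare keys and equalize error bounds across popular and rare keys.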
Virtual Data in CMS Analysis
The use of virtual data for enhancing the collaboration between large groups
of scientists is explored in several ways:
- by defining "virtual" parameter spaces which can be searched and shared
in an organized way by a collaboration of scientists in the course of their
analysis;
- by providing a mechanism to log the provenance of results and the ability
to trace them back to the various stages in the analysis of real or simulated
data;
- by creating "checkpoints" in the course of an analysis to permit
collaborators to explore their own analysis branches by refining selections,
improving the signal-to-background ratio, varying the estimation of parameters,
etc.;
- by facilitating the audit of an analysis and the reproduction of its
results by a different group, or in a peer review context.
We describe a prototype for the analysis of data from the CMS experiment
based on the virtual data system Chimera and the object-oriented data analysis
framework ROOT. The Chimera system is used to chain together several steps in
the analysis process including the Monte Carlo generation of data, the
simulation of detector response, the reconstruction of physics objects and
their subsequent analysis, histogramming and visualization using the ROOT
framework.

Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, CA, USA, March 2003; 9 pages, LaTeX, 7 eps figures. PSN
TUAT010. V2: references added
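The chaining of analysis stages with logged provenance described in this abstract can be illustrated with a toy pipeline. The stage names below are hypothetical stand-ins; the actual prototype chains Monte Carlo generation, detector simulation, reconstruction, and ROOT-based analysis through the Chimera virtual data system.

```python
# Toy illustration of chaining analysis stages while recording
# provenance (hypothetical stage names; not the Chimera system itself).
def run_pipeline(data, stages):
    # Apply stages in order, logging each stage's name so a result
    # can later be traced back through the steps that produced it.
    log = []
    for name, fn in stages:
        data = fn(data)
        log.append(name)
    return data, log

stages = [("generate", lambda xs: [x * 2 for x in xs]),
          ("select",   lambda xs: [x for x in xs if x > 2])]
result, provenance = run_pipeline([1, 2, 3], stages)
print(result)      # [4, 6]
print(provenance)  # ['generate', 'select']
```

A virtual data system generalizes this idea: the log entries become declarative transformation recipes, so any intermediate result can be re-derived (or audited) from the recorded chain rather than stored.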