2,225 research outputs found
The lifecycle of provenance metadata and its associated challenges and opportunities
This chapter outlines some of the challenges and opportunities associated
with adopting provenance principles and standards in a variety of disciplines,
including data publication and reuse, and information sciences.
Causality and the semantics of provenance
Provenance, or information about the sources, derivation, custody or history
of data, has been studied recently in a number of contexts, including
databases, scientific workflows and the Semantic Web. Many provenance
mechanisms have been developed, motivated by informal notions such as
influence, dependence, explanation and causality. However, there has been
little study of whether these mechanisms formally satisfy appropriate policies
or even how to formalize relevant motivating concepts such as causality. We
contend that mathematical models of these concepts are needed to justify and
compare provenance techniques. In this paper we review a theory of causality
based on structural models that has been developed in artificial intelligence,
and describe work in progress on a causal semantics for provenance graphs.
Comment: Workshop submission
BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments
Advances in sequencing techniques have led to exponential growth in
biological data, demanding the development of large-scale bioinformatics
experiments. Because these experiments are computation- and data-intensive,
they require high-performance computing (HPC) techniques and can benefit from
specialized technologies such as Scientific Workflow Management Systems (SWfMS)
and databases. In this work, we present BioWorkbench, a framework for managing
and analyzing bioinformatics experiments. This framework automatically collects
provenance data, including both performance data from workflow execution and
data from the scientific domain of the workflow application. Provenance data
can be analyzed through a web application that abstracts a set of queries to
the provenance database, simplifying access to provenance information. We
evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree
assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a
RASopathy analysis workflow. We analyze each workflow from both computational
and scientific domain perspectives, by using queries to a provenance and
annotation database. Some of these queries are available as a pre-built feature
of the BioWorkbench web application. Through the provenance data, we show that
the framework is scalable and achieves high performance, reducing the execution
time of the case studies by up to 98%. We also show how the application of
machine learning techniques can enrich the analysis process.
Automatic vs Manual Provenance Abstractions: Mind the Gap
In recent years the need to simplify or to hide sensitive information in
provenance has given rise to research on provenance abstraction. In the context
of scientific workflows, existing research provides techniques to
semi-automatically create abstractions of a given workflow description, which
are in turn used as filters over the workflow's provenance traces. An alternative
approach that is commonly adopted by scientists is to build workflows with
abstractions embedded into the workflow's design, such as using sub-workflows.
This paper reports on the comparison of manual versus semi-automated approaches
in a context where result abstractions are used to filter report-worthy results
of computational scientific analyses. Specifically, we take a real-world
workflow containing user-created design abstractions and compare these with
abstractions created by ZOOM UserViews and Workflow Summaries systems. Our
comparison shows that semi-automatic and manual approaches largely overlap from
a process perspective; however, there is a dramatic mismatch in terms of data
artefacts retained in an abstracted account of derivation. We discuss reasons
and suggest future research directions.
Comment: Preprint accepted to the 2016 workshop on the Theory and Applications
of Provenance, TAPP 2016
Scientific Social Objects: The Social Objects and Multidimensional Network of the myExperiment Website
Scientific research is increasingly conducted digitally and online, and consequently we are seeing the emergence of new digital objects shared as part of the conduct and discourse of science. These Scientific Social Objects are more than lumps of domain-specific data: they may comprise multiple components which can also be shared separately and independently, and some contain descriptions of scientific processes from which new objects will be generated. Using the myExperiment social website as a case study we explore Scientific Social Objects and discuss their evolution.
Data Provenance Inference in Logic Programming: Reducing Effort of Instance-driven Debugging
Data provenance allows scientists in different domains to validate their models and algorithms and to find anomalies and unexpected behaviors. In previous work, we described on-the-fly interpretation of (Python) scripts to build a workflow provenance graph automatically and then infer fine-grained provenance information based on the workflow provenance graph and the availability of data. To broaden the scope of our approach and demonstrate its viability, in this paper we extend it beyond procedural languages to purely declarative languages such as logic programming under the stable model semantics. For experiments and validation, we use the Answer Set Programming solver oClingo, which makes it possible to formulate and solve stream reasoning problems in a purely declarative fashion. We demonstrate how the benefits of provenance inference over explicit provenance still hold in a declarative setting, and we briefly discuss the potential impact for declarative programming, in particular for instance-driven debugging of the model in declarative problem solving.
A Linked Data Approach to Sharing Workflows and Workflow Results
A bioinformatics analysis pipeline is often highly elaborate, due to the inherent complexity of biological systems and the variety and size of datasets. A digital equivalent of the ‘Materials and Methods’ section in wet laboratory publications would be highly beneficial to bioinformatics, for evaluating evidence and examining data across related experiments, while introducing the potential to find associated resources and integrate them as data and services. We present initial steps towards preserving bioinformatics ‘materials and methods’ by exploiting the workflow paradigm for capturing the design of a data analysis pipeline, and RDF to link the workflow, its component services, run-time provenance, and a personalized biological interpretation of the results. An example shows the reproduction of the unique graph of an analysis procedure, its results, provenance, and personal interpretation of a text mining experiment. It links data from Taverna, myExperiment.org, BioCatalogue.org, and ConceptWiki.org. The approach is relatively ‘light-weight’ and unobtrusive to bioinformatics users.