Provenance-based computing
Relying on computing systems that become increasingly complex is difficult:
with many factors potentially affecting the result of a computation or its
properties, understanding where problems appear and fixing them is a
challenging proposition. Typically, the process of finding solutions is driven
by trial and error or by experience-based insights.
In this dissertation, I examine the idea of using provenance metadata (the set
of elements that have contributed to the existence of a piece of data, together
with their relationships) instead. I show that considering provenance a
primitive of computation enables the exploration of system behaviour, targeting
both retrospective analysis (root cause analysis, performance tuning) and
hypothetical scenarios (what-if questions). In this context, provenance can be
used as part of feedback loops, with a double purpose: building software that
can adapt to meet certain quality and performance targets
(semi-automated tuning) and enabling human operators to exert high-level
runtime control with limited previous knowledge of a system's internal architecture.
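
As a rough illustration of the definition above (provenance as the set of contributing elements together with their relationships), the following minimal sketch stores derivation edges and walks them backwards, which is the basic operation behind retrospective questions such as root cause analysis. It is not taken from the dissertation: the `ProvenanceGraph` class, its method names, and the file names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy provenance store (hypothetical): maps each item to its direct inputs."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, output, inputs):
        """Record that `output` was derived from each element of `inputs`."""
        self._parents[output].update(inputs)

    def lineage(self, item):
        """Return every element that contributed, directly or indirectly, to `item`."""
        seen = set()
        stack = [item]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Illustrative data only: a report built from a CSV, itself parsed from a log.
graph = ProvenanceGraph()
graph.record("report.pdf", {"results.csv", "plot.py"})
graph.record("results.csv", {"raw.log", "parse.py"})
contributors = graph.lineage("report.pdf")
print(sorted(contributors))
```

Asking for the lineage of `report.pdf` returns both its direct inputs and the inputs of `results.csv`; a real system would also attach measurements and timestamps to each edge, which this sketch leaves out.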
My contributions towards this goal are threefold: providing low-level
mechanisms for meaningful provenance collection considering OS-level resource
multiplexing, proving that such provenance data can be used in inferences about
application behaviour, and generalising this to a set of primitives necessary for
fine-grained provenance disclosure in a wider context.
To derive such primitives in a bottom-up manner, I first present Resourceful, a
framework that enables capturing OS-level measurements in the context of
application activities. It is the contextualisation that allows tying the
measurements to provenance in a meaningful way, and I look at a number of
use-cases in understanding application performance. This also provides a good
setup for evaluating the impact and overheads of fine-grained provenance
collection.
I then show that the collected data enables new ways of understanding
performance variation by attributing it to specific components within a
system. The resulting set of tools, Soroban, gives developers and operations
engineers a principled way of examining the impact of various configuration,
OS and virtualization parameters on application behaviour.
Finally, I consider how this supports the idea that provenance should be
disclosed at application level and discuss why such disclosure is necessary for
enabling the use of collected metadata efficiently and at a granularity which
is meaningful in relation to application semantics.
CHESS Scholarship Scheme
EPSR
The lifecycle of provenance metadata and its associated challenges and opportunities
This chapter outlines some of the challenges and opportunities associated
with adopting provenance principles and standards in a variety of disciplines,
including data publication and reuse, and information sciences.
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still needs
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet a ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions.
Towards structured sharing of raw and derived neuroimaging data across existing resources
Data sharing efforts increasingly contribute to the acceleration of
scientific discovery. Neuroimaging data is accumulating in distributed
domain-specific databases, and there is currently neither an integrated access
mechanism nor an accepted format for the critically important meta-data that is
necessary for making use of the combined, available neuroimaging data. In this
manuscript, we present work from the Derived Data Working Group, an open-access
group sponsored by the Biomedical Informatics Research Network (BIRN) and the
International Neuroimaging Coordinating Facility (INCF) focused on practical
tools for distributed access to neuroimaging data. The working group develops
models and tools facilitating the structured interchange of neuroimaging
meta-data and is making progress towards a unified set of tools for such data
and meta-data exchange. We report on the key components required for integrated
access to raw and derived neuroimaging data as well as associated meta-data and
provenance across neuroimaging resources. The components include (1) a
structured terminology that provides semantic context to data, (2) a formal
data model for neuroimaging with robust tracking of data provenance, (3) a web
service-based application programming interface (API) that provides a
consistent mechanism to access and query the data model, and (4) a provenance
library that can be used for the extraction of provenance data by image
analysts and imaging software developers. We believe that the framework and set
of tools outlined in this manuscript have great potential for solving many of
the issues the neuroimaging community faces when sharing raw and derived
neuroimaging data across the various existing database systems for the purpose
of accelerating scientific discovery.