1,589 research outputs found
PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems
Data provenance, or data lineage, describes the life cycle of data. In
scientific workflows on HPC systems, scientists often seek diverse provenance
(e.g., origins of data products, usage patterns of datasets). Unfortunately,
existing provenance solutions cannot address the challenges due to their
incompatible provenance models and/or system implementations. In this paper, we
analyze four representative scientific workflows in collaboration with the
domain scientists to identify concrete provenance needs. Based on the
first-hand analysis, we propose a provenance framework called PROV-IO+, which
includes an I/O-centric provenance model for describing scientific data and the
associated I/O operations and environments precisely. Moreover, we build a
prototype of PROV-IO+ to enable end-to-end provenance support on real HPC
systems with little manual effort. The PROV-IO+ framework can support both
containerized and non-containerized workflows on different HPC platforms with
flexibility in selecting various classes of provenance. Our experiments with
realistic workflows show that PROV-IO+ can address the provenance needs of the
domain scientists effectively with reasonable performance (e.g., less than 3.5%
tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a
state-of-the-art system (i.e., ProvLake) in our experiments.
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still needs
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions.
A Brief Tour through Provenance in Scientific Workflows and Databases
Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.
NeuroProv: Provenance data visualisation for neuroimaging analyses
© 2019 Elsevier Ltd. Visualisation underpins the understanding of scientific data both through exploration and explanation of analysed data. Provenance strengthens the understanding of data by showing the process of how a result has been achieved. With the significant increase in data volumes and algorithm complexity, clinical researchers are struggling with information tracking, analysis reproducibility and the verification of scientific output. In addition, data coming from various heterogeneous sources with varying levels of trust in a collaborative environment adds to the uncertainty of the scientific outputs. This provides the motivation for provenance data capture and visualisation support for analyses. In this paper, a system, NeuroProv, is presented to visualise provenance data in order to aid in the verification of scientific outputs, the comparison of analyses, and the progression and evolution of results for neuroimaging analyses. The experimental results show the effectiveness of visualising provenance data for neuroimaging analyses.
Using domain ontologies to help track data provenance.
Motivating example. POESIA ontologies and ontological coverages. Ontological estimation of data provenance. Ontological nets for data integration. Data integration operators. Data reconciling through articulation of ontologies. Semantic workflows. Related work. Conclusions
Information provenance for open distributed collaborative system
In autonomously managed distributed systems for collaboration, provenance can facilitate the reuse of interchanged information and the repetition of successful experiments, or provide evidence for trust mechanisms that certain information existed at a certain period during collaboration. In this paper, we propose a domain-independent information provenance architecture for open collaborative distributed systems. The proposed system uses XML for interchanging information and RDF to track information provenance. The use of XML and RDF also ensures that information is universally acceptable even among heterogeneous nodes. Our proposed information provenance model can work on any operating system or workflow.