A primer on provenance
Better understanding data requires tracking its history and context.
High-Fidelity Provenance: Exploring the Intersection of Provenance and Security
In the past 25 years, the World Wide Web has disrupted the way news is disseminated and consumed. However, the euphoria over the democratization of news publishing was soon followed by scepticism, as a new phenomenon emerged: fake news. With no gatekeepers to vouch for it, the veracity of information served over the World Wide Web became a major public concern. The Reuters Digital News Report 2020 notes that in at least half of the EU member countries, 50% or more of the population is concerned about online fake news. To help address the problem of trust in information communicated over the World Wide Web, it has been proposed to also make available the provenance metadata of the information. Similar to artwork provenance, this would include a detailed track of how the information was created, updated and propagated to produce the result we read, as well as which agents, human or software, were involved in the process. However, keeping track of provenance information is a non-trivial task. Current approaches are often of limited scope and may require modifying existing applications to generate provenance information along with their regular output. This thesis explores how provenance can be automatically tracked in an application-agnostic manner, without having to modify the individual applications. We frame provenance capture as a data flow analysis problem and explore the use of dynamic taint analysis in this context. Our work shows that this approach improves on the quality of provenance captured compared to traditional approaches, yielding what we term high-fidelity provenance. We explore the performance cost of this approach and use deterministic record and replay to bring it down to a more practical level. Furthermore, we create and present the tooling necessary for expanding the use of deterministic record and replay for provenance analysis.
The thesis concludes with an application of high-fidelity provenance as a tool for state-of-the-art offensive security analysis, based on the intuition that software too can be misguided by "fake news". This demonstrates that the potential uses of high-fidelity provenance for security extend beyond traditional forensic analysis.
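The core idea of framing provenance capture as taint propagation can be illustrated with a minimal sketch. This is purely illustrative: the thesis applies dynamic taint analysis at the systems level to unmodified applications, not via a wrapper class like the hypothetical one below.

```python
# Minimal sketch of provenance capture via taint propagation
# (illustrative only; not the thesis's actual implementation).

class Tainted:
    """A value carrying the set of sources that contributed to it."""

    def __init__(self, value, sources):
        self.value = value
        self.provenance = frozenset(sources)

    def __add__(self, other):
        # Any operation combining two values unions their provenance,
        # which is the core propagation rule of taint analysis.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value,
                           self.provenance | other.provenance)
        return Tainted(self.value + other, self.provenance)

    __radd__ = __add__


# Two inputs read from hypothetical files:
a = Tainted(10, {"input_a.csv"})
b = Tainted(32, {"input_b.csv"})
c = a + b  # the result's provenance records both contributing files

print(c.value)               # 42
print(sorted(c.provenance))  # ['input_a.csv', 'input_b.csv']
```

Propagating source sets through every operation is what yields the fine-grained, "high-fidelity" input-to-output relationships that coarser, application-reported provenance misses.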
Distributed workflows with Jupyter
The designers of a new coordination interface enacting complex workflows have to tackle a dichotomy: choosing a language-independent or language-dependent approach. Language-independent approaches decouple workflow models from the host code's business logic and advocate portability. Language-dependent approaches foster flexibility and performance by adopting the same host language for business and coordination code. Jupyter Notebooks, with their capability to describe both imperative and declarative code in a unique format, allow taking the best of both approaches, maintaining a clear separation between application and coordination layers while still providing a unified interface to both aspects. We advocate the Jupyter Notebooks' potential to express complex distributed workflows, identifying the general requirements for a Jupyter-based Workflow Management System (WMS) and introducing a proof-of-concept portable implementation working on hybrid Cloud-HPC infrastructures. As a byproduct, we extended the vanilla IPython kernel with workflow-based parallel and distributed execution capabilities. The proposed Jupyter-workflow (Jw) system is evaluated on common scenarios for High Performance Computing (HPC) and Cloud, showing its potential in lowering the barriers between prototypical Notebooks and production-ready implementations.
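The workflow view of a notebook can be sketched in miniature: cells declare data dependencies, and execution order is derived from them. This toy sketch (hypothetical cell names; Jupyter-workflow itself targets distributed Cloud-HPC execution, which is not shown) illustrates only the dependency-driven ordering at its core.

```python
# Toy sketch of dependency-driven cell ordering in a notebook-based
# workflow manager (illustrative, not the Jw system's implementation).

from graphlib import TopologicalSorter

# Hypothetical cells: name -> (input files, output files)
cells = {
    "preprocess": ([], ["clean.csv"]),
    "train":      (["clean.csv"], ["model.bin"]),
    "evaluate":   (["model.bin", "clean.csv"], ["report.txt"]),
}

# A cell depends on every cell that produces one of its inputs.
producers = {out: name for name, (_, outs) in cells.items() for out in outs}
graph = {name: {producers[i] for i in ins} for name, (ins, _) in cells.items()}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['preprocess', 'train', 'evaluate']
```

Deriving the order from declared dependencies is what lets the coordination layer parallelize and distribute independent cells without touching the business logic inside them.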
Bacatá: Notebooks for DSLs, Almost for Free
Context: Computational notebooks are a contemporary style of literate programming, in which users can communicate and transfer knowledge by interleaving executable code, output, and prose in a single rich document. A Domain-Specific Language (DSL) is an artificial software language tailored for a particular application domain. Usually, DSL users are domain experts who may not have a software engineering background. As a consequence, they might not be familiar with Integrated Development Environments (IDEs). Thus, the development of tools that offer different interfaces for interacting with a DSL is relevant.

Inquiry: However, resources available to DSL designers are limited. We would like to leverage tools used to interact with general-purpose languages (GPLs) in the context of DSLs. Computational notebooks are an example of such tools. Our main question is therefore: what is an efficient and effective method of designing and implementing notebook interfaces for DSLs? By addressing this question we might be able to speed up the development of DSL tools and ease the interaction between end-users and DSLs.

Approach: In this paper, we present Bacatá, a mechanism for generating notebook interfaces for DSLs in a language-parametric fashion. We designed this mechanism so that language engineers can reuse as many language components (e.g., language processors, type checkers, code generators) as possible.

Knowledge: Our results show that notebook interfaces for DSLs can be generated by Bacatá with little manual configuration. There are a few considerations and caveats, related to language design aspects, that language engineers should address. With Bacatá, creating a notebook for a DSL becomes a matter of writing the code that wires existing language components in the Rascal language workbench with the Jupyter platform.

Grounding: We evaluate Bacatá by generating functional computational notebook interfaces for three different non-trivial DSLs: a small subset of Halide (a DSL for digital image processing), SweeterJS (an extended version of JavaScript), and QL (a DSL for questionnaires). To assess the benefit of generating notebook implementations rather than writing them manually, we measured and compared the number of Source Lines of Code (SLOC) that we reused from existing implementations of those languages.

Importance: The adoption of notebooks by novice programmers and end-users has made them very popular in several domains such as exploratory programming, data science, data journalism, and machine learning. Why are they popular? In (data) science, it is essential to make results reproducible as well as understandable. However, notebooks are mostly available for GPLs. This paper opens up the notebook metaphor for DSLs to improve the end-user experience when interacting with code and to increase DSL adoption.
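The reuse idea behind such generated notebook interfaces can be sketched as follows. This is a hypothetical miniature, not Bacatá's actual API (which lives in the Rascal language workbench): the point is that a notebook front-end only needs an "execute" hook that delegates to an existing DSL interpreter.

```python
# Illustrative sketch: a minimal kernel-like shell that reuses an
# existing DSL interpreter instead of reimplementing the language.

class DSLKernel:
    """Wraps a reusable DSL interpreter behind a notebook-style cell API."""

    def __init__(self, interpreter):
        self.interpreter = interpreter  # reused language component
        self.history = []

    def execute_cell(self, source):
        result = self.interpreter(source)  # delegate, don't reimplement
        self.history.append((source, result))
        return result


# A toy "calculator DSL" interpreter standing in for a real one:
def calc_interpreter(src):
    left, op, right = src.split()
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    return ops[op](int(left), int(right))


kernel = DSLKernel(calc_interpreter)
print(kernel.execute_cell("6 * 7"))  # 42
```

Because the interpreter, type checker, and other components already exist, the marginal cost of the notebook interface is mostly the thin wiring layer, which is what the SLOC-reuse measurements in the paper quantify.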
Provenance-based computing
Relying on computing systems that become increasingly complex is difficult: with many factors potentially affecting the result of a computation or its properties, understanding where problems appear and fixing them is a challenging proposition. Typically, the process of finding solutions is driven by trial and error or by experience-based insights.

In this dissertation, I examine the idea of using provenance metadata (the set of elements that have contributed to the existence of a piece of data, together with their relationships) instead. I show that considering provenance a primitive of computation enables the exploration of system behaviour, targeting both retrospective analysis (root cause analysis, performance tuning) and hypothetical scenarios (what-if questions). In this context, provenance can be used as part of feedback loops, with a double purpose: building software that is able to adapt to meet certain quality and performance targets (semi-automated tuning) and enabling human operators to exert high-level runtime control with limited previous knowledge of a system's internal architecture.

My contributions towards this goal are threefold: providing low-level mechanisms for meaningful provenance collection considering OS-level resource multiplexing, proving that such provenance data can be used in inferences about application behaviour, and generalising this to a set of primitives necessary for fine-grained provenance disclosure in a wider context.

To derive such primitives in a bottom-up manner, I first present Resourceful, a framework that enables capturing OS-level measurements in the context of application activities. It is the contextualisation that allows tying the measurements to provenance in a meaningful way, and I look at a number of use-cases in understanding application performance. This also provides a good setup for evaluating the impact and overheads of fine-grained provenance collection.

I then show that the collected data enables new ways of understanding performance variation by attributing it to specific components within a system. The resulting set of tools, Soroban, gives developers and operations engineers a principled way of examining the impact of various configuration, OS and virtualization parameters on application behaviour.

Finally, I consider how this supports the idea that provenance should be disclosed at application level and discuss why such disclosure is necessary for enabling the use of collected metadata efficiently and at a granularity which is meaningful in relation to application semantics.
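The contextualisation idea behind Resourceful can be sketched in a few lines. This is a hypothetical illustration, not Resourceful's kernel-level implementation: the point is that an OS-level measurement only becomes useful provenance once it is attributed to the application activity in whose context it occurred.

```python
# Illustrative sketch of contextualising OS-level measurements
# (hypothetical API; Resourceful itself works at the OS level).

import contextlib

measurements = []              # (activity, metric, value) records
_current_activity = ["<idle>"]

@contextlib.contextmanager
def activity(name):
    """Mark a region of application work so measurements can be attributed."""
    _current_activity.append(name)
    try:
        yield
    finally:
        _current_activity.pop()

def record(metric, value):
    measurements.append((_current_activity[-1], metric, value))

with activity("handle_request"):
    record("disk_bytes_read", 4096)
    with activity("render_template"):
        record("cpu_ms", 12)

# Group each measurement under the activity that produced it:
by_activity = {}
for act, metric, value in measurements:
    by_activity.setdefault(act, []).append((metric, value))

print(by_activity["render_template"])  # [('cpu_ms', 12)]
```

Attributing resource usage to nested activities in this way is what allows later tools (such as Soroban, in the dissertation) to explain performance variation in terms of specific system components.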
A provenance-based semantic approach to support understandability, reproducibility, and reuse of scientific experiments
Understandability and reproducibility of scientific results are vital in every field of science. Several reproducibility measures are being taken to make the data used in publications findable and accessible. However, scientists face many challenges from the beginning of an experiment to its end, in particular for data management. The explosive growth of heterogeneous research data, and understanding how this data has been derived, is one of the research problems faced in this context. Interlinking the data, the steps and the results from the computational and non-computational processes of a scientific experiment is important for reproducibility. We introduce the notion of "end-to-end provenance management" of scientific experiments to help scientists understand and reproduce experimental results. The main contributions of this thesis are: (1) We propose a provenance model, "REPRODUCE-ME", to describe scientific experiments using semantic web technologies by extending existing standards. (2) We study computational reproducibility and the important aspects required to achieve it. (3) Taking into account the REPRODUCE-ME provenance model and the study on computational reproducibility, we introduce our tool, ProvBook, which is designed and developed to demonstrate computational reproducibility. It provides features to capture and store the provenance of Jupyter notebooks and helps scientists compare and track the results of different executions. (4) We provide a framework, CAESAR (CollAborative Environment for Scientific Analysis with Reproducibility), for end-to-end provenance management. This collaborative framework allows scientists to capture, manage, query and visualize the complete path of a scientific experiment, consisting of computational and non-computational steps, in an interoperable way. We apply our contributions to a set of scientific experiments in microscopy research projects.
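The kind of execution provenance described for notebook cells can be sketched minimally. The record layout below is hypothetical (not ProvBook's actual schema): each execution stores the cell source, its output, and a timestamp, so the results of different runs can be compared.

```python
# Illustrative sketch of capturing and comparing notebook cell
# executions (hypothetical record layout, not ProvBook's schema).

from datetime import datetime, timezone

executions = []

def record_execution(cell_id, source, output):
    executions.append({
        "cell": cell_id,
        "source": source,
        "output": output,
        "time": datetime.now(timezone.utc).isoformat(),
    })

def compare_runs(cell_id):
    """Return pairs of successive outputs of a cell that differ."""
    runs = [e for e in executions if e["cell"] == cell_id]
    return [(a["output"], b["output"])
            for a, b in zip(runs, runs[1:]) if a["output"] != b["output"]]

record_execution("cell-1", "mean(data)", 3.14)
record_execution("cell-1", "mean(data)", 2.71)  # data changed between runs
print(compare_runs("cell-1"))  # [(3.14, 2.71)]
```

Storing per-execution provenance like this is what makes a silent change in results visible and traceable, which is the reproducibility property the thesis targets.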
Workflow models for heterogeneous distributed systems
The role of data in modern scientific workflows is becoming more and more crucial. The unprecedented amount of data available in the digital era, combined with recent advancements in Machine Learning and High-Performance Computing (HPC), has let computers surpass human performance in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy is crucial for key aspects like performance optimisation, privacy preservation and security.
Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worthwhile, or even unavoidable, to transfer data between different steps of a complex workflow.
The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and improve the adoption of such methodology.
Each of these contributions is accompanied by a full-fledged, Open Source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on a total of five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executing them on large-scale mixed cloud-HPC infrastructures.
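The separation of concerns the dissertation describes can be sketched with hypothetical structures (not the actual implementation): each workflow step keeps its business logic separate from the execution environment it is bound to, which is what enables topology-aware scheduling and explicit data transfers across hybrid cloud-HPC infrastructures.

```python
# Illustrative sketch of separating business logic, data dependencies,
# and execution environments in a workflow model (hypothetical names).

from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    command: str                                  # business logic
    inputs: list = field(default_factory=list)    # data dependencies
    target: str = "local"                         # execution environment

workflow = [
    Step("align", "bwa mem ref.fa reads.fq", target="hpc-cluster"),
    Step("report", "summarize align.out", inputs=["align"], target="cloud-vm"),
]

# A topology-aware scheduler can now see which dependencies cross
# execution environments and therefore require a data transfer:
transfers = [(s.name, dep) for s in workflow for dep in s.inputs
             if next(w for w in workflow if w.name == dep).target != s.target]
print(transfers)  # [('report', 'align')]
```

Making the environment binding an explicit, declarative attribute of each step (rather than burying it in the business logic) is the design choice that keeps the model portable across infrastructures.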
Big Ideas paper: Policy-driven middleware for a legally-compliant Internet of Things.
Internet of Things (IoT) applications, systems and services are subject to law. We argue that for the IoT to develop lawfully, there must be technical mechanisms that allow the enforcement of specified policy, such that systems align with legal realities. The audit of policy enforcement must assist the apportionment of liability, demonstrate compliance with regulation, and indicate whether policy correctly captures legal responsibilities. As both systems and obligations evolve dynamically, this cycle must be continuously maintained. This poses a huge challenge given the global scale of the IoT vision. The IoT entails dynamically creating new services through managed and flexible data exchange. Data management is complex in this dynamic environment, given the need to both control and share information, often across federated domains of administration.

We see middleware playing a key role in managing the IoT. Our vision is for a middleware-enforced, unified policy model that applies end-to-end, throughout the IoT. This is because policy cannot be bound to things, applications, or administrative domains, since functionality is the result of composition, with dynamically formed chains of data flows. We have investigated the use of Information Flow Control (IFC) to manage and audit data flows in cloud computing; a domain where trust can be well-founded, regulations are more mature and associated responsibilities clearer. We feel that IFC has great potential in the broader IoT context. However, the sheer scale and the dynamic, federated nature of the IoT pose a number of significant research challenges.
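The IFC idea the paper builds on can be illustrated with a simplified label model (this sketch is not the paper's middleware; labels and endpoint names are hypothetical): data and endpoints carry tags, and a flow is permitted only if the receiver is cleared for every tag on the data.

```python
# Minimal sketch of an Information Flow Control (IFC) check
# (simplified tag-set model; illustrative only).

def can_flow(data_labels, receiver_labels):
    """A flow is safe when the receiver is cleared for all data tags."""
    return set(data_labels) <= set(receiver_labels)

reading = {"patient-7", "medical"}              # tags on a sensor reading
hospital_app = {"patient-7", "medical", "billing"}
ad_network = {"analytics"}

print(can_flow(reading, hospital_app))  # True
print(can_flow(reading, ad_network))    # False: flow blocked
```

Because every permitted or blocked flow is decided by an explicit check like this, the middleware can also log each decision, producing exactly the audit trail the paper argues is needed for apportioning liability and demonstrating compliance.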
Explainable and Resource-Efficient Stream Processing Through Provenance and Scheduling
In our era of big data, information is captured at unprecedented volumes and velocities, with technologies such as Cyber-Physical Systems making quick decisions based on the processing of streaming, unbounded datasets. In such scenarios, it can be beneficial to process the data in an online manner, using the stream processing paradigm implemented by Stream Processing Engines (SPEs). While SPEs enable high-throughput, low-latency analysis, they are faced with challenges connected to evolving deployment scenarios, like the increasing use of heterogeneous, resource-constrained edge devices together with cloud resources, and increasing user expectations for usability, control, and resource-efficiency, on par with features provided by traditional databases.

This thesis tackles open challenges regarding making stream processing more user-friendly, customizable, and resource-efficient. The first part outlines our work, providing high-level background information, descriptions of the research problems, and our contributions. The second part presents our three state-of-the-art frameworks for explainable data streaming using data provenance, which can help users of streaming queries identify important data points, explain unexpected behaviors, and aid query understanding and debugging. (A) GeneaLog provides backward provenance, allowing users to identify the inputs that contributed to the generation of each output of a streaming query. (B) Ananke is the first framework to provide a duplicate-free graph of live forward provenance, enabling easy bidirectional tracing of input-output relationships in streaming queries and identifying data points that have finished contributing to results. (C) Erebus is the first framework that allows users to define expectations about the results of a streaming query, validating whether these expectations are met or otherwise providing explanations in the form of why-not provenance.
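Backward provenance of the kind GeneaLog provides can be sketched with a toy streaming operator (illustrative only; GeneaLog's contribution is providing this with low overhead inside real SPEs): each output tuple keeps references to the input tuples that contributed to it.

```python
# Toy sketch of backward provenance in a streaming operator
# (illustrative; not GeneaLog's instrumentation-based implementation).

def windowed_average(stream, size):
    """Average over tumbling windows, annotating each result with
    the contributing inputs (its backward provenance)."""
    for i in range(0, len(stream) - size + 1, size):
        window = stream[i:i + size]
        yield {
            "value": sum(t["value"] for t in window) / size,
            "contributors": [t["id"] for t in window],
        }

inputs = [{"id": f"t{n}", "value": v} for n, v in enumerate([2, 4, 9, 3])]
outputs = list(windowed_average(inputs, size=2))
print(outputs[0])  # {'value': 3.0, 'contributors': ['t0', 't1']}
```

Given such contributor sets, a user can trace any suspicious output back to the exact inputs that produced it, which is the debugging and explanation capability the thesis highlights; forward provenance (as in Ananke) inverts the same relationship, from inputs to the outputs they influenced.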
The third part presents techniques for execution efficiency through custom scheduling, introducing our state-of-the-art scheduling frameworks that control resource allocation and achieve user-defined performance goals. (D) Haren is an SPE-agnostic user-level scheduler that can efficiently enforce user-defined scheduling policies. (E) Lachesis is a standalone scheduling middleware that requires no changes to SPEs but, instead, directly guides the scheduling decisions of the underlying Operating System. Our extensive evaluations using real-world SPEs and workloads show that our work significantly improves over the state-of-the-art while introducing only small performance overheads.
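A user-defined scheduling policy of the kind such frameworks enforce can be sketched as a simple selection function (hypothetical policy and field names, not Haren's API): among operators with pending work, run the one holding the oldest tuple, approximating a latency-oriented goal.

```python
# Toy sketch of a user-defined, latency-oriented scheduling policy
# (hypothetical; illustrative of the policy-as-function idea only).

def pick_next(operators):
    """operators: dicts with 'name', 'queue_len', 'oldest_tuple_ts'."""
    ready = [op for op in operators if op["queue_len"] > 0]
    return min(ready, key=lambda op: op["oldest_tuple_ts"])["name"]

operators = [
    {"name": "filter",    "queue_len": 3, "oldest_tuple_ts": 105},
    {"name": "aggregate", "queue_len": 1, "oldest_tuple_ts": 98},
    {"name": "join",      "queue_len": 0, "oldest_tuple_ts": 0},
]

print(pick_next(operators))  # 'aggregate': holds the oldest pending tuple
```

Swapping this function for one that ranks by queue length or by a throughput metric changes the performance goal without changing the query, which is the flexibility that separating scheduling policy from the SPE provides.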