24 research outputs found
Practical whole-system provenance capture
Data provenance describes how data came to be in its present form. It includes data sources and the transformations that have been applied to them. Data provenance has many uses, from forensics and security to aiding the reproducibility of scientific experiments. We present CamFlow, a whole-system provenance capture mechanism that integrates easily into a PaaS offering. While there have been several prior whole-system provenance systems that captured a comprehensive, systemic and ubiquitous record of a system’s behavior, none have been widely adopted. They either A) impose too much overhead, B) are designed for long-outdated kernel releases and are hard to port to current systems, C) generate too much data, or D) are designed for a single system. CamFlow addresses these shortcomings by: 1) leveraging the latest kernel design advances to achieve efficiency; 2) using a self-contained, easily maintainable implementation relying on a Linux Security Module, NetFilter, and other existing kernel facilities; 3) providing a mechanism to tailor the captured provenance data to the needs of the application; and 4) making it easy to integrate provenance across distributed systems. The provenance we capture is streamed and consumed by tenant-built auditor applications. We illustrate the usability of our implementation by describing three such applications: demonstrating compliance with data regulations; performing fault/intrusion detection; and implementing data loss prevention. We also show how CamFlow can be leveraged to capture meaningful provenance without modifying existing applications. Engineering and Applied Science
Xanthus: Push-button Orchestration of Host Provenance Data Collection
Host-based anomaly detectors generate alarms by inspecting audit logs for
suspicious behavior. Unfortunately, evaluating these anomaly detectors is hard.
There are few high-quality, publicly-available audit logs, and there are no
pre-existing frameworks that enable push-button creation of realistic system
traces. To make trace generation easier, we created Xanthus, an automated tool
that orchestrates virtual machines to generate realistic audit logs. Using
Xanthus' simple management interface, administrators select a base VM image,
configure a particular tracing framework to use within that VM, and define
post-launch scripts that collect and save trace data. Once data collection is
finished, Xanthus creates a self-describing archive, which contains the VM, its
configuration parameters, and the collected trace data. We demonstrate that
Xanthus hides many of the tedious (yet subtle) orchestration tasks that humans
often get wrong; Xanthus avoids mistakes that lead to non-replicable
experiments. Comment: 6 pages, 1 figure, 7 listings, 1 table, worksho
CWLProv - Interoperable Retrospective Provenance capture and its challenges
<p>The automation of data analysis in the form of scientific workflows is a widely adopted practice in many fields of research nowadays. Computationally driven data-intensive experiments using workflows enable <strong>A</strong>utomation, <strong>S</strong>caling, <strong>A</strong>daption and <strong>P</strong>rovenance support (ASAP).</p>
<p>However, there are still several challenges associated with the effective sharing, publication, understandability and reproducibility of such workflows due to the incomplete capture of provenance and the dependence on particular technical (software) platforms. This paper presents <strong>CWLProv</strong>, an approach for retrospective provenance capture utilizing open source community-driven standards involving application and customization of workflow-centric <a href="http://www.researchobject.org/">Research Objects</a> (ROs).</p>
<p>The ROs are produced as an output of a workflow enactment defined in the <a href="http://www.commonwl.org/">Common Workflow Language</a> (CWL) using the CWL reference implementation and its data structures. The approach aggregates and annotates all the resources involved in the scientific investigation including inputs, outputs, workflow specification, command line tool specifications and input parameter settings. The resources are linked within the RO to enable re-enactment of an analysis without depending on external resources.</p>
<p>The workflow provenance profile is represented in W3C recommended standard <a href="https://www.w3.org/TR/prov-n/">PROV-N</a> and <a href="https://www.w3.org/Submission/prov-json/">PROV-JSON</a> format to capture retrospective provenance of the workflow enactment. The workflow-centric RO produced as an output of a CWL workflow enactment is expected to be interoperable, reusable, shareable and portable across different platforms.</p>
<p>This paper describes the need and motivation for <a href="https://github.com/common-workflow-language/cwltool/tree/provenance">CWLProv</a> and the lessons learned in applying it for ROs using CWL in the bioinformatics domain.</p>
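To make the idea of retrospective provenance in PROV-JSON more concrete, the following is a minimal sketch of a record for a single workflow step. It follows the general PROV-JSON layout (top-level "entity", "activity", "used" and "wasGeneratedBy" maps), but the identifiers, the example.org prefix, the timestamps and the helper function inputs_of are illustrative assumptions, not the profile that CWLProv or cwltool actually emits.

```python
import json


def build_prov_document():
    """Return a minimal PROV-JSON-style dict for one hypothetical workflow step."""
    return {
        "prefix": {"ex": "http://example.org/"},  # hypothetical namespace
        "entity": {
            "ex:input.txt": {"prov:label": "workflow input"},
            "ex:output.txt": {"prov:label": "workflow output"},
        },
        "activity": {
            "ex:step1": {
                "prov:startTime": "2018-01-01T10:00:00",
                "prov:endTime": "2018-01-01T10:05:00",
            }
        },
        # The step read the input ...
        "used": {
            "_:u1": {"prov:activity": "ex:step1", "prov:entity": "ex:input.txt"}
        },
        # ... and produced the output.
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": "ex:output.txt", "prov:activity": "ex:step1"}
        },
    }


def inputs_of(doc, output_id):
    """List the entities used by the activities that generated output_id."""
    activities = {
        rel["prov:activity"]
        for rel in doc.get("wasGeneratedBy", {}).values()
        if rel["prov:entity"] == output_id
    }
    return sorted(
        rel["prov:entity"]
        for rel in doc.get("used", {}).values()
        if rel["prov:activity"] in activities
    )


if __name__ == "__main__":
    document = build_prov_document()
    print(json.dumps(document, indent=2))
    print(inputs_of(document, "ex:output.txt"))
```

A record of this shape lets a reader trace an output back to the inputs of the step that produced it, which is the kind of question retrospective provenance is meant to answer.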
Flexible graph matching and graph edit distance using answer set programming
The graph isomorphism, subgraph isomorphism, and graph edit distance problems
are combinatorial problems with many applications. Heuristic exact and
approximate algorithms for each of these problems have been developed for
different kinds of graphs: directed, undirected, labeled, etc. However,
additional work is often needed to adapt such algorithms to different classes
of graphs, for example to accommodate both labels and property annotations on
nodes and edges. In this paper, we propose an approach based on answer set
programming. We show how each of these problems can be defined for a general
class of property graphs with directed edges, and labels and key-value
properties annotating both nodes and edges. We evaluate this approach on a
variety of synthetic and realistic graphs, demonstrating that it is feasible as
a rapid prototyping approach. Comment: To appear, PADL 202
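As an illustration of the graph isomorphism problem for labeled directed graphs, here is a small brute-force sketch in Python. It is not the answer set programming encoding proposed in the paper: it simply tries every node bijection, so it is only practical for very small graphs. The data layout (a dict of node labels and a set of labeled edges) and the function name isomorphic are assumptions chosen for the example, and key-value properties are omitted for brevity.

```python
from itertools import permutations


def isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Check isomorphism of two labeled directed graphs by exhaustive search.

    nodes_*: dict mapping node id -> node label
    edges_*: set of (source, target, edge_label) triples
    """
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    ids_a = list(nodes_a)
    # Try every bijection from the nodes of A to the nodes of B.
    for candidate in permutations(nodes_b):
        mapping = dict(zip(ids_a, candidate))
        # Node labels must be preserved by the mapping.
        if any(nodes_a[n] != nodes_b[mapping[n]] for n in ids_a):
            continue
        # Edges (with their labels) must map exactly onto the edges of B.
        mapped = {(mapping[s], mapping[t], label) for s, t, label in edges_a}
        if mapped == edges_b:
            return True
    return False


if __name__ == "__main__":
    g1_nodes = {1: "process", 2: "file"}
    g1_edges = {(1, 2, "wrote")}
    g2_nodes = {"a": "file", "b": "process"}
    g2_edges = {("b", "a", "wrote")}
    print(isomorphic(g1_nodes, g1_edges, g2_nodes, g2_edges))
```

The exhaustive search makes the definition easy to read, at the cost of factorial running time in the number of nodes.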
Towards Specificationless Monitoring of Provenance-Emitting Systems
Monitoring often requires insight into the monitored system as well as concrete specifications of expected behavior. More and more systems, however, provide information about their inner procedures by emitting provenance information in a W3C-standardized graph format.
In this work, we present an approach to monitor such provenance data for anomalous behavior by performing spectral graph analysis on slices of the constructed provenance graph and by comparing the characteristics of each slice with those of a sliding window over recently seen slices. We argue that this approach not only simplifies the monitoring of heterogeneous distributed systems, but also enables applying a host of well-studied techniques to monitor such systems.
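The sliding-window comparison described above can be sketched roughly as follows. The abstract does not say which spectral characteristics are used, so this example assumes a single feature, the largest eigenvalue of the graph Laplacian of each slice, with edges treated as undirected and the eigenvalue approximated by power iteration. The function names, the window size, the threshold and the minimum spread are all hypothetical choices for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def laplacian_lambda_max(edges, iterations=200):
    """Approximate the largest Laplacian eigenvalue of an undirected slice."""
    # Drop self-loops and treat (u, v) and (v, u) as the same edge.
    undirected = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    nodes = sorted({n for e in undirected for n in e})
    if not nodes:
        return 0.0
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    lap = [[0.0] * size for _ in range(size)]
    for u, v in undirected:
        i, j = index[u], index[v]
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0

    def apply(vec):
        return [sum(lap[i][j] * vec[j] for j in range(size)) for i in range(size)]

    # Non-uniform start vector: the all-ones vector lies in the null space.
    vec = [float(i + 1) for i in range(size)]
    for _ in range(iterations):
        nxt = apply(vec)
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the (unit-length) iterate.
    return sum(a * b for a, b in zip(vec, apply(vec)))


def find_anomalies(slices, window=5, threshold=3.0, min_spread=0.05):
    """Return positions of slices whose feature deviates from the recent window."""
    recent = deque(maxlen=window)
    flagged = []
    for position, edges in enumerate(slices):
        value = laplacian_lambda_max(edges)
        if len(recent) == window:
            spread = max(pstdev(recent), min_spread)
            if abs(value - mean(recent)) > threshold * spread:
                flagged.append(position)
        recent.append(value)
    return flagged


if __name__ == "__main__":
    path = [("p", "f")]
    star = [("c", "l1"), ("c", "l2"), ("c", "l3")]
    print(find_anomalies([path] * 5 + [star] + [path]))
```

No specification of expected behavior is needed here: a slice is flagged only because its spectral feature differs markedly from what was recently observed.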