4 research outputs found
Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance
Successful data-driven science requires complex data engineering pipelines to
clean, transform, and alter data in preparation for machine learning, and
robust results can only be achieved when each step in the pipeline can be
justified, and its effect on the data explained. In this framework, our aim is
to provide data scientists with facilities to gain an in-depth understanding of
how each step in the pipeline affects the data, from the raw input to training
sets ready to be used for learning. Starting from an extensible set of data
preparation operators commonly used within a data science setting, in this work
we present a provenance management infrastructure for generating, storing, and
querying very granular accounts of data transformations, at the level of
individual elements within datasets whenever possible. Then, from the formal
definition of a core set of data science preprocessing operators, we derive a
provenance semantics embodied by a collection of templates expressed in PROV, a
standard model for data provenance. Using those templates as a reference, our
provenance generation algorithm generalises to any operator with observable
input/output pairs. We provide a prototype implementation of an
application-level provenance capture library to produce, in a semi-automatic
way, complete provenance documents that account for the entire pipeline. We
report on the ability of our implementations to capture provenance in real ML
benchmark pipelines and over TPC-DI synthetic data. We finally show how the
collected provenance can be used to answer a suite of provenance benchmark
queries that underpin some common pipeline inspection questions, as expressed
on the Data Science Stack Exchange.
Comment: 37 pages, 27 figures, submitted to a journal
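As a rough illustration of the PROV templates the abstract describes, the sketch below uses the Python `prov` package to record an element-level derivation for one hypothetical imputation step. The operator, identifiers, and attributes are illustrative assumptions, not the paper's actual capture library.

```python
# Minimal sketch (not the paper's library): element-level PROV for one
# hypothetical imputation step, using the `prov` package (pip install prov).
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# The preprocessing step is modelled as a PROV activity.
impute = doc.activity("ex:impute_mean_age")

# Input and output cells are modelled as entities, one per dataset element,
# so queries can trace individual values rather than whole files.
cell_in = doc.entity("ex:df1_r42_age", {"ex:value": "NaN"})
cell_out = doc.entity("ex:df2_r42_age", {"ex:value": 37.2})

doc.used(impute, cell_in)
doc.wasGeneratedBy(cell_out, impute)
doc.wasDerivedFrom(cell_out, cell_in, activity=impute)

print(doc.get_provn())  # serialise the document in PROV-N notation
```

Instantiating one such template per affected element is what yields the "very granular accounts of data transformations" the abstract refers to.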
PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
The ubiquitous use of machine learning algorithms brings new challenges to
traditional database problems such as incremental view update. Much effort is
being put in better understanding and debugging machine learning models, as
well as in identifying and repairing errors in training datasets. Our focus is
on how to assist these activities when they have to retrain the machine
learning model after removing problematic training samples in cleaning or
selecting different subsets of training data for interpretability. This paper
presents an efficient provenance-based approach, PrIU, and its optimized
version, PrIU-opt, for incrementally updating model parameters without
sacrificing prediction accuracy. We prove the correctness and convergence of
the incrementally updated model parameters, and validate it experimentally.
Experimental results show that up to two orders of magnitude speed-ups can be
achieved by PrIU-opt compared to simply retraining the model from scratch, yet
obtaining highly similar models.
Comment: 28 pages, published in 2020 ACM SIGMOD International Conference on
Management of Data (SIGMOD 2020)
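The incremental-update idea can be illustrated for ordinary least squares, where deleting a training sample is a rank-one downdate of the cached normal equations via the Sherman-Morrison identity. This is a generic sketch of that well-known identity, not PrIU's actual algorithm (which also handles models such as logistic regression).

```python
# Generic sketch (not PrIU itself): removing one training sample from an
# ordinary-least-squares fit via a Sherman-Morrison rank-one downdate,
# instead of retraining on the remaining n-1 samples from scratch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)

# Cache the inverse Gram matrix and moment vector from the full fit.
A_inv = np.linalg.inv(X.T @ X)   # (X^T X)^{-1}
b = X.T @ y                      # X^T y
theta_full = A_inv @ b

# Downdate: drop sample i without touching the other 999 rows.
i = 42
x_i, y_i = X[i], y[i]
u = A_inv @ x_i
A_inv_new = A_inv + np.outer(u, u) / (1.0 - x_i @ u)  # Sherman-Morrison
theta_new = A_inv_new @ (b - y_i * x_i)

# Sanity check against retraining from scratch on the remaining samples.
mask = np.arange(len(y)) != i
theta_retrain = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
assert np.allclose(theta_new, theta_retrain)
```

The downdate costs O(d^2) per removed sample rather than the O(nd^2) of refitting, which is the kind of gap behind the speed-ups the abstract reports.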
Fine-Grained Provenance And Applications To Data Analytics Computation
Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks over data types such as strings, images, etc. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, enabling future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, provenance collected over time, or within a single query computation, tends to be partially repeated (the same operation applied to the same input records at an intermediate step), so we desire provenance storage that is efficient, i.e., that compresses repeated results. We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
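A hedged sketch of the storage idea the abstract gestures at: content-addressing each provenance record by a cryptographic hash deduplicates repeated sub-derivations and makes tampering detectable, since any modification changes the hash. The schema and function names below are invented for illustration and are not PROVision's actual design.

```python
# Illustrative sketch (not PROVision's design): a content-addressed
# provenance store in SQLite. Hashing each record (a) deduplicates repeated
# sub-derivations and (b) makes tampering detectable on verification.
import hashlib, json, sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prov (hash TEXT PRIMARY KEY, record TEXT)")

def put(op: str, inputs: list[str], output: str) -> str:
    """Store one provenance record; return its content hash."""
    record = json.dumps({"op": op, "in": sorted(inputs), "out": output},
                        sort_keys=True)
    h = hashlib.sha256(record.encode()).hexdigest()
    # INSERT OR IGNORE: identical records (same op, same inputs) share one row.
    con.execute("INSERT OR IGNORE INTO prov VALUES (?, ?)", (h, record))
    return h

def verify(h: str) -> bool:
    """Detect tampering: the stored record must still hash to its key."""
    (record,) = con.execute(
        "SELECT record FROM prov WHERE hash = ?", (h,)).fetchone()
    return hashlib.sha256(record.encode()).hexdigest() == h

h1 = put("join", ["r1", "r2"], "t7")
h2 = put("join", ["r2", "r1"], "t7")  # repeated work: deduplicated (h1 == h2)
assert h1 == h2 and verify(h1)
```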