Successful data-driven science requires complex data engineering pipelines to
clean, transform, and alter data in preparation for machine learning, and
robust results can only be achieved when each step in the pipeline can be
justified, and its effect on the data explained. In this setting, our aim is
to provide data scientists with facilities to gain an in-depth understanding of
how each step in the pipeline affects the data, from the raw input to training
sets ready to be used for learning. Starting from an extensible set of data
preparation operators commonly used within a data science setting, in this work
we present a provenance management infrastructure for generating, storing, and
querying very granular accounts of data transformations, at the level of
individual elements within datasets whenever possible. Then, from the formal
definition of a core set of data science preprocessing operators, we derive a
provenance semantics embodied by a collection of templates expressed in PROV, a
standard model for data provenance. Using those templates as a reference, our
provenance generation algorithm generalises to any operator with observable
input/output pairs. We provide a prototype implementation of an
application-level provenance capture library to produce, in a semi-automatic
way, complete provenance documents that account for the entire pipeline. We
report on the ability of our implementations to capture provenance in real ML
benchmark pipelines and over TPC-DI synthetic data. Finally, we show how the
collected provenance can be used to answer a suite of provenance benchmark
queries that underpin some common pipeline inspection questions, as expressed
on the Data Science Stack Exchange.

Comment: 37 pages, 27 figures, submitted to a journa