10 research outputs found

    Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines

    Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data-centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this pipeline variant to see how the change impacts the pipeline's output score. Applying existing analysis techniques to ML pipelines is technically challenging, as they are hard to integrate into existing pipeline code and their execution introduces large overheads due to repeated work.

    We propose mlwhatif to address these integration and efficiency challenges for data-centric what-if analyses on ML pipelines. mlwhatif enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. Our approach employs pipeline patches to specify changes to the data, operators and models of a pipeline. Based on these patches, we define a multi-query optimizer for efficiently executing the resulting pipeline variants jointly, with four subsumption-based optimization rules. Subsequently, we detail how to implement the pipeline variant generation and optimizer of mlwhatif. For that, we instrument native ML pipelines written in Python to extract dataflow plans with re-executable operators.

    We experimentally evaluate mlwhatif and find that its speedup scales linearly with the number of pipeline variants in applicable cases, and is invariant to the input data size. In end-to-end experiments with four analyses on more than 60 pipelines, we show speedups of up to 13x compared to sequential execution, and find that the speedup is invariant to the model and featurization in the pipeline. Furthermore, we confirm the low instrumentation overhead of mlwhatif.
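    The variant pattern described above can be written out by hand as a minimal sketch. The code below uses plain scikit-learn and pandas with hypothetical helper names such as build_pipeline and corrupt_column; it is not mlwhatif's actual API. It builds a baseline pipeline, applies two patches (one injecting missing values into a column, one swapping the imputation operator), and runs each variant sequentially, which is exactly the repeated work that mlwhatif's joint, optimized execution is meant to avoid.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(imputer_strategy="mean"):
    """Baseline pipeline: impute and scale numeric columns, one-hot encode the categorical one."""
    numeric = Pipeline([("impute", SimpleImputer(strategy=imputer_strategy)),
                        ("scale", StandardScaler())])
    features = ColumnTransformer([("num", numeric, ["age", "income"]),
                                  ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"])])
    return Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])

def corrupt_column(df, column, fraction, rng):
    """Data patch: set a random fraction of one column to NaN to simulate data errors."""
    corrupted = df.copy()
    idx = rng.choice(df.index.to_numpy(), size=int(len(df) * fraction), replace=False)
    corrupted.loc[idx, column] = np.nan
    return corrupted

# Purely synthetic example data and the baseline score.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(18, 80, 500).astype(float),
                   "income": rng.normal(50_000, 15_000, 500),
                   "country": rng.choice(["DE", "NL", "US"], 500),
                   "label": rng.integers(0, 2, 500)})
train, test = train_test_split(df, test_size=0.2, random_state=0)

baseline = build_pipeline().fit(train, train["label"])
print("baseline accuracy:", baseline.score(test, test["label"]))

# Each patch yields one pipeline variant, executed sequentially here; mlwhatif
# generates such variants from a declarative specification and executes them
# jointly, sharing the work that is common between them.
variants = {
    "10% missing ages":  lambda: build_pipeline().fit(corrupt_column(train, "age", 0.1, rng), train["label"]),
    "median imputation": lambda: build_pipeline("median").fit(train, train["label"]),
}
for name, run_variant in variants.items():
    print(name, "->", run_variant().score(test, test["label"]))
```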

    Mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses over and Over?

    Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data-centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this variant to see how the change impacts the pipeline's output score.

    We recently proposed mlwhatif, a library that enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. We demonstrate how data scientists can leverage mlwhatif for a variety of pipelines and three different what-if analyses focusing on the robustness of a pipeline against data errors, the impact of data cleaning operations, and the impact of data preprocessing operations on fairness. In particular, we demonstrate step-by-step how mlwhatif generates and optimizes the required execution plans for the pipeline analyses. Our library is publicly available at https://github.com/stefan-grafberger/mlwhatif.
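    As a rough illustration of the kind of work sharing behind the optimized execution plans the demonstration walks through, the sketch below uses plain scikit-learn (not mlwhatif's optimizer): it fits the preprocessing prefix of a pipeline once and only re-executes the model-specific suffix for each variant. mlwhatif derives such sharing automatically from the dataflow plan it extracts from the instrumented pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shared prefix: fit the preprocessing once on the training data.
scaler = StandardScaler().fit(X_train)
X_train_t, X_test_t = scaler.transform(X_train), scaler.transform(X_test)

# Variant-specific suffix: only the model differs between these what-if
# variants, so only this part needs to be executed per variant.
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    score = model.fit(X_train_t, y_train).score(X_test_t, y_test)
    print(f"{name}: {score:.3f}")
```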

    HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning

    Software systems that learn from user data with machine learning (ML) have become ubiquitous over the last few years. Recent legislation such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). However, this regulation does not only require the deletion of user data from databases, but also applies to ML models that have been learned from the stored data. We therefore argue that ML applications should offer users the ability to unlearn their data from trained models in a timely manner. We explore how fast this unlearning can be done under the constraints imposed by real-world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in place under the removal of a small fraction of training samples without retraining.

    We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, which is designed to answer unlearning requests with low latency. We detail how to efficiently implement HedgeCut with vectorised operators for decision tree learning. We conduct an experimental evaluation on five privacy-sensitive datasets, where we find that HedgeCut can unlearn training samples with a latency of around 100 microseconds and answer up to 36,000 prediction requests per second, while providing a training time and predictive accuracy similar to widely used implementations of tree-based ML models such as Random Forests.
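    A minimal conceptual sketch of why in-place unlearning can be answered far faster than retraining, assuming a tree whose split structure stays fixed and whose leaves store class counts as sufficient statistics: removing one training sample then only requires decrementing the counts in the leaf that the sample falls into. This is not HedgeCut's actual algorithm, which additionally maintains an ensemble of randomised trees with vectorised operators and handles removals that would change a split; the CountingStump class below is purely illustrative.

```python
from collections import Counter

class CountingStump:
    """One-level decision tree over a single feature with a fixed threshold."""

    def __init__(self, feature_index, threshold):
        self.feature_index = feature_index
        self.threshold = threshold
        self.leaf_counts = {"left": Counter(), "right": Counter()}

    def _leaf(self, x):
        return "left" if x[self.feature_index] <= self.threshold else "right"

    def fit(self, X, y):
        # Learning only accumulates class counts per leaf (the sufficient statistics).
        for x, label in zip(X, y):
            self.leaf_counts[self._leaf(x)][label] += 1
        return self

    def predict(self, x):
        counts = self.leaf_counts[self._leaf(x)]
        return counts.most_common(1)[0][0] if counts else None

    def unlearn(self, x, label):
        """Remove one training sample in O(1) by updating the leaf statistics in place."""
        self.leaf_counts[self._leaf(x)][label] -= 1

# Toy usage: train, answer a prediction, then serve a deletion request without retraining.
X = [[0.2], [0.4], [0.7], [0.9]]
y = [0, 0, 1, 1]
stump = CountingStump(feature_index=0, threshold=0.5).fit(X, y)
print(stump.predict([0.8]))   # -> 1
stump.unlearn([0.9], 1)       # unlearning request for one training sample
print(stump.predict([0.8]))   # still answered, now from the updated counts
```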