Complaint-driven Training Data Debugging for Query 2.0

Abuzaid Firas; Agarwal Alekh; Boehm Matthias; Chapman Adriane; Gilpin Leilani H.; Giordano Ryan; Green Todd J.; Kang Daniel; Kantchelian Alex; Khanna Rajiv; Koh Pang Wei; Konda Pradap; Krishnan Sanjay; Li Yuliang; Matthew; Metsis Vangelis; Rahm Erhard; Ribeiro Marco Túlio; Ré Christopher; Shrikumar Avanti; Sundararajan Mukund; Tanaka Daiki; Xu Jingyi; Zhang Xuezhou

Publication date: 12 April 2020
Publisher: Association for Computing Machinery (ACM)

Abstract

As the need for machine learning (ML) increases rapidly across all industry sectors, there is significant interest among commercial database providers in supporting "Query 2.0", which integrates model inference into SQL queries. Debugging Query 2.0 is very challenging because an unexpected query result may be caused by bugs in the training data (e.g., wrong labels, corrupted features). In response, we propose Rain, a complaint-driven training data debugging system. Rain allows users to specify complaints over the query's intermediate or final output, and aims to return a minimum set of training examples whose removal would resolve the complaints. To the best of our knowledge, we are the first to study this problem. A naive solution requires retraining an exponential number of ML models. We propose two novel heuristic approaches based on influence functions, both of which require only a linear number of retraining steps. We provide an in-depth analytical and empirical analysis of the two approaches and conduct extensive experiments to evaluate their effectiveness using four real-world datasets. Results show that Rain achieves the highest recall@k among all the baselines while still returning results interactively.

Comment: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
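To make the idea of ranking training examples against a complaint concrete, here is a minimal sketch. It is not Rain's implementation, and nothing in it comes from the paper beyond the general idea of using influence functions. It assumes a toy one-feature logistic regression, a COUNT-style query relaxed to a sum of predicted probabilities, and a complaint that this count is too high. All names (`fit_logreg`, `removal_effects`, the synthetic data) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logreg(X, y, lam=1e-2, iters=25):
    """Fit L2-regularized logistic regression with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + lam * theta
        hess = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta


def removal_effects(X, y, theta, Xq, lam=1e-2):
    """First-order (influence-function) estimate of how the relaxed count
    sum(sigmoid(Xq @ theta)) changes when each training row is removed."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    hess = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
    pq = sigmoid(Xq @ theta)
    grad_q = Xq.T @ (pq * (1 - pq))          # gradient of the relaxed count
    per_example_grads = X * (p - y)[:, None]  # loss gradient of each row
    # Removing row i shifts theta by about (1/n) * H^{-1} * grad_i.
    return per_example_grads @ np.linalg.solve(hess, grad_q) / n


# Assumed toy data: the true label is 1 when x > 0, but rows with x < -2
# were mislabeled as 1 (the "bug" in the training data).
x = np.linspace(-3.0, 3.0, 60)
X = np.column_stack([x, np.ones_like(x)])
y = (x > 0).astype(float)
corrupted = np.flatnonzero(x < -2.0)
y[corrupted] = 1.0

# Query rows whose predicted-positive count the user complains is too high.
xq = np.linspace(-2.5, -0.5, 20)
Xq = np.column_stack([xq, np.ones_like(xq)])

theta = fit_logreg(X, y)
effects = removal_effects(X, y, theta, Xq)
ranked = np.argsort(effects)  # rows whose removal lowers the count most come first
print("most blamed training rows:", ranked[:5])
```

The ranking needs one Hessian solve rather than retraining once per candidate subset, which is the kind of saving the abstract alludes to. A real system would still have to handle richer queries and models than this sketch does.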

Last time updated on 10/08/2021