Data cleaning is a time-consuming process that depends on the data analysis
that users perform. Existing solutions treat data cleaning as a separate
offline process that takes place before analysis begins. Applying data cleaning
before analysis assumes a priori knowledge of the inconsistencies and the query
workload, and so spends effort on understanding and cleaning data that the
analysis does not need. We propose an approach that performs
probabilistic repair of denial constraint violations on demand, driven by the
exploratory analysis that users perform. We introduce Daisy, a system that
seamlessly integrates data cleaning into the analysis by relaxing query
results. Daisy executes analytical query workloads over dirty data by weaving
cleaning operators into the query plan. Our evaluation shows that Daisy adapts
to the workload and outperforms traditional offline cleaning on both synthetic
and real-world workloads.

Comment: To appear in SIGMOD 2020 proceedings