Query Optimization for Dynamic Imputation

Abstract

© 2017 VLDB. Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our system, ImputeDB, incorporates imputation into a cost-based query optimizer, performing necessary imputations on-the-fly for each query. This allows users to immediately explore their data, while the system picks the optimal placement of imputation operations. We evaluate this approach on three real-world survey-based datasets. Our experiments show that our query plans execute between 10 and 140 times faster than first imputing the base tables. Furthermore, we show that the query results from on-the-fly imputation differ from the traditional base-table imputation approach by 0-8%. Finally, we show that while dropping tuples with missing values that fail query constraints discards 6-78% of the data, on-the-fly imputation loses only 0-21%.
