Data is a central component of machine learning and causal inference tasks.
The availability of large amounts of data from sources such as open data
repositories, data lakes and data marketplaces creates an opportunity to
augment data and boost those tasks' performance. However, augmentation
techniques rely on a user manually discovering and shortlisting useful
candidate augmentations. Existing solutions do not leverage the synergy between
discovery and augmentation, and thus underexploit the available data.
In this paper, we introduce METAM, a novel goal-oriented framework that
queries the downstream task with a candidate dataset, forming a feedback loop
that automatically steers the discovery and augmentation process. To select
candidates efficiently, METAM leverages properties of (i) the data, (ii) the
utility function, and (iii) the solution set size. We show METAM's theoretical guarantees
and validate them empirically on a broad set of tasks. Overall, our results
demonstrate the promise of goal-oriented data discovery for modern data science
applications.

Comment: ICDE 2023 paper