The typical approach for learned DBMS components is to capture the behavior
by running a representative set of queries and use the observations to train a
machine learning model. This workload-driven approach, however, has two major
downsides. First, collecting the training data can be very expensive, since all
queries need to be executed on potentially large databases. Second, training
data has to be recollected whenever the workload or the data changes. To overcome
these limitations, we take a different route: we propose to learn a purely
data-driven model that can be used for different tasks such as query answering
or cardinality estimation. This data-driven model also supports ad-hoc queries
and updates of the data without the need for full retraining when the workload
or data changes. One might expect that this comes at the price of lower
accuracy, since workload-driven models can make use of more information.
However, this is not the case. The results of our empirical evaluation
demonstrate that our data-driven approach not only provides better accuracy
than state-of-the-art learned components but also generalizes better to unseen
queries.