Flow-Loss: Learning Cardinality Estimates That Matter
Previous approaches to learned cardinality estimation have focused on
improving average estimation error, but not all estimates matter equally. Since
learned models inevitably make mistakes, the goal should be to improve the
estimates that make the biggest difference to an optimizer. We introduce a new
loss function, Flow-Loss, that explicitly optimizes for better query plans by
approximating the optimizer's cost model and dynamic programming search
algorithm with analytical functions. At the heart of Flow-Loss is a reduction
of query optimization to a flow routing problem on a certain plan graph in
which paths correspond to different query plans. To evaluate our approach, we
introduce the Cardinality Estimation Benchmark, which contains the ground truth
cardinalities for sub-plans of over 16K queries from 21 templates with up to 15
joins. We show that across different architectures and databases, a model
trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost
model) and query runtimes despite having worse estimation accuracy than a model
trained with Q-Error. When the test set queries closely match the training
queries, both models improve performance significantly over PostgreSQL and are
close to the optimal performance (using true cardinalities). However, the
Q-Error trained model degrades significantly when evaluated on queries that are
slightly different (e.g., similar but not identical query templates), while the
Flow-Loss trained model generalizes better to such situations. For example, the
Flow-Loss model achieves up to 1.5x better runtimes on unseen templates
compared to the Q-Error model, despite leveraging the same model architecture
and training data.
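For context, the Q-Error baseline that Flow-Loss is compared against is the standard symmetric multiplicative error on cardinality estimates. A minimal sketch (Flow-Loss itself, which differentiates through a surrogate of the optimizer's cost model and plan search, is substantially more involved):

```python
def q_error(est, true):
    """Q-Error: symmetric multiplicative error; >= 1, equals 1 iff est == true.

    Penalizes a 10x overestimate and a 10x underestimate identically,
    regardless of how much either one would change the chosen query plan.
    """
    return max(est / true, true / est)

q_error(100, 10)  # 10.0 (10x overestimate)
q_error(10, 100)  # 10.0 (10x underestimate, penalized equally)
q_error(7, 7)     # 1.0  (exact estimate)
```

This plan-agnostic weighting is precisely what the abstract argues against: two estimates with the same Q-Error can have very different effects on the optimizer's final plan.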
Bao: Learning to Steer Query Optimizers
Query optimization remains one of the most challenging problems in data
management systems. Recent efforts to apply machine learning techniques to
query optimization challenges have been promising, but have shown few practical
gains due to substantial training overhead, inability to adapt to changes, and
poor tail performance. Motivated by these difficulties and drawing upon a long
history of research in multi-armed bandits, we introduce Bao (the BAndit
Optimizer). Bao takes advantage of the wisdom built into existing query
optimizers by providing per-query optimization hints. Bao combines modern tree
convolutional neural networks with Thompson sampling, a decades-old and
well-studied reinforcement learning algorithm. As a result, Bao automatically
learns from its mistakes and adapts to changes in query workloads, data, and
schema. Experimentally, we demonstrate that Bao can quickly (an order of
magnitude faster than previous approaches) learn strategies that improve
end-to-end query execution performance, including tail latency. In cloud
environments, we show that Bao can offer both reduced costs and better
performance compared with a sophisticated commercial system.
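To illustrate the bandit component, here is a minimal Beta-Bernoulli Thompson sampling sketch for choosing among per-query hint sets. The hint-set names and the win/loss bookkeeping are illustrative assumptions, not Bao's API; Bao itself applies Thompson sampling to a tree-convolution value model rather than to independent Beta posteriors:

```python
import random

def thompson_select(arms):
    """Pick the hint set whose sampled Beta-posterior win rate is highest.

    arms maps a hint-set name to [wins, losses], where a 'win' means the
    hinted plan beat the default plan on a past query. Sampling from the
    posterior (rather than taking the empirical best) balances exploiting
    good hint sets with exploring uncertain ones.
    """
    samples = {name: random.betavariate(w + 1, l + 1)
               for name, (w, l) in arms.items()}
    return max(samples, key=samples.get)

# Hypothetical hint sets with observed outcomes (names are illustrative).
arms = {"disable_nestloop": [8, 2], "disable_hashjoin": [3, 7], "default": [5, 5]}
choice = thompson_select(arms)
```

Because each selection's outcome updates the corresponding arm's counts, the selector automatically adapts as workloads, data, or schema shift the relative quality of the hint sets.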
DeepDB: Learn from Data, not from Queries!
The typical approach for learned DBMS components is to capture the behavior
by running a representative set of queries and use the observations to train a
machine learning model. This workload-driven approach, however, has two major
downsides. First, collecting the training data can be very expensive, since all
queries need to be executed on potentially large databases. Second, training
data has to be recollected when the workload or the data changes. To overcome
these limitations, we take a different route: we propose to learn a pure
data-driven model that can be used for different tasks such as query answering
or cardinality estimation. This data-driven model also supports ad-hoc queries
and updates of the data without the need for full retraining when the workload
or data changes. One might expect this flexibility to come at the price of
lower accuracy, since workload-driven models can exploit more information.
However, this is not the case. The results of our empirical evaluation
demonstrate that our data-driven approach not only provides better accuracy
than state-of-the-art learned components but also generalizes better to unseen
queries.
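The workload-driven vs. data-driven distinction can be made concrete with a toy single-column selectivity estimator built purely from the data, with no query log required. DeepDB's actual model is far richer (it captures correlations across columns and tables); this sketch only illustrates the data-driven idea and the cheap-update property:

```python
from collections import Counter

class HistogramEstimator:
    """Toy data-driven selectivity estimator.

    Built from a column sample alone; no representative query workload
    has to be executed to produce training data.
    """
    def __init__(self, column_values):
        self.n = len(column_values)
        self.counts = Counter(column_values)

    def insert(self, value):
        # Incremental update: no full retraining when the data changes.
        self.counts[value] += 1
        self.n += 1

    def selectivity(self, value):
        # Estimated fraction of rows matching `column = value`.
        return self.counts.get(value, 0) / self.n

# Hypothetical column sample; an ad-hoc predicate on an unseen value ("z")
# can be answered immediately, and new rows are absorbed incrementally.
est = HistogramEstimator(["a", "a", "b", "c", "a"])
est.selectivity("a")  # 0.6
est.selectivity("z")  # 0.0
```

A workload-driven model would instead need the predicate `column = "z"` (or something like it) to appear in its training queries, and would need those queries re-executed after the data changes.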