Cardinality estimation in ETL processes
Cardinality estimation in ETL processes is particularly difficult. Aside from the well-known SQL operators, which are also used in ETL processes, there are a variety of operators without exact counterparts in the relational world. In addition, there are operators that support very specific data integration aspects. For such operators, there are no well-studied statistical approaches to cardinality estimation. Therefore, we propose a black-box approach and estimate the cardinality using a set of statistical models for each operator. We discuss different model granularities and develop an adaptive cardinality estimation framework for ETL processes. We map the abstract model operators to specific statistical learning approaches (regression, decision trees, support vector machines, etc.) and evaluate our cardinality estimates in an extensive experimental study.
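The black-box idea described above can be illustrated with a minimal sketch. It is not the framework from the paper: it assumes one model per operator, uses plain least-squares regression in log space as the learner, and invents the operator names (`dedup`, `split`) and the observation data purely for illustration. It also assumes all observed cardinalities are positive.

```python
import math
from collections import defaultdict


class LogLinearModel:
    """Least-squares fit of log(out) = a + b * log(in) for one operator."""

    def __init__(self):
        self.a = 0.0
        self.b = 1.0

    def fit(self, input_cards, output_cards):
        # Assumes every cardinality is > 0, since we take logarithms.
        xs = [math.log(x) for x in input_cards]
        ys = [math.log(y) for y in output_cards]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # With no variation in the inputs, fall back to a constant prediction.
        self.b = sxy / sxx if sxx > 0 else 0.0
        self.a = mean_y - self.b * mean_x

    def predict(self, input_card):
        return math.exp(self.a + self.b * math.log(input_card))


def fit_per_operator(observations):
    """Train one model per operator from (operator, in_card, out_card) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for op, n_in, n_out in observations:
        grouped[op][0].append(n_in)
        grouped[op][1].append(n_out)
    models = {}
    for op, (ins, outs) in grouped.items():
        model = LogLinearModel()
        model.fit(ins, outs)
        models[op] = model
    return models


# Hypothetical run-time observations: (operator name, input rows, output rows).
observations = [
    ("dedup", 100, 50), ("dedup", 1000, 500), ("dedup", 10000, 5000),
    ("split", 100, 300), ("split", 1000, 3000), ("split", 10000, 30000),
]
models = fit_per_operator(observations)
dedup_estimate = models["dedup"].predict(2000)
split_estimate = models["split"].predict(2000)
```

The operator is treated as opaque: only observed input and output sizes are used. An adaptive variant could refit a model whenever new observations for its operator arrive, and a different learner (a decision tree, for example) could be swapped in behind the same `fit`/`predict` interface.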
Flow-Loss: Learning Cardinality Estimates That Matter
Previous approaches to learned cardinality estimation have focused on
improving average estimation error, but not all estimates matter equally. Since
learned models inevitably make mistakes, the goal should be to improve the
estimates that make the biggest difference to an optimizer. We introduce a new
loss function, Flow-Loss, that explicitly optimizes for better query plans by
approximating the optimizer's cost model and dynamic programming search
algorithm with analytical functions. At the heart of Flow-Loss is a reduction
of query optimization to a flow routing problem on a certain plan graph in
which paths correspond to different query plans. To evaluate our approach, we
introduce the Cardinality Estimation Benchmark, which contains the ground truth
cardinalities for sub-plans of over 16K queries from 21 templates with up to 15
joins. We show that across different architectures and databases, a model
trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost
model) and query runtimes despite having worse estimation accuracy than a model
trained with Q-Error. When the test set queries closely match the training
queries, both models improve performance significantly over PostgreSQL and are
close to the optimal performance (using true cardinalities). However, the
Q-Error trained model degrades significantly when evaluated on queries that are
slightly different (e.g., similar but not identical query templates), while the
Flow-Loss trained model generalizes better to such situations. For example, the
Flow-Loss model achieves up to 1.5x better runtimes on unseen templates
compared to the Q-Error model, despite leveraging the same model architecture
and training data.
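For reference, Q-Error is commonly defined as the larger of the two ratios between an estimate and the true cardinality. The sketch below assumes that standard definition and uses made-up cardinalities. It shows why the metric treats every sub-plan alike: an underestimate and an overestimate by the same factor score identically, regardless of how much either one affects the plan the optimizer chooses.

```python
def q_error(estimate, truth):
    """Q-Error of a single estimate: max(estimate/truth, truth/estimate)."""
    if estimate <= 0 or truth <= 0:
        raise ValueError("cardinalities must be positive")
    return max(estimate / truth, truth / estimate)


def mean_q_error(estimates, truths):
    """Average Q-Error over a set of sub-plan estimates."""
    errors = [q_error(e, t) for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)


# Illustrative values only: a 10x underestimate and a 10x overestimate.
under = q_error(10, 100)    # 10.0
over = q_error(100, 10)     # 10.0
exact = q_error(50, 50)     # 1.0
average = mean_q_error([10, 100, 50], [100, 10, 50])  # 7.0
```

Flow-Loss, as described in the abstract, instead weights estimation errors by their effect on the approximated plan cost; that weighting is not reproduced here.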