Estimating Cardinalities with Deep Sketches
We introduce Deep Sketches, which are compact models of databases that allow
us to estimate the result sizes of SQL queries. Deep Sketches are powered by a
new deep learning approach to cardinality estimation that can capture
correlations between columns, even across tables. Our demonstration allows
users to define such sketches on the TPC-H and IMDb datasets, monitor the
training process, and run ad-hoc queries against trained sketches. We also
estimate query cardinalities with HyPer and PostgreSQL to visualize the gains
over traditional cardinality estimators.
Comment: To appear in SIGMOD'19.
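A sketch in this sense is a learned regression model over query features. The toy Python/PyTorch example below illustrates the general idea only; the feature encoding, network shape, and synthetic training data are assumptions made here for illustration and are not the authors' actual Deep Sketches architecture.

# Toy learned cardinality estimator: a small MLP that maps a fixed-size
# query feature vector (e.g., one-hot tables/joins plus normalized
# predicate bounds) to a log-scale cardinality estimate.
import torch
import torch.nn as nn

FEATURE_DIM = 32  # assumed size of the featurized query vector

model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # predicts log2(cardinality)
)

def q_error(est: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    """Q-error = max(est/true, true/est); 1.0 is a perfect estimate."""
    return torch.maximum(est / true, true / est)

# Synthetic stand-in for (featurized query, true cardinality) pairs.
X = torch.rand(1024, FEATURE_DIM)
y = torch.randint(1, 1_000_000, (1024, 1)).float()

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    opt.zero_grad()
    # Train in log space: squared error on log(card) penalizes
    # multiplicative rather than absolute deviations.
    loss = nn.functional.mse_loss(model(X), torch.log2(y))
    loss.backward()
    opt.step()

est = torch.exp2(model(X[:1]))  # back to linear scale
print(f"estimate: {est.item():.0f}, q-error: {q_error(est, y[:1]).item():.2f}")

Predicting in log space is the usual trick for this task: cardinalities span many orders of magnitude, so optimizing the logarithm keeps large and small results on an even footing.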
Flow-Loss: Learning Cardinality Estimates That Matter
Previous approaches to learned cardinality estimation have focused on
improving average estimation error, but not all estimates matter equally. Since
learned models inevitably make mistakes, the goal should be to improve the
estimates that make the biggest difference to an optimizer. We introduce a new
loss function, Flow-Loss, that explicitly optimizes for better query plans by
approximating the optimizer's cost model and dynamic programming search
algorithm with analytical functions. At the heart of Flow-Loss is a reduction
of query optimization to a flow routing problem on a certain plan graph in
which paths correspond to different query plans. To evaluate our approach, we
introduce the Cardinality Estimation Benchmark, which contains the ground truth
cardinalities for sub-plans of over 16K queries from 21 templates with up to 15
joins. We show that across different architectures and databases, a model
trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost
model) and query runtimes despite having worse estimation accuracy than a model
trained with Q-Error. When the test set queries closely match the training
queries, both models improve performance significantly over PostgreSQL and are
close to the optimal performance (using true cardinalities). However, the
Q-Error trained model degrades significantly when evaluated on queries that are
slightly different (e.g., similar but not identical query templates), while the
Flow-Loss trained model generalizes better to such situations. For example, the
Flow-Loss model achieves up to 1.5x better runtimes on unseen templates
compared to the Q-Error model, despite leveraging the same model architecture
and training data.
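The plan-graph reduction is easy to picture on a toy join-ordering instance: nodes are sets of already-joined tables, each edge joins in one more table, and every path from the empty set to the full set corresponds to a left-deep query plan, so the cheapest path is the best plan. The sketch below uses made-up cardinalities and a deliberately trivial cost model (each join costs its output size); the actual Flow-Loss further replaces this hard search with a smooth analytical approximation so it can be differentiated, which the sketch does not attempt.

# Query optimization as routing on a plan graph: nodes are frozensets of
# joined tables, edges add one table, and each source-to-sink path is a
# left-deep join order. Dijkstra then finds the cheapest plan.
import heapq

TABLES = ("A", "B", "C")
# Assumed (fake) cardinality estimates for each intermediate result.
CARD = {
    frozenset(): 1,
    frozenset("A"): 1_000, frozenset("B"): 5_000, frozenset("C"): 100,
    frozenset("AB"): 20_000, frozenset("AC"): 500, frozenset("BC"): 2_000,
    frozenset("ABC"): 8_000,
}

def edge_cost(node: frozenset, table: str) -> int:
    # Trivial cost model: cost of a join = size of its output.
    return CARD[node | {table}]

def cheapest_plan():
    start, goal = frozenset(), frozenset(TABLES)
    heap = [(0, [], start)]  # (cost so far, join order, joined tables)
    best = {}
    while heap:
        cost, order, node = heapq.heappop(heap)
        if node == goal:
            return cost, order
        if best.get(node, float("inf")) <= cost:
            continue
        best[node] = cost
        for t in TABLES:
            if t not in node:
                heapq.heappush(
                    heap, (cost + edge_cost(node, t), order + [t], node | {t}))

print(cheapest_plan())  # (8600, ['C', 'A', 'B']) under the fake numbers

Under these fake numbers the cheapest path joins C first, because its small intermediate results keep the running cost low; this is exactly the kind of decision an optimizer gets wrong when the cardinality estimates feeding the edge costs are off.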
Weiterentwicklung analytischer Datenbanksysteme (Advancing Analytical Database Systems)
This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.
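The primitive such a polygon index ultimately accelerates is the point-in-polygon test. As a minimal grounding example, here is the standard even-odd ray-casting test in Python; it illustrates the underlying geometry only and is not the index structure proposed in the thesis.

# Even-odd ray casting: cast a ray to the right of the point and count
# edge crossings; an odd count means the point lies inside the polygon.
def point_in_polygon(px: float, py: float,
                     polygon: list[tuple[float, float]]) -> bool:
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge (x1,y1)-(x2,y2) straddle the ray's y level?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses that y level.
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon(2.0, 2.0, square))  # True
print(point_in_polygon(5.0, 2.0, square))  # False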