An experimental study of learned cardinality estimation
Cardinality estimation is a fundamental but long unresolved problem in query optimization. Recently, multiple papers from different research groups consistently report that learned models have the potential to replace existing cardinality estimators. In this thesis, we ask a forward-thinking question: Are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. Firstly, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Secondly, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they cannot catch up with fast data updates and return large errors for different reasons. For less frequent updates, they can perform better but there is no clear winner among themselves. Thirdly, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by the changes in correlation, skewness, or domain size. More importantly, their behaviors are much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (control the cost of learned models and make learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems.
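For context, accuracy comparisons of this kind are typically reported with the q-error metric, which penalizes over- and underestimation symmetrically. A minimal sketch (the function name and clamping convention are ours, not taken from the thesis):

```python
def q_error(estimated: float, actual: float) -> float:
    """Symmetric relative error of a cardinality estimate; 1.0 is perfect."""
    estimated = max(estimated, 1.0)  # clamp to avoid division by zero
    actual = max(actual, 1.0)
    return max(estimated / actual, actual / estimated)

# An estimator that is off by 10x in either direction has q-error 10.
print(q_error(100, 1000))  # 10.0
print(q_error(1000, 100))  # 10.0
```

Because the metric is a ratio, a q-error of 10 means the same thing whether the estimator over- or undershoots by a factor of ten.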
Estimating cardinalities with deep sketches
We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run ad-hoc queries against trained sketches. We also estimate query cardinalities with HyPer and PostgreSQL to visualize the gains over traditional cardinality estimators.
Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads
Filtering data based on predicates is one of the most fundamental operations
for any modern data warehouse. Techniques to accelerate the execution of filter
expressions include clustered indexes, specialized sort orders (e.g., Z-order),
multi-dimensional indexes, and, for high selectivity queries, secondary
indexes. However, these schemes are hard to tune and their performance is
inconsistent. Recent work on learned multi-dimensional indexes has introduced
the idea of automatically optimizing an index for a particular dataset and
workload. However, the performance of that work suffers in the presence of
correlated data and skewed query workloads, both of which are common in real
applications. In this paper, we introduce Tsunami, which addresses these
limitations to achieve up to 6X faster query performance and up to 8X smaller
index size than existing learned multi-dimensional indexes, in addition to up
to 11X faster query performance and 170X smaller index size than
optimally-tuned traditional indexes.
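To make the "specialized sort orders (e.g., Z-order)" baseline concrete, here is a hedged sketch of a 2-D Z-order (Morton) key; this illustrates the general technique, not Tsunami's learned layout. Interleaving the bits of the two coordinates makes spatially close points tend to be close in the sorted order:

```python
def morton_key_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions from x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions from y
    return key

# Rows are then clustered by this key instead of by (x, y) directly.
points = [(3, 5), (3, 4), (2, 5)]
points.sort(key=lambda p: morton_key_2d(*p))
```

A learned multi-dimensional index replaces this fixed, data-oblivious ordering with a layout optimized for the observed data and workload.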
Weiterentwicklung analytischer Datenbanksysteme
This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.
DeepDB: Learn from Data, not from Queries!
The typical approach for learned DBMS components is to capture the behavior
by running a representative set of queries and use the observations to train a
machine learning model. This workload-driven approach, however, has two major
downsides. First, collecting the training data can be very expensive, since all
queries need to be executed on potentially large databases. Second, training
data has to be recollected when the workload and the data change. To overcome
these limitations, we take a different route: we propose to learn a pure
data-driven model that can be used for different tasks such as query answering
or cardinality estimation. This data-driven model also supports ad-hoc queries
and updates of the data without the need for full retraining when the workload
or data changes. Indeed, one may now expect that this comes at a price of lower
accuracy since workload-driven models can make use of more information.
However, this is not the case. The results of our empirical evaluation
demonstrate that our data-driven approach not only provides better accuracy
than state-of-the-art learned components but also generalizes better to unseen
queries.
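The accuracy gap alluded to above largely stems from the attribute-independence assumption built into traditional estimators, which a model of the joint data distribution avoids. A toy illustration with hypothetical data (counting the joint distribution exactly stands in for DeepDB's learned data-driven model):

```python
from collections import Counter

# Toy table of (city, country) pairs: the two columns are highly correlated.
rows = [("Paris", "FR")] * 90 + [("Berlin", "DE")] * 10

n = len(rows)
city = Counter(r[0] for r in rows)
country = Counter(r[1] for r in rows)

# Traditional estimate for city='Paris' AND country='DE':
# multiply per-column selectivities as if the columns were independent.
independent_est = n * (city["Paris"] / n) * (country["DE"] / n)  # roughly 9

# Data-driven estimate: consult the (learned) joint distribution instead.
joint = Counter(rows)
joint_est = joint[("Paris", "DE")]  # 0 — the true cardinality
```

The independence assumption predicts about nine matching rows where none exist; capturing the joint distribution eliminates exactly this class of error.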
Bao: Learning to Steer Query Optimizers
Query optimization remains one of the most challenging problems in data
management systems. Recent efforts to apply machine learning techniques to
query optimization challenges have been promising, but have shown few practical
gains due to substantive training overhead, inability to adapt to changes, and
poor tail performance. Motivated by these difficulties and drawing upon a long
history of research in multi-armed bandits, we introduce Bao (the BAndit
Optimizer). Bao takes advantage of the wisdom built into existing query
optimizers by providing per-query optimization hints. Bao combines modern tree
convolutional neural networks with Thompson sampling, a decades-old and
well-studied reinforcement learning algorithm. As a result, Bao automatically
learns from its mistakes and adapts to changes in query workloads, data, and
schema. Experimentally, we demonstrate that Bao can quickly (an order of
magnitude faster than previous approaches) learn strategies that improve
end-to-end query execution performance, including tail latency. In cloud
environments, we show that Bao can offer both reduced costs and better
performance compared with a sophisticated commercial system.
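The bandit core of this approach can be sketched with classic Thompson sampling over a fixed set of hint sets. This is a deliberately simplified stand-in: Bao itself scores (plan, hint set) pairs with a tree convolutional neural network, whereas the Beta posteriors and hint-set names below are illustrative assumptions only:

```python
import random

hint_sets = ["default", "no_nested_loops", "no_index_scans"]
# One Beta(successes+1, failures+1) posterior per arm
# ("success" = the query ran fast under that hint set).
stats = {h: [1, 1] for h in hint_sets}

def choose_hint_set() -> str:
    # Sample a plausible success rate from each arm's posterior,
    # then act greedily with respect to the samples.
    samples = {h: random.betavariate(a, b) for h, (a, b) in stats.items()}
    return max(samples, key=samples.get)

def update(hint_set: str, was_fast: bool) -> None:
    stats[hint_set][0 if was_fast else 1] += 1

# Per query: pick a hint set, execute, then learn from the observed latency.
h = choose_hint_set()
update(h, was_fast=True)
```

Sampling from the posterior (rather than always picking the current best arm) is what lets the optimizer keep exploring alternative hint sets and recover from its mistakes as data and workloads shift.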