Robust Query Optimization Methods With Respect to Estimation Errors: A Survey
The quality of a query execution plan chosen by a Cost-Based Optimizer (CBO) depends greatly on the estimation accuracy of input parameter values. Many research results have been produced on improving the estimation accuracy, but they do not work for every situation. Therefore, "robust query optimization" was introduced, in an effort to minimize the sub-optimality risk by accepting the fact that estimates could be inaccurate. In this survey, we aim to provide an overview of robust query optimization methods by classifying them into different categories, explaining the essential ideas, listing their advantages and limitations, and comparing them with multiple criteria.
Can Deep Neural Networks Predict Data Correlations from Column Names?
For humans, it is often possible to predict data correlations from column
names. We conduct experiments to find out whether deep neural networks can
learn to do the same. If so, it would open up, for example, the possibility of tuning
tools that use NLP analysis on schema elements to prioritize their efforts for
correlation detection.
We analyze correlations for around 120,000 column pairs, taken from around
4,000 data sets. We try to predict correlations, based on column names alone.
For predictions, we exploit pre-trained language models, based on the recently
proposed Transformer architecture. We consider different types of correlations,
multiple prediction methods, and various prediction scenarios. We study the
impact of factors such as column name length or the amount of training data on
prediction accuracy. Altogether, we find that deep neural networks can predict
correlations with a relatively high accuracy in many scenarios (e.g., with an
accuracy of 95% for long column names).
How Good Are Query Optimizers, Really?
Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisi
Sampling-Based Query Re-Optimization
Despite decades of work, query optimizers still make mistakes on
"difficult" queries because of bad cardinality estimates, often due to the
interaction of multiple predicates and correlations in the data. In this paper,
we propose a low-cost post-processing step that can take a plan produced by the
optimizer, detect when it is likely to have made such a mistake, and take steps
to fix it. Specifically, our solution is a sampling-based iterative procedure
that requires almost no changes to the original query optimizer or query
evaluation mechanism of the system. We show that this indeed imposes low
overhead and catches cases where three widely used optimizers (PostgreSQL and
two commercial systems) make large errors.
Comment: This is the extended version of a paper with the same title and
authors that appears in the Proceedings of the ACM SIGMOD International
Conference on Management of Data (SIGMOD 2016).
Learning Multi-dimensional Indexes
Scanning and filtering over multi-dimensional tables are key operations in
modern analytical database engines. To optimize the performance of these
operations, databases often create clustered indexes over a single dimension or
multi-dimensional indexes such as R-trees, or use complex sort orders (e.g.,
Z-ordering). However, these schemes are often hard to tune and their
performance is inconsistent across different datasets and queries. In this
paper, we introduce Flood, a multi-dimensional in-memory index that
automatically adapts itself to a particular dataset and workload by jointly
optimizing the index structure and data storage. Flood achieves up to three
orders of magnitude faster performance for range scans with predicates than
state-of-the-art multi-dimensional indexes or sort orders on real-world
datasets and workloads. Our work serves as a building block towards an
end-to-end learned database system.