Buffer Pool Aware Query Scheduling via Deep Reinforcement Learning
In this extended abstract, we propose a new technique for query scheduling
with the explicit goal of reducing disk reads and thus implicitly increasing
query performance. We introduce \system, a learned scheduler that leverages
overlapping data reads among incoming queries and learns a scheduling strategy
that improves cache hits. \system relies on deep reinforcement learning to
produce workload-specific scheduling strategies that focus on long-term
performance benefits while being adaptive to previously-unseen data access
patterns. We present results from a proof-of-concept prototype, demonstrating
that learned schedulers can offer significant performance improvements over
hand-crafted scheduling heuristics. Ultimately, we make the case that this is a
promising research direction at the intersection of machine learning and
databases.
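The abstract above does not spell out \system's policy, so the following is only a toy illustration of the underlying idea: reorder queued queries so that queries sharing data pages run back-to-back, turning disk reads into buffer-pool hits. The greedy heuristic below (a hand-crafted baseline of the kind the paper compares against, not the learned scheduler itself) picks the pending query with the largest overlap with pages currently resident in a small LRU pool; all names and the page-set abstraction are assumptions for the sketch.

```python
# Toy buffer-pool-aware scheduling sketch (illustrative, not the paper's \system):
# greedily run the queued query whose read set overlaps most with the pages
# currently resident in a fixed-size LRU buffer pool.
from collections import OrderedDict

def schedule(queries, pool_size=4):
    """queries: dict mapping query name -> set of page ids it reads.
    Returns the chosen execution order and the number of buffer-pool hits."""
    pool = OrderedDict()            # page id -> None, in LRU order
    pending = dict(queries)
    order, hits = [], 0
    while pending:
        # pick the query with the largest overlap with resident pages
        name = max(pending, key=lambda q: len(pending[q] & pool.keys()))
        for page in pending.pop(name):
            if page in pool:
                pool.move_to_end(page)          # refresh LRU position
                hits += 1
            else:
                if len(pool) >= pool_size:
                    pool.popitem(last=False)    # evict least-recently-used page
                pool[page] = None
        order.append(name)
    return order, hits
```

A learned scheduler in the spirit of the abstract would replace the greedy `max` with a policy trained by reinforcement learning, so it can trade an immediate cache hit for a better long-term sequence.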
Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)
Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates for string search patterns, in particular for multi-word strings like phrases or text snippets. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose pruning techniques tailored to this setting. Our approaches remove characters with low information value. The variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimates when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.
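The abstract mentions conditional entropy with respect to previous characters as one way to score information value. The sketch below is an assumption-laden illustration of that criterion, not the paper's exact algorithm: estimate per-context conditional entropy from bigram counts, then drop the characters whose context makes them most predictable before inserting strings into the tree. All function names and the keep-ratio parameter are invented for the example.

```python
# Sketch of an entropy-based pruning criterion: characters that are highly
# predictable from their predecessor carry little information and are dropped.
import math
from collections import Counter, defaultdict

def conditional_entropy(corpus):
    """Estimate H(next char | previous char) per context character
    from bigram counts over the corpus."""
    bigrams, context = Counter(), Counter()
    for s in corpus:
        for prev, cur in zip(s, s[1:]):
            bigrams[(prev, cur)] += 1
            context[prev] += 1
    h = defaultdict(float)
    for (prev, cur), n in bigrams.items():
        p = n / context[prev]
        h[prev] -= p * math.log2(p)
    return h   # high h[prev] => the next character is hard to predict

def prune(s, h, keep_ratio=0.5):
    """Keep the first character plus the positions whose context entropy
    is highest, i.e. the least predictable (most informative) characters."""
    if len(s) <= 1:
        return s
    scored = sorted(range(1, len(s)), key=lambda i: h.get(s[i - 1], 0.0),
                    reverse=True)
    keep = {0} | set(scored[:max(1, int(keep_ratio * (len(s) - 1)))])
    return "".join(s[i] for i in sorted(keep))
```

The redundancy of natural language is what makes this work: in English text, "q" is almost always followed by "u", so that "u" can be removed with little loss for estimation.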
BitE: Accelerating Learned Query Optimization in a Mixed-Workload Environment
Despite the many efforts to apply deep reinforcement learning to query
optimization in recent years, there remains room for improvement, as query
optimizers are complex entities that require hand-designed tuning for workloads
and datasets. Recent research presents learned query optimization results
mostly on single workloads, focusing on the unique traits of the specific
workload. This proves problematic in scenarios where the differing
characteristics of multiple workloads and datasets are to be mixed and learned
together. Hence, in this paper, we propose BitE, a novel ensemble learning
model that uses database statistics and metadata to tune a learned query
optimizer for better performance. Along the way, we introduce
multiple revisions to solve several challenges: we extend the search space for
the optimal Abstract SQL Plan (represented as a JSON object called ASP) by
expanding hintsets, we steer the model away from the default plans that may be
biased by configuring the experience with all unique plans of queries, and we
deviate from the traditional loss functions and choose an alternative method to
cope with underestimation and overestimation of the reward. Our model achieves
19.6% more improved queries and 15.8% fewer regressed queries compared to
existing traditional methods while using a comparable level of resources.
Comment: This work was done while the first three authors were interns at SAP
Labs Korea; they contributed equally.
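The abstract says BitE deviates from traditional loss functions to cope with under- and overestimation of the reward, but does not specify the loss. As a hedged sketch of one standard way to penalize the two error directions asymmetrically, the pinball (quantile) loss below weights underestimation more heavily than overestimation when tau > 0.5; this is an illustrative stand-in, not the paper's actual objective.

```python
# Pinball / quantile loss: an asymmetric alternative to squared error.
# With tau > 0.5, underestimating the reward costs more than overestimating it.
def pinball_loss(predicted, actual, tau=0.7):
    err = actual - predicted
    return tau * err if err >= 0 else (tau - 1) * err
```

Tuning tau shifts the model's bias: tau = 0.5 recovers a symmetric absolute-error loss, while larger values push predictions upward to avoid costly underestimates.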
COOOL: A Learning-To-Rank Approach for SQL Hint Recommendations
Query optimization is a pivotal part of every database management system
(DBMS) since it determines the efficiency of query execution. Numerous works
have introduced Machine Learning (ML) techniques to cost modeling, cardinality
estimation, and end-to-end learned optimizers, but few of them have proven
practical due to long training times, lack of interpretability, and integration
cost. A recent study provides a practical method to optimize queries by
recommending per-query hints, but it suffers from two inherent problems. First,
it follows the regression framework to predict the absolute latency of each
query plan, which is very challenging because the latencies of query plans for
a certain query may span multiple orders of magnitude. Second, it requires
training a model for each dataset, which restricts the application of the
trained models in practice. In this paper, we propose COOOL to predict Cost
Orders of query plans to cOOperate with DBMS by Learning-To-Rank. Instead of
estimating absolute costs, COOOL uses ranking-based approaches to compute
relative ranking scores of the costs of query plans. We show that COOOL is
theoretically valid to distinguish query plans with different latencies. We
implement COOOL on PostgreSQL, and extensive experiments on
join-order-benchmark and TPC-H data demonstrate that COOOL outperforms
PostgreSQL and state-of-the-art methods on single-dataset tasks as well as a
unified model for multiple-dataset tasks. Our experiments also shed some light
on why COOOL outperforms regression approaches from the representation learning
perspective, which may guide future research.
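COOOL's key move is to rank plans rather than regress absolute latencies, which span orders of magnitude. The sketch below illustrates that framing with a RankNet-style pairwise logistic loss over plan scores; the function names and the toy scoring are assumptions for the example, not the paper's model architecture.

```python
# Pairwise learning-to-rank sketch: only the relative order of plan scores
# matters, never an absolute latency estimate.
import math

def pairwise_rank_loss(score_fast, score_slow):
    """RankNet-style logistic loss: small when the faster plan already
    scores higher than the slower one, large otherwise."""
    return math.log1p(math.exp(-(score_fast - score_slow)))

def rank_plans(plans, score):
    """Order candidate plans of one query by learned ranking score,
    best first; the top plan is the recommended hint choice."""
    return sorted(plans, key=score, reverse=True)
```

Because the loss depends only on score differences within one query, the model never has to fit latencies that differ by orders of magnitude across plans, which is the regression framework's main difficulty noted above.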
Can Deep Neural Networks Predict Data Correlations from Column Names?
For humans, it is often possible to predict data correlations from column
names. We conduct experiments to find out whether deep neural networks can
learn to do the same. If so, it would, for example, open up the possibility of
tuning tools that use NLP analysis of schema elements to prioritize their
efforts for correlation detection.
We analyze correlations for around 120,000 column pairs, taken from around
4,000 data sets. We try to predict correlations, based on column names alone.
For predictions, we exploit pre-trained language models, based on the recently
proposed Transformer architecture. We consider different types of correlations,
multiple prediction methods, and various prediction scenarios. We study the
impact of factors such as column name length or the amount of training data on
prediction accuracy. Altogether, we find that deep neural networks can predict
correlations with a relatively high accuracy in many scenarios (e.g., with an
accuracy of 95% for long column names).
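The paper's predictors are pretrained Transformer language models. As a dependency-free stand-in, the sketch below frames the same task with crude character-n-gram features: embed each column name, combine the pair's features, and let any binary classifier predict "correlated" vs. "not correlated". Everything here (function names, feature choices) is an illustrative assumption, not the paper's method.

```python
# Column-pair feature sketch: a bag-of-character-n-grams stand-in for the
# Transformer embeddings used in the paper.
from collections import Counter

def ngram_features(name, n=3):
    """Bag of character n-grams for one column name (a crude embedding)."""
    padded = f"#{name.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def pair_features(name_a, name_b):
    """Features for a column pair: name lengths plus n-gram overlap, which
    rises for lexically related names like 'unit_price'/'total_price'."""
    a, b = ngram_features(name_a), ngram_features(name_b)
    overlap = sum((a & b).values())   # multiset intersection of n-gram counts
    return {"len_a": len(name_a), "len_b": len(name_b), "overlap": overlap}
```

A pretrained language model replaces the n-gram bags with contextual embeddings, which is what lets it relate semantically similar but lexically different names (e.g., "salary" and "income") that this surface-level sketch would miss.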