Neo: A Learned Query Optimizer
Query optimization is one of the most challenging problems in database
systems. Despite the progress made over the past decades, query optimizers
remain extremely complex components that require a great deal of hand-tuning
for specific workloads and datasets. Motivated by this shortcoming and inspired
by recent advances in applying machine learning to data management challenges,
we introduce Neo (Neural Optimizer), a novel learning-based query optimizer
that relies on deep neural networks to generate query execution plans. Neo
bootstraps its query optimization model from existing optimizers and continues
to learn from incoming queries, building upon its successes and learning from
its failures. Furthermore, Neo naturally adapts to underlying data patterns and
is robust to estimation errors. Experimental results demonstrate that Neo, even
when bootstrapped from a simple optimizer like PostgreSQL, can learn a model
that offers similar performance to state-of-the-art commercial optimizers, and
in some cases even surpasses them.
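To make the bootstrap-then-refine idea concrete, here is a minimal Python sketch: a tiny value network is first fitted to the cost estimates of a hand-written "teacher" cost function (standing in for the optimizer Neo bootstraps from) and is then used to pick join orders. The toy tables, cost model, and network sizes are illustrative assumptions, not the paper's implementation.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    TABLES = ["A", "B", "C", "D"]
    SIZE = {"A": 1.0, "B": 5.0, "C": 2.0, "D": 8.0}

    def teacher_cost(order):
        # Toy stand-in for the bootstrap optimizer's cost estimate:
        # large tables joined early are penalized.
        return sum(SIZE[t] / (i + 1) for i, t in enumerate(order))

    def featurize(order):
        # One-hot position-by-table encoding of a join order (16 = 4 x 4).
        x = np.zeros((len(TABLES), len(TABLES)))
        for pos, t in enumerate(order):
            x[pos, TABLES.index(t)] = 1.0
        return x.ravel()

    # Tiny one-hidden-layer value network trained with plain SGD.
    W1 = rng.normal(0, 0.1, (16, 16)); b1 = np.zeros(16)
    W2 = rng.normal(0, 0.1, 16); b2 = 0.0

    def predict(x):
        h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
        return W2 @ h + b2, h

    # Phase 1: bootstrap from the teacher's costs for every join order.
    data = [(featurize(o), teacher_cost(o))
            for o in itertools.permutations(TABLES)]
    for _ in range(200):
        for x, y in data:
            yhat, h = predict(x)
            g = yhat - y                       # gradient of 0.5*(yhat - y)^2
            gh = g * W2 * (h > 0)              # backprop through the ReLU
            W2 -= 0.01 * g * h; b2 -= 0.01 * g
            W1 -= 0.01 * np.outer(gh, x); b1 -= 0.01 * gh

    # Phase 2 (learning from real query executions) would replace
    # teacher_cost with observed latencies; omitted in this sketch.
    pick = min(itertools.permutations(TABLES),
               key=lambda o: predict(featurize(o))[0])
    print("value-network pick:", pick)
    print("teacher pick:", min(itertools.permutations(TABLES), key=teacher_cost))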
Deep Reinforcement Learning for Join Order Enumeration
Join order selection plays a significant role in query performance. However,
modern query optimizers typically employ static join enumeration algorithms
that do not receive any feedback about the quality of the resulting plan.
Hence, optimizers often repeatedly choose the same bad plan, as they do not
have a mechanism for "learning from their mistakes". In this paper, we argue
that existing deep reinforcement learning techniques can be applied to address
this challenge. These techniques, powered by artificial neural networks, can
automatically improve decision making by incorporating feedback from their
successes and failures. Towards this goal, we present ReJOIN, a
proof-of-concept join enumerator, and report preliminary results indicating
that ReJOIN can match or outperform the PostgreSQL optimizer in terms of plan
quality and join enumeration efficiency.
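The following toy sketch illustrates the feedback loop the paper argues for, using tabular Monte-Carlo reinforcement learning rather than the paper's neural policy: the state is the partial join order, the action is the next relation to join, and the episode reward is the negative cost of the finished plan under a hypothetical cost model. All names and constants are assumptions for illustration.

    import random
    from collections import defaultdict

    random.seed(1)
    TABLES = ("users", "orders", "items", "logs")
    SIZE = {"users": 1.0, "orders": 5.0, "items": 2.0, "logs": 8.0}

    def plan_cost(order):
        # Hypothetical cost model: big relations joined early are expensive.
        return sum(SIZE[t] / (i + 1) for i, t in enumerate(order))

    Q = defaultdict(float)        # Q[(state, action)] -> learned value
    ALPHA, EPS = 0.1, 0.2

    for _ in range(5000):
        state, trace = (), []
        while len(state) < len(TABLES):
            actions = [t for t in TABLES if t not in state]
            if random.random() < EPS:                    # explore
                a = random.choice(actions)
            else:                                        # exploit
                a = max(actions, key=lambda t: Q[(state, t)])
            trace.append((state, a))
            state = state + (a,)
        reward = -plan_cost(state)                       # feedback on the plan
        for s, a in trace:                               # Monte-Carlo update
            Q[(s, a)] += ALPHA * (reward - Q[(s, a)])

    # Greedy rollout with the learned values.
    state = ()
    while len(state) < len(TABLES):
        a = max((t for t in TABLES if t not in state), key=lambda t: Q[(state, t)])
        state = state + (a,)
    print("learned join order:", state, "cost:", round(plan_cost(state), 3))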
Estimating Cardinalities with Deep Sketches
We introduce Deep Sketches, which are compact models of databases that allow
us to estimate the result sizes of SQL queries. Deep Sketches are powered by a
new deep learning approach to cardinality estimation that can capture
correlations between columns, even across tables. Our demonstration allows
users to define such sketches on the TPC-H and IMDb datasets, monitor the
training process, and run ad-hoc queries against trained sketches. We also
estimate query cardinalities with HyPer and PostgreSQL to visualize the gains
over traditional cardinality estimators.
Comment: To appear in SIGMOD'19
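As a rough illustration of the sketch idea (a compact model trained on query/cardinality pairs that then answers cardinality queries on its own), here is a minimal Python example. It uses synthetic correlated columns and closed-form ridge regression as a crude stand-in for the paper's deep model; the feature encoding and the q-error check at the end are illustrative assumptions, only meant to show the interface.

    import numpy as np

    rng = np.random.default_rng(0)
    # Correlated columns: y depends on x, so independence assumptions fail.
    x = rng.uniform(0, 1, 100_000)
    y = np.clip(x + rng.normal(0, 0.1, x.size), 0, 1)

    def true_card(lo_x, hi_x, lo_y, hi_y):
        return np.sum((x >= lo_x) & (x <= hi_x) & (y >= lo_y) & (y <= hi_y))

    # Training set: random range predicates labelled with true log-cardinality.
    feats, labels = [], []
    for _ in range(2000):
        a, b = np.sort(rng.uniform(0, 1, 2))
        c, d = np.sort(rng.uniform(0, 1, 2))
        feats.append([a, b, c, d, (b - a) * (d - c), 1.0])
        labels.append(np.log1p(true_card(a, b, c, d)))
    X, z = np.array(feats), np.array(labels)

    # Closed-form ridge regression as the stand-in "sketch" model.
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(6), X.T @ z)

    def estimate(lo_x, hi_x, lo_y, hi_y):
        f = np.array([lo_x, hi_x, lo_y, hi_y,
                      (hi_x - lo_x) * (hi_y - lo_y), 1.0])
        return np.expm1(f @ w)          # answered from the model alone

    truth = true_card(0.2, 0.4, 0.1, 0.5)
    est = estimate(0.2, 0.4, 0.1, 0.5)
    qerr = max(est, truth) / max(min(est, truth), 1)   # standard q-error metric
    print(f"true={truth} est={est:.0f} q-error={qerr:.2f}")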
Learned Cardinalities: Estimating Correlated Joins with Deep Learning
We describe a new deep learning approach to cardinality estimation. MSCN is a
multi-set convolutional network, tailored to representing relational query
plans, that employs set semantics to capture query features and true
cardinalities. MSCN builds on sampling-based estimation, addressing its
weaknesses when no sampled tuples qualify a predicate, and in capturing
join-crossing correlations. Our evaluation of MSCN using a real-world dataset
shows that deep learning significantly enhances the quality of cardinality
estimation, which is the core problem in query optimization.
Comment: CIDR 2019. https://github.com/andreaskipf/learnedcardinalities
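The set-oriented architecture can be sketched as follows: each of the three query components (tables, joins, predicates) is a set whose elements pass through a shared per-set module, are average-pooled into a fixed-size vector, and the pooled vectors are concatenated and fed to an output layer. The Python sketch below shows only an untrained forward pass with assumed feature and hidden sizes, not the trained MSCN.

    import numpy as np

    rng = np.random.default_rng(0)
    H = 8  # hidden width (assumption)

    def module(weights, X):
        W, b = weights
        return np.maximum(0, X @ W + b)   # one ReLU layer per set module

    def init(in_dim):
        return rng.normal(0, 0.3, (in_dim, H)), np.zeros(H)

    table_net, join_net, pred_net = init(3), init(4), init(5)
    Wout = rng.normal(0, 0.3, (3 * H, 1))

    def forward(tables, joins, preds):
        # tables/joins/preds: 2-D arrays, one featurized set element per row.
        pooled = [module(net, X).mean(axis=0)        # permutation-invariant
                  for net, X in ((table_net, tables),
                                 (join_net, joins),
                                 (pred_net, preds))]
        return (np.concatenate(pooled) @ Wout).item()   # predicted log-cardinality

    # Example query: 2 tables, 1 join, 2 predicates (feature vectors are dummies).
    out = forward(rng.random((2, 3)), rng.random((1, 4)), rng.random((2, 5)))
    print("untrained model output:", round(out, 4))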
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly efficient policies
automatically. Our system, Decima, uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load.
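To give a flavor of the dependency-graph representations mentioned above, here is a small Python sketch: a per-stage value is aggregated bottom-up over a toy job DAG (a critical-path-style quantity here), and a simple policy schedules runnable stages by that value. In Decima both the aggregation and the policy are learned neural networks; everything in this sketch, including the DAG and work estimates, is an illustrative stand-in.

    from collections import defaultdict

    # Toy job DAG: stage -> downstream (child) stages, plus per-stage work.
    children = {"s0": ["s1", "s2"], "s1": ["s3"], "s2": ["s3"], "s3": []}
    work = {"s0": 2.0, "s1": 4.0, "s2": 1.0, "s3": 3.0}

    _memo = {}
    def embed(stage):
        # Bottom-up aggregation over the DAG: the longest chain of work
        # from this stage to a sink (a critical-path-style "embedding").
        if stage not in _memo:
            below = max((embed(c) for c in children[stage]), default=0.0)
            _memo[stage] = work[stage] + below
        return _memo[stage]

    # Invert the edges once so we can test whether a stage is runnable.
    parents = defaultdict(set)
    for s, cs in children.items():
        for c in cs:
            parents[c].add(s)

    # Policy using the embeddings: among runnable stages, run the one with
    # the longest remaining chain (a stand-in for the learned policy).
    done, order = set(), []
    while len(done) < len(children):
        runnable = [s for s in children if s not in done and parents[s] <= done]
        nxt = max(runnable, key=embed)
        order.append(nxt); done.add(nxt)
    print("schedule:", order)   # expected: ['s0', 's1', 's2', 's3']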