Sampling-Based Query Re-Optimization
Despite decades of work, query optimizers still make mistakes on
"difficult" queries because of bad cardinality estimates, often due to the
interaction of multiple predicates and correlations in the data. In this paper,
we propose a low-cost post-processing step that can take a plan produced by the
optimizer, detect when it is likely to have made such a mistake, and take steps
to fix it. Specifically, our solution is a sampling-based iterative procedure
that requires almost no changes to the original query optimizer or query
evaluation mechanism of the system. We show that this indeed imposes low
overhead and catches cases where three widely used optimizers (PostgreSQL and
two commercial systems) make large errors.
Comment: This is the extended version of a paper with the same title and
authors that appears in the Proceedings of the ACM SIGMOD International
Conference on Management of Data (SIGMOD 2016).
Evaluation of optimization techniques for aggregation
Aggregations are almost always performed at the top of the operator tree, after all
selections and joins in a SQL query. However, they can be performed before joins and,
when used properly, make later joins much cheaper. Although some enumeration algorithms
that consider eager aggregation have been proposed, there are not enough evaluations
to guide the adoption of this technique in practice, and none use real data sets and
real queries with estimated cardinalities. It is therefore not known how eager
aggregation performs in the real world.
In this thesis, a new estimation method for group-by and join is proposed and evaluated;
it combines traditional estimation with index-based join sampling.
Two enumeration algorithms that consider eager aggregation are implemented and
compared under estimated cardinalities. We find that the new estimation
method works well with little overhead and that, under certain conditions, eager
aggregation can dramatically accelerate queries.
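As a minimal illustration of the eager aggregation idea in the abstract above (not the thesis's enumeration algorithms or its estimation method), the following Python sketch computes a per-group SUM over a join in two ways: aggregating after the join, and pre-aggregating one input on the join key before the join. The toy tables, column names, and function names are hypothetical. The rewrite is valid here because `customer_id` is assumed to be unique in `customers`.

```python
from collections import defaultdict

# Hypothetical toy tables:
#   orders(customer_id, amount), customers(customer_id, region)
orders = [(1, 10.0), (1, 5.0), (2, 7.0), (2, 3.0), (3, 1.0)]
customers = [(1, "EU"), (2, "EU"), (3, "US")]


def lazy_aggregation(orders, customers):
    """Join first, then compute SUM(amount) GROUP BY region."""
    region_of = dict(customers)  # assumes customer_id is unique in customers
    joined = [(region_of[cid], amount)
              for cid, amount in orders if cid in region_of]
    totals = defaultdict(float)
    for region, amount in joined:
        totals[region] += amount
    return dict(totals), len(joined)


def eager_aggregation(orders, customers):
    """Pre-aggregate orders on the join key, join, then aggregate again."""
    partial = defaultdict(float)
    for cid, amount in orders:
        partial[cid] += amount  # one row per customer_id reaches the join
    region_of = dict(customers)
    joined = [(region_of[cid], subtotal)
              for cid, subtotal in partial.items() if cid in region_of]
    totals = defaultdict(float)
    for region, subtotal in joined:
        totals[region] += subtotal
    return dict(totals), len(joined)


lazy_totals, lazy_join_rows = lazy_aggregation(orders, customers)
eager_totals, eager_join_rows = eager_aggregation(orders, customers)
```

Both variants return the same totals, but the eager variant feeds fewer rows into the join (one per distinct `customer_id`). Whether that saving outweighs the cost of the extra aggregation depends on the data, which is why cardinality estimates matter for this decision.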
Statistical structures for internet-scale data management
Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. In this paper, we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses and report on a comprehensive experimental evaluation of their efficiency, accuracy, and scalability.
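To make the histogram types named in the abstract above concrete, here is a small centralized Python sketch of equi-width and equi-depth histograms. It is only an illustration of the two definitions on a single in-memory list. It does not reproduce the paper's decentralized peer-to-peer algorithms, and the function names and sample data are hypothetical.

```python
def equi_width_histogram(values, num_buckets):
    """Split [min, max] into equal-width ranges and count values per range."""
    lo, hi = min(values), max(values)
    # Fall back to width 1 when all values are equal (avoids division by zero).
    width = (hi - lo) / num_buckets or 1
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that the maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


def equi_depth_boundaries(values, num_buckets):
    """Return upper bucket boundaries so buckets hold about equally many values."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = []
    for i in range(1, num_buckets + 1):
        # Index of the last value in bucket i: ceil(i * n / num_buckets) - 1.
        idx = (i * n + num_buckets - 1) // num_buckets - 1
        boundaries.append(ordered[idx])
    return boundaries


values = [1, 2, 2, 3, 3, 3, 4, 10, 10, 20]
width_counts = equi_width_histogram(values, 4)
depth_bounds = equi_depth_boundaries(values, 5)
```

On skewed data such as this sample, the equi-width buckets end up with very uneven counts, while the equi-depth boundaries adapt to where the values are dense.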
A New Framework for Join Product Skew
Different types of data skew can result in load imbalance in the context of
parallel joins under the shared-nothing architecture. We study one important
type of skew, join product skew (JPS). A static approach based on frequency
classes is proposed, which assumes that the data distribution of the join
attribute values is known. It follows from the observation that the join
selectivity can be expressed as a sum of products of frequencies of the join
attribute values. As a consequence, an appropriate assignment of join sub-tasks
that takes into consideration the magnitude of the frequency products can
alleviate join product skew. Motivated by this observation, we propose an
algorithm, called Handling Join Product Skew (HJPS), to handle join product skew.
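The observation in the abstract above, that join selectivity is a sum of products of join-attribute frequencies, can be checked on a toy equi-join. The Python sketch below is an illustration only and does not implement HJPS. The input lists and function names are hypothetical.

```python
from collections import Counter


def join_size_by_frequencies(r_keys, s_keys):
    """Equi-join result size as a sum over values v of freq_R(v) * freq_S(v)."""
    freq_r, freq_s = Counter(r_keys), Counter(s_keys)
    # Only values present in both inputs contribute a non-zero product.
    products = {v: freq_r[v] * freq_s[v] for v in freq_r if v in freq_s}
    return sum(products.values()), products


def join_size_nested_loop(r_keys, s_keys):
    """Reference result size obtained by comparing every pair of tuples."""
    return sum(1 for a in r_keys for b in s_keys if a == b)


# Hypothetical join-attribute values of two relations R and S.
r_keys = ["a", "a", "a", "b", "c"]
s_keys = ["a", "a", "b", "b", "d"]

size, products = join_size_by_frequencies(r_keys, s_keys)
selectivity = size / (len(r_keys) * len(s_keys))
```

Here the value `"a"` alone accounts for 6 of the 8 result tuples, so a sub-task that receives `"a"` produces far more output than one that receives `"b"`. This per-value imbalance in the frequency products is the kind of skew the abstract describes.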
Learned Cardinalities: Estimating Correlated Joins with Deep Learning
We describe a new deep learning approach to cardinality estimation. MSCN is a
multi-set convolutional network, tailored to representing relational query
plans, that employs set semantics to capture query features and true
cardinalities. MSCN builds on sampling-based estimation, addressing its
weaknesses when no sampled tuples qualify a predicate, and in capturing
join-crossing correlations. Our evaluation of MSCN using a real-world dataset
shows that deep learning significantly enhances the quality of cardinality
estimation, which is the core problem in query optimization.
Comment: CIDR 2019. https://github.com/andreaskipf/learnedcardinalitie