Sampling-Based Query Re-Optimization
Despite decades of work, query optimizers still make mistakes on
"difficult" queries because of bad cardinality estimates, often due to the
interaction of multiple predicates and correlations in the data. In this paper,
we propose a low-cost post-processing step that can take a plan produced by the
optimizer, detect when it is likely to have made such a mistake, and take steps
to fix it. Specifically, our solution is a sampling-based iterative procedure
that requires almost no changes to the original query optimizer or query
evaluation mechanism of the system. We show that this indeed imposes low
overhead and catches cases where three widely used optimizers (PostgreSQL and
two commercial systems) make large errors.
Comment: This is the extended version of a paper with the same title and
authors that appears in the Proceedings of the ACM SIGMOD International
Conference on Management of Data (SIGMOD 2016
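The abstract above does not spell out the procedure, so the following is only a minimal sketch of the general idea of checking an optimizer's cardinality estimate against a sample. It is not the paper's algorithm. The function names, the Bernoulli sampling scheme, and the factor-of-10 threshold are all assumptions made for illustration.

```python
import random


def sample_rows(rows, rate, rng):
    """Bernoulli sample: keep each row independently with probability `rate`."""
    return [row for row in rows if rng.random() < rate]


def estimate_join_cardinality(left, right, key, rate, rng):
    """Estimate |left JOIN right ON key| by joining two samples.

    Each joined pair survives sampling with probability rate * rate,
    so the sampled match count is scaled up by 1 / (rate * rate).
    """
    left_sample = sample_rows(left, rate, rng)
    right_sample = sample_rows(right, rate, rng)
    counts = {}
    for row in right_sample:
        counts[row[key]] = counts.get(row[key], 0) + 1
    matches = sum(counts.get(row[key], 0) for row in left_sample)
    return matches / (rate * rate)


def looks_misestimated(optimizer_estimate, sampled_estimate, factor=10.0):
    """Flag a plan when the two estimates differ by more than `factor`
    (a hypothetical threshold, not taken from the paper)."""
    low = min(optimizer_estimate, sampled_estimate)
    high = max(optimizer_estimate, sampled_estimate)
    return high > factor * max(low, 1.0)
```

A plan flagged this way would be a candidate for re-optimization with the sampled estimate substituted for the optimizer's own; how that feedback loop iterates is left to the paper.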
How Good Are Query Optimizers, Really?
Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisi
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
Comment: 22 pages, SIGMOD 201
Accurate sampling-based cardinality estimation for complex graph queries
Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in
database systems. This problem is particularly difficult in graph databases, where queries often involve a large
number of joins and self-joins. Recently, Park et al. [54] surveyed seven state-of-the-art cardinality estimation
approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method
based on the WanderJoin online aggregation algorithm [46] consistently offers superior accuracy.
We extended the framework by Park et al. [54] with three additional datasets and repeated their experiments.
Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples
and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails
to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond
simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference,
and duplicate elimination. None of the methods considered by Park et al. [54] is applicable to such queries.
In this paper we present a novel approach for estimating the cardinality of complex graph queries. Our
approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with
arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates
converges with probability one to the actual cardinality. We present optimisations of the basic algorithm
that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that
our approach is both accurate and quick on complex queries and large datasets. Finally, we discuss how to
integrate our approach into a simple dynamic programming query planner, and we confirm empirically that
our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times
- …
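To make the random-walk idea behind WanderJoin-style estimation concrete, here is a minimal sketch for one very simple query: counting two-hop paths a -> b -> c in a directed graph. It is an illustration under stated assumptions, not the estimator proposed in the abstract above. The function names and the edge-list representation are hypothetical. Each walk picks a first edge uniformly at random, then a random outgoing edge of its endpoint, and the walk is weighted by the inverse of its sampling probability (a Horvitz-Thompson style weight).

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Map each source vertex to the list of its out-neighbours."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return adjacency


def exact_two_hop_count(edges):
    """Exact number of paths a -> b -> c, for comparison."""
    adjacency = build_adjacency(edges)
    return sum(len(adjacency.get(dst, ())) for _, dst in edges)


def estimate_two_hop_count(edges, num_walks, rng):
    """Average the inverse sampling probability over random walks.

    A walk that picks edge (a, b) and then neighbour c of b has
    probability 1 / (|E| * outdeg(b)), so it contributes
    |E| * outdeg(b). A walk that gets stuck contributes 0.
    """
    adjacency = build_adjacency(edges)
    total = 0.0
    for _ in range(num_walks):
        _, mid = rng.choice(edges)
        neighbours = adjacency.get(mid, [])
        if not neighbours:
            continue  # failed walk: no valid sample found
        _end = rng.choice(neighbours)  # completes the walk
        total += len(edges) * len(neighbours)
    return total / num_walks
```

If every walk gets stuck, the estimate is zero even when matching paths exist, which is the failure mode the abstract describes for complex queries over skewed data.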