Sampling-Based Query Re-Optimization
Despite decades of work, query optimizers still make mistakes on
"difficult" queries because of bad cardinality estimates, often due to the
interaction of multiple predicates and correlations in the data. In this paper,
we propose a low-cost post-processing step that can take a plan produced by the
optimizer, detect when it is likely to have made such a mistake, and take steps
to fix it. Specifically, our solution is a sampling-based iterative procedure
that requires almost no changes to the original query optimizer or query
evaluation mechanism of the system. We show that this indeed imposes low
overhead and catches cases where three widely used optimizers (PostgreSQL and
two commercial systems) make large errors.

Comment: This is the extended version of a paper with the same title and authors that appears in the Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2016).
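The detection idea can be sketched as follows; this is a minimal illustration, not the paper's actual procedure, and all names and data below are invented. It compares an optimizer-style independence-assumption estimate against a selectivity measured on a uniform row sample, and flags plans whose q-error is large:

```python
import random

def sample_selectivity(rows, predicate, sample_size=1000, seed=0):
    """Estimate a predicate's selectivity from a uniform row sample."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    hits = sum(1 for r in sample if predicate(r))
    return hits / len(sample)

def q_error(estimated, observed, floor=1e-6):
    """Symmetric ratio error; a large value signals a likely optimizer mistake."""
    e, o = max(estimated, floor), max(observed, floor)
    return max(e / o, o / e)

# Toy data: two perfectly correlated columns that defeat independence assumptions.
rows = [(i % 100, i % 100) for i in range(10_000)]
pred = lambda r: r[0] < 10 and r[1] < 10   # truly ~10% selective

independence_estimate = 0.10 * 0.10        # the optimizer's ~1% guess
observed = sample_selectivity(rows, pred)
print(q_error(independence_estimate, observed))  # large: flag for re-optimization
```

A post-processing step like this only needs the optimizer's estimate and a row sample, which is why it requires almost no changes to the optimizer itself.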
Database Learning: Toward a Database that Becomes Smarter Every Time
In today's databases, previous query answers rarely benefit answering future
queries. For the first time, to the best of our knowledge, we change this
paradigm in an approximate query processing (AQP) context. We make the
following observation: the answer to each query reveals some degree of
knowledge about the answer to another query because their answers stem from the
same underlying distribution that has produced the entire dataset. Exploiting
and refining this knowledge should allow us to answer queries more
analytically, rather than by reading enormous amounts of raw data. Also,
processing more queries should continuously enhance our knowledge of the
underlying distribution, and hence lead to increasingly faster response times
for future queries.
We call this novel idea---learning from past query answers---Database
Learning. We exploit the principle of maximum entropy to produce answers, which
are in expectation guaranteed to be more accurate than existing sample-based
approximations. Empowered by this idea, we build a query engine on top of Spark
SQL, called Verdict. We conduct extensive experiments on real-world query
traces from a large customer of a major database vendor. Our results
demonstrate that Verdict supports 73.7% of these queries, speeding them up by
up to 23.0x for the same accuracy level compared to existing AQP systems.

Comment: This manuscript is an extended report of the work published in ACM SIGMOD conference 201
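As a toy illustration of the principle (not Verdict's maximum-entropy machinery): once past answers pin down a prior belief about an aggregate, a precision-weighted combination with a fresh sample estimate has lower variance than either source alone. All numbers below are invented.

```python
def combine(prior_mean, prior_var, sample_mean, sample_var):
    """Inverse-variance (precision-weighted) combination of two unbiased
    estimates; the result's variance is below both inputs' variances."""
    w_prior = 1.0 / prior_var
    w_sample = 1.0 / sample_var
    mean = (w_prior * prior_mean + w_sample * sample_mean) / (w_prior + w_sample)
    var = 1.0 / (w_prior + w_sample)
    return mean, var

# Past query answers imply AVG(price) ~ 52 with variance 4 (std 2);
# a fresh small sample says 49 with variance 9 (std 3).
mean, var = combine(52.0, 4.0, 49.0, 9.0)
print(mean, var)  # mean ~ 51.08, var ~ 2.77 (smaller than both 4 and 9)
```

This is the sense in which each processed query tightens the knowledge used to answer the next one.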
Learned Cardinalities: Estimating Correlated Joins with Deep Learning
We describe a new deep learning approach to cardinality estimation. MSCN is a
multi-set convolutional network, tailored to representing relational query
plans, that employs set semantics to capture query features and true
cardinalities. MSCN builds on sampling-based estimation, addressing its
weaknesses when no sampled tuples qualify a predicate, and in capturing
join-crossing correlations. Our evaluation of MSCN using a real-world dataset
shows that deep learning significantly enhances the quality of cardinality
estimation, which is the core problem in query optimization.

Comment: CIDR 2019. https://github.com/andreaskipf/learnedcardinalitie
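The permutation-invariant set modules such an architecture relies on can be sketched as below; this is a simplified stand-in with random weights, not MSCN's actual network or feature encoding. Each set element passes through a shared MLP, and pooling over elements makes the encoding independent of their order:

```python
import numpy as np

rng = np.random.default_rng(0)

def set_module(elements, w1, w2):
    """Permutation-invariant set encoding: shared per-element ReLU MLP,
    then average pooling over the set dimension."""
    h = np.maximum(elements @ w1, 0.0)   # shared first layer
    h = np.maximum(h @ w2, 0.0)          # shared second layer
    return h.mean(axis=0)                # pooling makes element order irrelevant

# Hypothetical features for two predicates of a query plan.
predicates = np.array([[1.0, 0.0, 0.3],    # e.g. (col_a, <, 0.3)
                       [0.0, 1.0, 0.7]])   # e.g. (col_b, >, 0.7)
w1 = rng.standard_normal((3, 8))
w2 = rng.standard_normal((8, 4))

enc = set_module(predicates, w1, w2)
enc_perm = set_module(predicates[::-1], w1, w2)  # reordered set, same encoding
```

Set semantics matter here because a query's tables, joins, and predicates are unordered collections, so their representation should not depend on enumeration order.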
Simultaneous Estimation of Photometric Redshifts and SED Parameters: Improved Techniques and a Realistic Error Budget
We seek to improve the accuracy of joint galaxy photometric redshift
estimation and spectral energy distribution (SED) fitting. By simulating
different sources of uncorrected systematic errors, we demonstrate that if the
uncertainties on the photometric redshifts are estimated correctly, so are
those on the other SED fitting parameters, such as stellar mass, stellar age,
and dust reddening. Furthermore, we find that if the redshift uncertainties are
over(under)-estimated, the uncertainties in SED parameters tend to be
over(under)-estimated by similar amounts. These results hold even in the
presence of severe systematics and provide, for the first time, a mechanism to
validate the uncertainties on these parameters via comparison with
spectroscopic redshifts. We propose a new technique (annealing) to re-calibrate
the joint uncertainties in the photo-z and SED fitting parameters without
compromising the performance of the SED fitting + photo-z estimation. This
procedure provides a consistent estimation of the multidimensional probability
distribution function in SED fitting + z parameter space, including all
correlations. While the performance of joint SED fitting and photo-z estimation
might be hindered by template incompleteness, we demonstrate that the latter is
"flagged" by a large fraction of outliers in redshift, and that significant
improvements can be achieved by using flexible stellar populations synthesis
models and more realistic star formation histories. In all cases, we find that
the median stellar age is better recovered than the time elapsed from the onset
of star formation [abridged].

Comment: 11 pages, 5 figures, 3 tables. Accepted for publication in the Astrophysical Journal
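The proposed validation of uncertainties against spectroscopic redshifts amounts to a coverage check, which can be sketched as follows. Everything below is synthetic and illustrative, not the paper's pipeline: well-calibrated Gaussian errors should place roughly 68% of spectroscopic redshifts within one reported sigma of the photometric estimate, with over(under)-coverage revealing over(under)-estimated uncertainties.

```python
import random

def coverage(z_phot, sigma_z, z_spec, n_sigma=1.0):
    """Fraction of spectroscopic redshifts falling within n_sigma of the
    photometric estimate; ~68% at 1 sigma indicates calibrated errors."""
    inside = sum(1 for zp, s, zs in zip(z_phot, sigma_z, z_spec)
                 if abs(zp - zs) <= n_sigma * s)
    return inside / len(z_phot)

# Hypothetical catalogue with well-calibrated Gaussian redshift errors.
rnd = random.Random(1)
z_spec = [rnd.uniform(0.1, 2.0) for _ in range(5000)]
sigma = [0.05 * (1 + z) for z in z_spec]     # scatter grows with (1 + z)
z_phot = [z + rnd.gauss(0.0, s) for z, s in zip(z_spec, sigma)]

print(coverage(z_phot, sigma, z_spec))   # close to 0.68 when calibrated
```

By the paper's argument, a coverage test that passes for the redshifts also lends confidence to the uncertainties on the other SED parameters.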