40,213 research outputs found

    Sampling-Based Query Re-Optimization

    Full text link
    Despite of decades of work, query optimizers still make mistakes on "difficult" queries because of bad cardinality estimates, often due to the interaction of multiple predicates and correlations in the data. In this paper, we propose a low-cost post-processing step that can take a plan produced by the optimizer, detect when it is likely to have made such a mistake, and take steps to fix it. Specifically, our solution is a sampling-based iterative procedure that requires almost no changes to the original query optimizer or query evaluation mechanism of the system. We show that this indeed imposes low overhead and catches cases where three widely used optimizers (PostgreSQL and two commercial systems) make large errors.Comment: This is the extended version of a paper with the same title and authors that appears in the Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2016

    Database Learning: Toward a Database that Becomes Smarter Every Time

    Full text link
    In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.Comment: This manuscript is an extended report of the work published in ACM SIGMOD conference 201

    Learned Cardinalities: Estimating Correlated Joins with Deep Learning

    Get PDF
    We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization.Comment: CIDR 2019. https://github.com/andreaskipf/learnedcardinalitie

    Simultaneous Estimation of Photometric Redshifts and SED Parameters: Improved Techniques and a Realistic Error Budget

    Full text link
    We seek to improve the accuracy of joint galaxy photometric redshift estimation and spectral energy distribution (SED) fitting. By simulating different sources of uncorrected systematic errors, we demonstrate that if the uncertainties on the photometric redshifts are estimated correctly, so are those on the other SED fitting parameters, such as stellar mass, stellar age, and dust reddening. Furthermore, we find that if the redshift uncertainties are over(under)-estimated, the uncertainties in SED parameters tend to be over(under)-estimated by similar amounts. These results hold even in the presence of severe systematics and provide, for the first time, a mechanism to validate the uncertainties on these parameters via comparison with spectroscopic redshifts. We propose a new technique (annealing) to re-calibrate the joint uncertainties in the photo-z and SED fitting parameters without compromising the performance of the SED fitting + photo-z estimation. This procedure provides a consistent estimation of the multidimensional probability distribution function in SED fitting + z parameter space, including all correlations. While the performance of joint SED fitting and photo-z estimation might be hindered by template incompleteness, we demonstrate that the latter is "flagged" by a large fraction of outliers in redshift, and that significant improvements can be achieved by using flexible stellar populations synthesis models and more realistic star formation histories. In all cases, we find that the median stellar age is better recovered than the time elapsed from the onset of star formation [abridged].Comment: 11 pages, 5 figures, 3 tables. Accepted for publication in the Astrophysical Journa
    corecore