
    Learned Cardinalities: Estimating Correlated Joins with Deep Learning

    We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization. Comment: CIDR 2019. https://github.com/andreaskipf/learnedcardinalitie
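
    The abstract describes the multi-set architecture only at a high level. As a rough illustration of that idea (a small MLP applied to each element of the table, join, and predicate feature sets, followed by average pooling and a final output network), here is a minimal PyTorch sketch; the feature encodings, layer sizes, and output normalization are assumptions made for illustration and are not taken from the linked repository.

```python
# Minimal sketch of a multi-set convolutional network for cardinality
# estimation. Dimensions and feature encodings are illustrative assumptions.
import torch
import torch.nn as nn


class SetModule(nn.Module):
    """Per-element MLP followed by average pooling over a padded set."""

    def __init__(self, in_dim: int, hid: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, hid), nn.ReLU())

    def forward(self, x, mask):
        # x: (batch, max_set_size, in_dim), mask: (batch, max_set_size, 1)
        h = self.mlp(x) * mask                       # zero out padded slots
        return h.sum(dim=1) / mask.sum(dim=1).clamp(min=1)


class MSCN(nn.Module):
    """Multi-set network over table, join, and predicate feature sets."""

    def __init__(self, table_dim: int, join_dim: int, pred_dim: int,
                 hid: int = 256):
        super().__init__()
        self.tables = SetModule(table_dim, hid)
        self.joins = SetModule(join_dim, hid)
        self.preds = SetModule(pred_dim, hid)
        self.out = nn.Sequential(nn.Linear(3 * hid, hid), nn.ReLU(),
                                 nn.Linear(hid, 1), nn.Sigmoid())

    def forward(self, t, t_mask, j, j_mask, p, p_mask):
        z = torch.cat([self.tables(t, t_mask),
                       self.joins(j, j_mask),
                       self.preds(p, p_mask)], dim=1)
        # The sigmoid output is read as a cardinality normalized into [0, 1]
        # in log space; training would compare it against true cardinalities.
        return self.out(z)
```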

    An experimental study of learned cardinality estimation

    Cardinality estimation is a fundamental but long-unresolved problem in query optimization. Recently, multiple papers from different research groups have consistently reported that learned models have the potential to replace existing cardinality estimators. In this thesis, we ask a forward-thinking question: are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. First, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Second, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they cannot keep up with fast data updates and return large errors for different reasons. For less frequent updates, they perform better, but there is no clear winner among them. Third, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by changes in correlation, skewness, or domain size. More importantly, their behavior is much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (controlling the cost of learned models and making learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems.
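
    Accuracy comparisons in this line of work are usually reported as the q-error, the multiplicative factor by which an estimate deviates from the true cardinality. The abstract does not name its metric, so the following is only a small illustrative sketch of that conventional measure.

```python
# Q-error: the standard multiplicative error metric in the cardinality
# estimation literature (assumed here as the accuracy measure; the thesis
# abstract does not state its metric explicitly).
def q_error(estimate: float, truth: float, eps: float = 1.0) -> float:
    est, tru = max(estimate, eps), max(truth, eps)
    return max(est / tru, tru / est)


# Example: an estimate of 1,000 rows against a true cardinality of 50,000
# rows gives a q-error of 50, i.e. the estimate is off by a factor of 50.
print(q_error(1_000, 50_000))   # 50.0
```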

    DeepDB: Learn from Data, not from Queries!

    The typical approach for learned DBMS components is to capture the behavior of a component by running a representative set of queries and to use the observations to train a machine learning model. This workload-driven approach, however, has two major downsides. First, collecting the training data can be very expensive, since all queries need to be executed on potentially large databases. Second, the training data has to be recollected when the workload or the data changes. To overcome these limitations, we take a different route: we propose to learn a pure data-driven model that can be used for different tasks such as query answering or cardinality estimation. This data-driven model also supports ad-hoc queries and updates of the data without the need for full retraining when the workload or data changes. One might expect that this comes at the price of lower accuracy, since workload-driven models can make use of more information; however, this is not the case. The results of our empirical evaluation demonstrate that our data-driven approach not only provides better accuracy than state-of-the-art learned components but also generalizes better to unseen queries.
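
    The contrast the abstract draws is between training on observed queries and building a model from the data alone. The toy estimator below illustrates only that workflow (build statistics from the data itself, answer cardinality estimates, absorb inserts without retraining) using per-column histograms under an independence assumption; it is a deliberately simplistic stand-in and not the model proposed in the paper.

```python
# Toy data-driven cardinality estimator: per-column value counts combined
# under an independence assumption. Illustrates the data-driven workflow
# only; DeepDB's actual model is far more expressive.
from collections import Counter


class ToyDataDrivenEstimator:
    def __init__(self, rows: list[dict]):
        self.n = 0
        self.counts = {}                      # column -> Counter of values
        for row in rows:
            self.insert(row)

    def insert(self, row: dict) -> None:
        """Ad-hoc update without retraining: just adjust the statistics."""
        self.n += 1
        for col, val in row.items():
            self.counts.setdefault(col, Counter())[val] += 1

    def estimate(self, predicates: dict) -> float:
        """Estimated row count for conjunctive equality predicates."""
        selectivity = 1.0
        for col, val in predicates.items():
            selectivity *= self.counts[col][val] / self.n
        return selectivity * self.n


rows = [{"city": "Berlin", "year": 2019}, {"city": "Berlin", "year": 2020},
        {"city": "Munich", "year": 2020}]
est = ToyDataDrivenEstimator(rows)
est.insert({"city": "Berlin", "year": 2020})            # no retraining needed
print(est.estimate({"city": "Berlin", "year": 2020}))   # independence estimate
```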

    A query processing system for very large spatial databases using a new map algebra

    In this thesis we introduce a query processing approach for spatial databases and explain the main concepts we defined and developed: a spatial algebra and a graph-based approach used in the optimizer. The spatial algebra was defined to express queries and transformation rules during the different steps of query optimization. To cover a vast variety of potential applications, we tried to define the algebra as completely as possible. The algebra views spatial data as maps of spatial objects: the algebraic operators act on maps and produce new maps, while aggregate functions act on maps and objects and produce objects or basic values (characters, numbers, etc.). The optimizer receives the query as an algebraic expression and produces an efficient QEP (Query Evaluation Plan) through two main consecutive steps: QEG (Query Evaluation Graph) generation and QEP generation. In QEG generation, we construct a graph equivalent to the algebraic expression and then apply graph transformation rules to produce an efficient QEG. In QEP generation, we take the efficient QEG, perform predicate ordering and approximation, and then generate the efficient QEP. The QEP is a set of consecutive phases that must be executed in the specified order; each phase consists of one or more primitive operations, and all primitive operations within the same phase can be executed in parallel. We implemented the optimizer, a random spatial query generator, and a simulated spatial database. The query generator produces random queries for the purpose of testing the optimizer. The simulated spatial database is a set of functions that simulate primitive spatial operations and return the cost of the corresponding operation according to its input parameters. We fed the randomly generated queries to the optimizer, took the generated QEPs, and ran them through the spatial database simulator. We used the experimental results to discuss the characteristics and performance of the optimizer. The optimizer was designed for databases with a very large number of spatial objects; nevertheless, most of the concepts we used can be applied to all spatial information systems.
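
    The QEP-generation step above mentions predicate ordering. A common heuristic for that step is to apply cheap, highly selective predicates first, for example by sorting on rank = (selectivity - 1) / cost; the sketch below illustrates that textbook rule and is not taken from the thesis itself.

```python
# Illustrative predicate ordering for QEP generation: apply predicates in
# ascending rank = (selectivity - 1) / cost, which puts cheap, highly
# selective predicates first. The rank formula is a standard textbook
# heuristic, not a detail from the thesis.
from dataclasses import dataclass


@dataclass
class Predicate:
    name: str
    selectivity: float   # fraction of objects expected to qualify (0..1]
    cost: float          # estimated cost of evaluating it on one object

    @property
    def rank(self) -> float:
        return (self.selectivity - 1.0) / self.cost


def order_predicates(preds: list[Predicate]) -> list[Predicate]:
    """Return predicates in the order they should be applied in the QEP."""
    return sorted(preds, key=lambda p: p.rank)


preds = [Predicate("inside_region", 0.05, 4.0),   # selective but costly
         Predicate("has_attribute", 0.50, 0.1),   # cheap, weakly selective
         Predicate("overlaps_road", 0.20, 2.0)]
print([p.name for p in order_predicates(preds)])
# ['has_attribute', 'overlaps_road', 'inside_region']
```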