
    Scalable aggregation predictive analytics: a query-driven machine learning approach

    We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark’s COUNT method.

    Robust Query Optimization Methods With Respect to Estimation Errors: A Survey

    The quality of a query execution plan chosen by a Cost-Based Optimizer (CBO) depends greatly on the estimation accuracy of input parameter values. Many research results have been produced on improving the estimation accuracy, but they do not work for every situation. Therefore, "robust query optimization" was introduced, in an effort to minimize the sub-optimality risk by accepting the fact that estimates could be inaccurate. In this survey, we aim to provide an overview of robust query optimization methods by classifying them into different categories, explaining the essential ideas, listing their advantages and limitations, and comparing them against multiple criteria.

    Query-driven learning for predictive analytics of data subspace cardinality

    Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., due to privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance that is superior to that of data-driven approaches.
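    The quantize-associate-predict steps listed in this abstract can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' model: the class name `QueryDrivenCounter`, the `radius` and `lr` parameters, and the inverse-distance weighting are all hypothetical choices, and the optimal-stopping adaptation step is omitted.

```python
import math


class QueryDrivenCounter:
    """Toy query-driven COUNT predictor (illustrative only, not the paper's model)."""

    def __init__(self, radius=0.5, lr=0.1):
        self.radius = radius   # hypothetical: max distance at which a prototype is reused
        self.lr = lr           # hypothetical: step size for incremental updates
        self.prototypes = []   # each entry is [query_vector, running_mean_count]

    @staticmethod
    def _dist(a, b):
        # Euclidean distance between two query vectors of equal length.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def observe(self, query, count):
        """Learn incrementally from one executed query and its true COUNT."""
        if self.prototypes:
            proto = min(self.prototypes, key=lambda p: self._dist(p[0], query))
            if self._dist(proto[0], query) <= self.radius:
                # Move the nearest prototype and its count towards the new observation.
                proto[0] = [w + self.lr * (q - w) for w, q in zip(proto[0], query)]
                proto[1] += self.lr * (count - proto[1])
                return
        # No prototype is close enough: start a new one for this region.
        self.prototypes.append([list(query), float(count)])

    def predict(self, query, k=2):
        """Predict COUNT for an unseen query from its k most similar prototypes."""
        if not self.prototypes:
            raise ValueError("no past queries observed yet")
        nearest = sorted(self.prototypes, key=lambda p: self._dist(p[0], query))[:k]
        weights = [1.0 / (self._dist(p[0], query) + 1e-9) for p in nearest]
        return sum(w * p[1] for w, p in zip(weights, nearest)) / sum(weights)


# Usage: a range query is encoded here as [low_x, low_y, high_x, high_y] (an assumption).
model = QueryDrivenCounter(radius=0.5)
model.observe([0, 0, 1, 1], 100)
model.observe([5, 5, 6, 6], 900)
estimate = model.predict([0.2, 0.1, 1.1, 1.0])
```

    The point of the sketch is that prediction touches only past queries and their answers, never the underlying data, which is what makes the approach usable when data access is restricted.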

    FactorJoin: A New Cardinality Estimation Framework for Join Queries

    Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of join queries. They either rely on simplified assumptions leading to ineffective cardinality estimates or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries. In this paper, we propose a new framework, FactorJoin, for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method to efficiently handle joins with the learning-based methods to accurately capture attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query arrives, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality. Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin can produce more effective estimates than the previous state-of-the-art learning-based methods, with 40x less estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMS. Comment: Paper accepted by SIGMOD 202
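    For readers unfamiliar with the classical join-histogram method that this abstract builds on, the sketch below estimates the size of a two-table equi-join from per-table binned statistics only. It shows the textbook technique under a within-bin uniformity assumption, not FactorJoin's factor-graph inference; the function names, the integer join keys, and the `bin_width` parameter are hypothetical.

```python
from collections import Counter


def build_histogram(keys, bin_width):
    """Per-bin row counts and distinct-value counts for an integer join-key column."""
    rows = Counter(k // bin_width for k in keys)
    distinct = Counter(k // bin_width for k in set(keys))
    return rows, distinct


def estimate_join_size(hist_a, hist_b):
    """Estimate |A JOIN B| on the key, assuming values are uniform within each bin."""
    rows_a, ndv_a = hist_a
    rows_b, ndv_b = hist_b
    total = 0.0
    # Only bins present on both sides can produce join matches.
    for b in rows_a.keys() & rows_b.keys():
        total += rows_a[b] * rows_b[b] / max(ndv_a[b], ndv_b[b])
    return total


# Usage: the true join size of these two key columns is 4.
a_keys = [1, 1, 2, 3, 10]
b_keys = [1, 2, 2, 11]
fine = estimate_join_size(build_histogram(a_keys, 1), build_histogram(b_keys, 1))
coarse = estimate_join_size(build_histogram(a_keys, 10), build_histogram(b_keys, 10))
```

    With one value per bin the estimate is exact; wider bins trade accuracy for smaller statistics, which is the kind of error that capturing attribute correlation is meant to reduce.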

    Robust and adaptive query processing in hybrid transactional/analytical database systems

    The quality of query execution plans in database systems determines how fast a query can be processed. Conventional query optimization may still select sub-optimal or even bad query execution plans, due to errors in the cardinality estimation. In this work, we address limitations and unsolved problems of Robust and Adaptive Query Processing, with the goal of improving the detection and compensation of sub-optimal query execution plans. We demonstrate that existing heuristics cannot sufficiently characterize the intermediate result cardinalities for which a given query execution plan remains optimal, and present an algorithm to calculate precise optimality ranges. The compensation of sub-optimal query execution plans is a complementary problem. We describe metrics to quantify the robustness of query execution plans with respect to cardinality estimation errors. In queries with cardinality estimation errors, our corresponding robust plan selection strategy chooses query execution plans that are up to 3.49x faster than the estimated cheapest plans. Furthermore, we present an adaptive query processor to compensate for sub-optimal query execution plans. It collects true cardinalities of intermediate results at query execution time to re-optimize the currently running query. We show that the overall effort for re-optimizations and plan switches is similar to the initial optimization. Our adaptive query processor can execute queries up to 5.19x faster than a conventional query processor.

    How Good Are Query Optimizers, Really?

    Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisi