6 research outputs found

    Aggregate Query Prediction under Dynamic Workloads

    Large organizations have seamlessly incorporated data-driven decision making in their operations. However, as data volumes increase, expensive big data infrastructures are called to the rescue. In this setting, analytics tasks become very costly in terms of query response time, resource consumption, and money in cloud deployments, especially when base data are stored across geographically distributed data centers. Therefore, we introduce an adaptive Machine Learning mechanism which is lightweight, is stored client-side, can estimate the answers of a variety of aggregate queries, and can avoid the big data backend. The estimations are performed in milliseconds and are inexpensive, as the mechanism learns from past analytical-query patterns. However, as analytic queries are ad hoc and analysts’ interests change over time, we develop solutions that can swiftly and accurately detect such changes and adapt to new query patterns. The capabilities of our approach are demonstrated using extensive evaluation with real and synthetic datasets.

    Explaining Aggregates for Exploratory Analytics

    Analysts wishing to explore multivariate data spaces typically pose queries involving selection operators, i.e., range or radius queries, which define data subspaces of possible interest, and then use aggregation functions, the results of which determine their exploratory analytics interests. However, such aggregate query (AQ) results are simple scalars and, as such, convey limited information about the queried subspaces for exploratory analysis. We address this shortcoming by aiding analysts to explore and understand data subspaces through a novel explanation mechanism coined XAXA: eXplaining Aggregates for eXploratory Analytics. XAXA’s novel AQ explanations are represented using functions obtained by a three-fold joint optimization problem. Explanations assume the form of a set of parametric piecewise-linear functions acquired through a statistical learning model. A key feature of the proposed solution is that model training is performed by only monitoring AQs and their answers on-line. In XAXA, explanations for future AQs can be computed without any database (DB) access and can be used to further explore the queried data subspaces, without issuing any more queries to the DB. We evaluate the explanation accuracy and efficiency of XAXA through theoretically grounded metrics over real-world and synthetic datasets and query workloads.

    Query-driven learning for predictive analytics of data subspace cardinality

    Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., because of privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers a performance that is superior to that of data-driven approaches.
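
Steps (i) to (iii) of the abstract above can be illustrated with a minimal sketch. This is not the authors' model: the class name, the `vigilance` distance threshold, the `learning_rate`, and the encoding of a radius query as `[center..., radius]` are all assumptions made for illustration, and step (iv), change detection via optimal stopping, is omitted. The sketch quantizes query vectors into prototypes, keeps a running cardinality estimate per prototype, and predicts an unseen query's cardinality from its nearest prototype.

```python
import math


class QueryDrivenCardinalityEstimator:
    """Hypothetical sketch: online quantization of query vectors with a
    running cardinality estimate attached to each prototype."""

    def __init__(self, vigilance=1.0, learning_rate=0.1):
        self.vigilance = vigilance          # assumed: max distance to reuse a prototype
        self.learning_rate = learning_rate  # assumed: step size for online updates
        self.prototypes = []                # quantized query vectors
        self.cardinalities = []             # cardinality estimate per prototype

    def _nearest(self, query):
        # Index and Euclidean distance of the closest prototype.
        distances = [math.dist(query, p) for p in self.prototypes]
        best = min(range(len(distances)), key=distances.__getitem__)
        return best, distances[best]

    def observe(self, query, cardinality):
        """Learn from one executed query and its true answer."""
        if self.prototypes:
            best, distance = self._nearest(query)
            if distance <= self.vigilance:
                lr = self.learning_rate
                # Move the prototype and its cardinality toward the observation.
                self.prototypes[best] = [
                    p + lr * (q - p) for p, q in zip(self.prototypes[best], query)
                ]
                self.cardinalities[best] += lr * (cardinality - self.cardinalities[best])
                return
        # No prototype is close enough: start a new one at this query.
        self.prototypes.append([float(q) for q in query])
        self.cardinalities.append(float(cardinality))

    def predict(self, query):
        """Estimate cardinality without touching the data, via query similarity."""
        if not self.prototypes:
            raise ValueError("no queries observed yet")
        best, _ = self._nearest(query)
        return self.cardinalities[best]


# Usage: radius queries encoded as [x, y, radius] (an assumed encoding).
estimator = QueryDrivenCardinalityEstimator(vigilance=1.0, learning_rate=0.5)
estimator.observe([0.0, 0.0, 1.0], 100)
estimator.observe([10.0, 10.0, 1.0], 500)
estimator.observe([0.2, 0.0, 1.0], 120)
near_origin = estimator.predict([0.1, 0.1, 1.0])
near_far_corner = estimator.predict([9.0, 9.0, 1.0])
```

A nearest-prototype lookup is the simplest choice here; a distance-weighted average over several prototypes would be a natural alternative, at the cost of one more assumed parameter.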

    Large-scale predictive modeling and analytics through regression queries in data management systems

    Regression analytics has been the standard approach to modeling the relationship between input and output variables, while recent trends aim to incorporate advanced regression analytics capabilities within data management systems (DMS). Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and an associated statistical learning methodology, which are efficient, scalable and accurate in discovering piecewise linear dependencies among variables by observing only regression queries and their answers issued to a DMS. We focus on in-DMS piecewise linear regression and specifically on predicting the answers to mean-value aggregate queries, identifying and delivering the piecewise linear dependencies between variables to regression queries, and predicting the data dependent variables within specific data subspaces defined by analysts and data scientists. Our goal is to discover a piecewise linear data function approximation over the underlying data only through query–answer pairs that is competitive with the best piecewise linear approximation to the ground truth. Our methodology is analyzed, evaluated and compared with the exact solution and near-perfect approximations of the underlying relationships among variables, achieving orders of magnitude improvement in analytics processing.

    Large-scale data exploration using explanatory regression functions

    Analysts wishing to explore multivariate data spaces typically issue queries involving selection operators, i.e., range or equality predicates, which define data subspaces of potential interest. Then, they use aggregation functions, the results of which determine a subspace’s interestingness for further exploration and deeper analysis. However, Aggregate Query (AQ) results are scalars and convey limited information and explainability about the queried subspaces for enhanced exploratory analysis. Analysts have no way of identifying how these results are derived or how they change with respect to query (input) parameter values. We address this shortcoming by aiding analysts to explore and understand data subspaces through a novel explanation mechanism based on machine learning. We explain AQ results using functions obtained by a three-fold joint optimization problem, which assume the form of explainable piecewise-linear regression functions. A key feature of the proposed solution is that the explanation functions are estimated using past executed queries. These queries provide a coarse-grained overview of the underlying aggregate function (generating the AQ results) to be learned. Explanations for future, previously unseen AQs can be computed without accessing the underlying data and can be used to further explore the queried data subspaces, without issuing more queries to the backend analytics engine. We evaluate the explanation accuracy and efficiency through theoretically grounded metrics over real-world and synthetic datasets and query workloads.