Scalable aggregation predictive analytics: a query-driven machine learning approach
We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data, allowing only aggregation operators such as COUNT to be executed over them. In this context, our methodology uses historical queries and their answers to accurately predict the answers to ad-hoc queries. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator both for internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms to ensure high-quality prediction results. The significance of our contribution lies in the fact that it (i) is the only query-driven solution applicable over general Big Data environments, including restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity, scalability, and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark's COUNT method.
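As a loose illustration of the query-driven SCP idea described above, the sketch below partitions past range queries (encoded as fixed-length vectors) by similarity and fits one local linear regressor per partition, mapping query vectors to COUNT answers; an unseen query is routed to its most similar partition. The class name, the query encoding, and the use of scikit-learn batch estimators are illustrative assumptions, not the authors' incremental model.

```python
# Hypothetical sketch of query-driven Set Cardinality Prediction (SCP):
# cluster past queries by similarity and fit a local linear regressor
# per cluster from query vectors to observed COUNT answers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class SetCardinalityPredictor:
    def __init__(self, n_prototypes=16):
        self.clusterer = KMeans(n_clusters=n_prototypes, random_state=0)
        self.local_models = {}

    def fit(self, query_vectors, counts):
        # query_vectors: (n_queries, d) array, e.g. [lo_1, hi_1, ..., lo_k, hi_k]
        # counts: observed COUNT answers for those historical queries
        labels = self.clusterer.fit_predict(query_vectors)
        for label in np.unique(labels):
            mask = labels == label
            self.local_models[label] = LinearRegression().fit(
                query_vectors[mask], counts[mask])

    def predict(self, query_vector):
        # Route the unseen query to its most similar prototype (query
        # similarity = distance to centroid) and apply the local estimator.
        q = np.asarray(query_vector, dtype=float).reshape(1, -1)
        label = self.clusterer.predict(q)[0]
        return float(self.local_models[label].predict(q)[0])
```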
PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries
Range aggregate queries find frequent application in data analytics. In some
use cases, approximate results are preferred over accurate results if they can
be computed rapidly and satisfy approximation guarantees. Inspired by a recent
indexing approach, we provide means of representing a discrete point data set
by continuous functions that can then serve as compact index structures. More
specifically, we develop a polynomial-based indexing approach, called PolyFit,
for processing approximate range aggregate queries. PolyFit is capable of
supporting multiple types of range aggregate queries, including COUNT, SUM, MIN
and MAX aggregates, with guaranteed absolute and relative error bounds.
Experiment results show that PolyFit is faster and more accurate and compact
than existing learned index structures.
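As a rough, one-dimensional illustration of the polynomial-indexing idea (a simplification, not PolyFit itself, which segments the key domain and enforces the stated error guarantees), one can fit a low-degree polynomial to the cumulative count function of a key column and answer an approximate range COUNT as the difference of two polynomial evaluations; the function names and parameters below are assumptions.

```python
# Simplified illustration: fit a polynomial to the cumulative count
# (rank) function of a sorted key column and use it as a compact index
# for approximate range COUNT queries.
import numpy as np

def build_poly_index(keys, degree=4):
    keys = np.sort(np.asarray(keys, dtype=float))
    ranks = np.arange(1, len(keys) + 1, dtype=float)    # cumulative counts
    return np.poly1d(np.polyfit(keys, ranks, degree))   # least-squares fit

def approx_range_count(poly, lo, hi):
    # Approximate COUNT(*) WHERE lo <= key <= hi as a difference of two
    # evaluations of the fitted curve.
    return max(0.0, float(poly(hi) - poly(lo)))

# Example usage on synthetic keys:
keys = np.random.exponential(scale=100.0, size=100_000)
index = build_poly_index(keys)
print(approx_range_count(index, 50.0, 150.0))
```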
Predictive intelligence of reliable analytics in distributed computing environments
Lack of knowledge of the underlying data distribution in distributed large-scale data can be an obstacle when issuing analytics & predictive modelling queries. Analysts often struggle to formulate analytics/exploration queries that satisfy their needs. In this paper, we study how exploration query results can be predicted in order to avoid the execution of ‘bad’/non-informative queries that waste network, storage, and financial resources, as well as time, in a distributed computing environment. The proposed methodology involves clustering a training set of exploration queries together with the cardinality of the results they retrieved (their score), and then using query-centroid representatives for prediction. After the training phase, we propose a novel refinement process to increase the reliability of predicting the score of new, unseen queries based on the refined query representatives. Comprehensive experimentation with real datasets shows that more reliable predictions are obtained after the proposed refinement method, which increases the reliability of the closest centroid and improves predictability under the right circumstances.
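A bare-bones sketch of the centroid-based prediction scheme described above, under assumed details: training exploration queries (as vectors) are clustered, each centroid keeps the mean score (result cardinality) of its members, and an unseen query inherits the score of its closest centroid. The refinement step is omitted, and all names below are illustrative rather than the paper's implementation.

```python
# Illustrative sketch: cluster training queries, attach the mean result
# cardinality (score) to each centroid, and predict an unseen query's
# score from its closest centroid.
import numpy as np
from sklearn.cluster import KMeans

def train_centroid_predictor(query_vectors, scores, n_clusters=32):
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(query_vectors)
    centroid_scores = np.array([
        scores[km.labels_ == c].mean() for c in range(n_clusters)
    ])
    return km, centroid_scores

def predict_score(km, centroid_scores, query_vector):
    q = np.asarray(query_vector, dtype=float).reshape(1, -1)
    return float(centroid_scores[km.predict(q)[0]])
```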
SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning
SkinnerDB is designed from the ground up for reliable join ordering. It
maintains no data statistics and uses no cost or cardinality models. Instead,
it uses reinforcement learning to learn optimal join orders on the fly, during
the execution of the current query. To that purpose, we divide the execution of
a query into many small time slices. Different join orders are tried in
different time slices. We merge result tuples generated according to different
join orders until a complete result is obtained. By measuring execution
progress per time slice, we identify promising join orders as execution
proceeds.
Along with SkinnerDB, we introduce a new quality criterion for query
execution strategies. We compare expected execution cost against execution cost
for an optimal join order. SkinnerDB features multiple execution strategies
that are optimized for that criterion. Some of them can be executed on top of
existing database systems. For maximal performance, we introduce a customized
execution engine, facilitating fast join order switching via specialized
multi-way join algorithms and tuple representations.
We experimentally compare SkinnerDB's performance against various baselines,
including MonetDB, Postgres, and adaptive processing methods. We consider
various benchmarks, including the join order benchmark and TPC-H variants with
user-defined functions. Overall, the overheads of reliable join ordering are
negligible compared to the performance impact of the occasional, catastrophic
join order choice.
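A toy sketch of the regret-bounded idea: treat candidate join orders as arms of a bandit, run the chosen order for one short time slice, and feed the measured progress back into a UCB-style selection rule. SkinnerDB itself explores the space of join orders with a tree-based (UCT) search and merges partial results produced under different orders; this flat-bandit version, and the execute_slice callback it assumes, are illustrative only.

```python
# Flat UCB bandit over candidate join orders, with one time slice of
# execution per selection; progress per slice is the reward.
import math

def choose_join_order(stats, total_slices):
    # stats: {order: (num_slices, total_progress)}
    best, best_score = None, -1.0
    for order, (n, progress) in stats.items():
        if n == 0:
            return order  # try every order at least once
        score = progress / n + math.sqrt(2.0 * math.log(total_slices + 1) / n)
        if score > best_score:
            best, best_score = order, score
    return best

def run_query(join_orders, execute_slice, max_slices=1000):
    # execute_slice(order) runs one small time slice under the given join
    # order and returns (progress_made, query_finished).
    stats = {order: (0, 0.0) for order in join_orders}
    for t in range(max_slices):
        order = choose_join_order(stats, t)
        progress, done = execute_slice(order)
        n, total = stats[order]
        stats[order] = (n + 1, total + progress)
        if done:
            return order
    return max(stats, key=lambda o: stats[o][1])
```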
Accurate sampling-based cardinality estimation for complex graph queries
Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in
database systems. This problem is particularly difficult in graph databases, where queries often involve a large
number of joins and self-joins. Recently, Park et al. [54] surveyed seven state-of-the-art cardinality estimation
approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method
based on the WanderJoin online aggregation algorithm [46] consistently offers superior accuracy.
We extended the framework by Park et al. [54] with three additional datasets and repeated their experiments.
Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples
and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails
to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond
simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference,
and duplicate elimination. None of the methods considered by Park et al. [54] is applicable to such queries.
In this paper we present a novel approach for estimating the cardinality of complex graph queries. Our
approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with
arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates
converges with probability one to the actual cardinality. We present optimisations of the basic algorithm
that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that
our approach is both accurate and quick on complex queries and large datasets. Finally, we discuss how to
integrate our approach into a simple dynamic programming query planner, and we confirm empirically that
our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times.
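A small sketch of the random-walk sampling idea (in the spirit of WanderJoin) that the approach above generalises: sample a walk through the join and weight each completed walk by the inverse of its sampling probability, so that the average over many walks converges to the true cardinality. The toy below estimates only a two-hop path join over an edge list, whereas the paper handles arbitrary nestings of relational operators; all names are illustrative.

```python
# Toy Horvitz-Thompson estimator for the cardinality of a two-hop path
# join over an edge list: each completed walk is weighted by the inverse
# of the probability of sampling exactly that walk.
import random
from collections import defaultdict

def estimate_path_count(edges, num_walks=10_000):
    # Estimates |{(a, b, c) : (a, b) in edges and (b, c) in edges}|.
    out = defaultdict(list)
    for a, b in edges:
        out[a].append(b)

    total = 0.0
    for _ in range(num_walks):
        a, b = random.choice(edges)   # first hop, probability 1/|edges|
        succ = out.get(b, [])
        if not succ:
            continue                  # failed walk contributes zero
        random.choice(succ)           # second hop, probability 1/|succ|
        total += len(edges) * len(succ)   # inverse sampling probability
    return total / num_walks

# Example usage on a tiny edge list:
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 1)]
print(estimate_path_count(edges))
```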