Histogram Domain Ordering for Path Selectivity Estimation
We aim to improve the accuracy of path selectivity estimation in graph databases by intelligently ordering the domain of a histogram used for estimation. This problem has not, to our knowledge, received adequate attention in the research community. We present a novel framework for the systematic study of path ordering strategies in histogram construction and use. Within this framework, we introduce new ordering strategies that, as we demonstrate experimentally, significantly improve the accuracy of path selectivity estimation over current strategies. These positive results highlight the fundamental role that domain ordering plays in the design of effective histograms for efficient and scalable graph query processing.
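To make the role of domain ordering concrete, here is a minimal toy sketch, not the paper's actual strategies: the frequency-based ordering, the data, and all names are illustrative assumptions. It shows how an equi-width histogram over a path domain gives very different estimates depending on how the domain is ordered before bucketing.

    # Toy equi-width histogram whose accuracy depends on the domain ordering
    # chosen before bucketing. All names and the data are illustrative.

    def build_histogram(path_counts, order_key, num_buckets=2):
        """Bucket paths in the given domain order; each bucket keeps one average."""
        ordered = sorted(path_counts, key=order_key)
        size = max(1, len(ordered) // num_buckets)
        histogram = {}  # path -> estimated count (its bucket's average)
        for i in range(0, len(ordered), size):
            bucket = ordered[i:i + size]
            avg = sum(path_counts[p] for p in bucket) / len(bucket)
            for p in bucket:
                histogram[p] = avg
        return histogram

    path_counts = {"a/b": 500, "a/c": 3, "a/d": 490, "x/w": 5, "x/y": 7, "x/z": 510}

    # Lexicographic ordering mixes frequent and rare paths into the same bucket.
    lex = build_histogram(path_counts, order_key=lambda p: p)
    # Frequency ordering groups paths with similar counts, tightening bucket averages.
    freq = build_histogram(path_counts, order_key=lambda p: path_counts[p])

    print(lex["x/y"], freq["x/y"])  # 174.0 vs 5.0; the true count is 7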
Exploring run-time reduction in programming codes via query optimization and caching
Object-oriented programming languages have raised the level of abstraction by supporting explicit first-class query constructs in program code. These constructs allow programmers to express operations on collections more abstractly than realizing them in loops or through provided libraries. Join optimization techniques from the field of database technology support efficient realizations of such language constructs. However, existing techniques such as query optimization in the Java Query Language (JQL) incur run-time overhead. Besides programming languages supporting first-class query constructs, the use of annotations has also increased in the software engineering community. Annotations are a common means of attaching metadata to source code: C# provides attribute constructs and Java has its own annotation constructs that allow developers to include metadata in their programs. This work introduces a series of query optimization approaches to reduce the run time of programs involving explicit queries over collections. The proposed approaches rely on histograms to estimate the selectivity of predicates and joins in order to construct query plans. Annotations in the source code are also used to gather the metadata required for selectivity estimation of numerical as well as string-valued predicates and joins. Several cache heuristics are proposed that cache the results of repeated queries, and the cached results are incrementally maintained up to date after update operations on the collections. --Abstract, page iv
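As a rough illustration of the caching idea (a sketch under simplifying assumptions; the class and its methods are invented for illustration and are not the thesis's actual design), repeated queries over a collection can be served from a cache that is incrementally maintained on updates rather than recomputed:

    # Minimal sketch: cache named predicate queries over an in-memory
    # collection and patch cached results in place on insertion.

    class CachedQueryCollection:
        def __init__(self, items):
            self.items = list(items)
            self.cache = {}  # query name -> (cached result list, predicate)

        def query(self, name, predicate):
            """Run a named predicate query, serving repeats from the cache."""
            if name not in self.cache:
                self.cache[name] = ([x for x in self.items if predicate(x)], predicate)
            return self.cache[name][0]

        def insert(self, item):
            """Incremental maintenance: probe each cached predicate once."""
            self.items.append(item)
            for result, predicate in self.cache.values():
                if predicate(item):
                    result.append(item)

    coll = CachedQueryCollection(range(10))
    evens = coll.query("evens", lambda x: x % 2 == 0)   # computed once
    coll.insert(12)                                     # cache updated in place
    print(coll.query("evens", lambda x: x % 2 == 0))    # cache hit, includes 12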
Sampling-Based Query Re-Optimization
Despite decades of work, query optimizers still make mistakes on "difficult" queries because of bad cardinality estimates, often due to the interaction of multiple predicates and correlations in the data. In this paper, we propose a low-cost post-processing step that can take a plan produced by the optimizer, detect when it is likely to have made such a mistake, and take steps to fix it. Specifically, our solution is a sampling-based iterative procedure that requires almost no changes to the original query optimizer or query evaluation mechanism of the system. We show that this indeed imposes low overhead and catches cases where three widely used optimizers (PostgreSQL and two commercial systems) make large errors. Comment: This is the extended version of a paper with the same title and authors that appears in the Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2016).
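The core loop can be sketched in a few lines. This is a toy, self-contained illustration, not the paper's implementation: the table data, the 2x error threshold, and the stand-in optimizer estimate are all assumptions. The idea is to sample one input of a join, scale up the match count, and trigger re-optimization when the optimizer's estimate is badly wrong.

    # Validate an optimizer's join cardinality estimate by sampling and
    # re-plan when it is off by more than a threshold. Toy data and numbers.
    import random

    def sampled_join_size(left, right, pred, sample_size=50):
        """Scale up the match count of a uniform sample of the left input."""
        sample = random.sample(left, min(sample_size, len(left)))
        matches = sum(1 for a in sample for b in right if pred(a, b))
        return matches * len(left) / max(1, len(sample))

    # Correlated data of the kind that fools independence-based estimates.
    r = [(i, i % 10) for i in range(1000)]    # (id, key): 100 rows per key
    s = [(i % 10, "x") for i in range(1000)]  # (key, payload): 100 rows per key
    pred = lambda a, b: a[1] == b[0]

    optimizer_estimate = 1000                 # e.g. assumes keys join one-to-one
    sampled = sampled_join_size(r, s, pred)   # true join size is 100,000

    if max(sampled, optimizer_estimate) / min(sampled, optimizer_estimate) > 2:
        print(f"estimate {optimizer_estimate} vs sampled {sampled:.0f}: re-optimize")
    else:
        print("estimate validated; keep the plan")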
FactorJoin: A New Cardinality Estimation Framework for Join Queries
Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of join queries. They either rely on simplified assumptions that lead to ineffective cardinality estimates or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries.
In this paper, we propose FactorJoin, a new framework for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method, to efficiently handle joins, with learning-based methods, to accurately capture attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query arrives, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality.
Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it relies only on single-table statistics, FactorJoin has small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin produces more effective estimates than the previous state-of-the-art learning-based methods, with 40x lower estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to traditional cardinality estimators in commercial DBMSs. Comment: Paper accepted by SIGMOD 2023.
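For intuition, here is the classical join-histogram idea that FactorJoin builds on, in toy form. FactorJoin itself replaces the plain per-table counts with learned conditional distributions and performs inference over a factor graph; this sketch, assuming a single-key equi-join and invented table data, shows only the histogram baseline of combining single-table statistics without touching the joined data.

    # Estimate |A join B| from per-table join-key histograms built offline.
    from collections import Counter

    def build_histogram(rows, key):
        """Offline, single-table statistic: row count per join-key value."""
        return Counter(r[key] for r in rows)

    def estimate_join_size(hist_a, hist_b):
        """Sum count_A(v) * count_B(v) over the shared join-key values v."""
        return sum(n * hist_b[v] for v, n in hist_a.items() if v in hist_b)

    orders    = [{"cust": i % 100} for i in range(10_000)]
    customers = [{"cust": i} for i in range(100)]

    h_orders = build_histogram(orders, "cust")
    h_cust = build_histogram(customers, "cust")
    print(estimate_join_size(h_orders, h_cust))  # 10000: exact for this equi-join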
IO-Top-k: index-access optimized top-k query processing
Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. This procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. This entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. The prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. The current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. Our main contributions are new, principled scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. The methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. We also discuss efficient implementation techniques for the underlying data structures. In performance experiments with three different datasets (TREC Terabyte, HTTP server logs, and IMDB), our methods achieved significant performance gains compared to the best previously known methods: a factor of up to 3 in terms of execution costs, and a factor of 5 in terms of absolute run-times of our implementation. Our best techniques are close to a lower bound for the execution cost of the considered class of threshold algorithms.
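For reference, here is a minimal sketch of the baseline threshold algorithm this work optimizes: round-robin sorted access with immediate random accesses. The paper's contribution is precisely the smarter scheduling of both access types, which is not shown here; the data and names below are illustrative assumptions.

    # Baseline threshold algorithm (TA): scan sorted index lists round-robin,
    # resolve each new candidate's full score via random accesses, and stop
    # once the k-th best score reaches the threshold of last-seen scores.
    import heapq

    def threshold_topk(index_lists, k):
        """index_lists: per-term lists of (doc_id, score), sorted by score desc."""
        scores = {lst_id: dict(lst) for lst_id, lst in enumerate(index_lists)}
        last_seen = [lst[0][1] for lst in index_lists]  # score frontier per list
        seen, topk = set(), []                          # topk: min-heap of (score, doc)
        for depth in range(max(len(lst) for lst in index_lists)):
            for lst_id, lst in enumerate(index_lists):
                if depth >= len(lst):
                    continue
                doc, score = lst[depth]                 # sorted (sequential) access
                last_seen[lst_id] = score
                if doc in seen:
                    continue
                seen.add(doc)
                # Random accesses: look the doc up in every list for its full score.
                total = sum(s.get(doc, 0.0) for s in scores.values())
                heapq.heappush(topk, (total, doc))
                if len(topk) > k:
                    heapq.heappop(topk)
            # Stop when no unseen doc can beat the current k-th best score.
            if len(topk) == k and topk[0][0] >= sum(last_seen):
                break
        return sorted(topk, reverse=True)

    lists = [[("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
             [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)]]
    print(threshold_topk(lists, 2))  # [(1.5, 'd2'), (1.1, 'd1')]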