QuickSel: Quick Selectivity Learning with Mixture Models
Estimating the selectivity of a query is a key step in almost any cost-based
query optimizer. Most of today's databases rely on histograms or samples that
are periodically refreshed by re-scanning the data as the underlying data
changes. Since frequent scans are costly, these statistics are often stale and
lead to poor selectivity estimates. As an alternative to scans, query-driven
histograms have been proposed, which refine the histograms based on the actual
selectivities of the observed queries. Unfortunately, these approaches are
either too costly to use in practice---i.e., require an exponential number of
buckets---or quickly lose their advantage as they observe more queries.
In this paper, we propose a selectivity learning framework, called QuickSel,
which falls into the query-driven paradigm but does not use histograms.
Instead, it builds an internal model of the underlying data, which can be
refined significantly faster (e.g., only 1.9 milliseconds for 300 queries).
This fast refinement allows QuickSel to continuously learn from each query and
yield increasingly more accurate selectivity estimates over time. Unlike
query-driven histograms, QuickSel relies on a mixture model and a new
optimization algorithm for training its model. Our extensive experiments on two
real-world datasets confirm that, given the same target accuracy, QuickSel is
34.0x-179.4x faster than state-of-the-art query-driven histograms, including
ISOMER and STHoles. Further, given the same space budget, QuickSel is
26.8%-91.8% more accurate than periodically-updated histograms and samples,
respectively.
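The mixture-model idea can be illustrated with a minimal one-dimensional sketch. Note this is only a toy under simplifying assumptions, not QuickSel's actual algorithm (the paper uses multidimensional uniform mixtures and a quadratic-programming formulation): each past query's range becomes a candidate subpopulation, and nonnegative mixture weights are fit so the model reproduces the observed selectivities.

```python
# Toy 1-D sketch of query-driven selectivity learning with a uniform-mixture
# model. Illustrative only; QuickSel itself uses multidimensional boxes and a
# quadratic-program solver rather than this gradient-descent loop.

def overlap_fraction(box, query):
    """Fraction of a uniform box that falls inside the query interval."""
    lo = max(box[0], query[0])
    hi = min(box[1], query[1])
    return max(0.0, hi - lo) / (box[1] - box[0])

def fit_weights(boxes, queries, selectivities, lr=0.05, iters=5000):
    """Projected gradient descent on the squared selectivity error,
    keeping the mixture weights nonnegative."""
    w = [1.0 / len(boxes)] * len(boxes)
    A = [[overlap_fraction(b, q) for b in boxes] for q in queries]
    for _ in range(iters):
        grad = [0.0] * len(boxes)
        for row, s in zip(A, selectivities):
            err = sum(wj * aj for wj, aj in zip(w, row)) - s
            for j, aj in enumerate(row):
                grad[j] += 2.0 * err * aj
        w = [max(0.0, wj - lr * g) for wj, g in zip(w, grad)]
    return w

def estimate(w, boxes, query):
    """Model selectivity: weighted overlap with each subpopulation."""
    return sum(wj * overlap_fraction(b, query) for wj, b in zip(w, boxes))

# Observed (query interval, true selectivity) pairs over uniform data on [0, 1].
queries = [(0.0, 0.5), (0.25, 0.75), (0.0, 1.0)]
sels = [0.5, 0.5, 1.0]
boxes = queries  # reuse the query ranges as candidate subpopulations
w = fit_weights(boxes, queries, sels)
```

Because refinement only re-solves a small optimization over the observed queries rather than re-scanning the data, each new query feedback can be folded in cheaply, which is the property the abstract's millisecond-scale refinement numbers rest on.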
Database Optimization Aspects for Information Retrieval
There is a growing need for systems that can process queries combining structured data and text. One way to provide such functionality is to integrate information retrieval (IR) techniques into a database management system (DBMS). However, IR and database research have been separate research fields for decades, resulting in different - even conflicting - approaches to data management.
Each DBMS has a component called a "query optimizer", which plays a crucial role in the efficiency and flexibility of the system. So, for successful integration, the IR techniques and data structures, as well as the DBMS query optimizer, should be adapted to enable mutual cooperation.
The author concentrates on top-N queries - a common class of IR queries. An IR top-N query asks for the N best documents given a set of keywords. The author proposes processing the data in batches as a compromise between IR and DBMS query processing. Experiments with this technique show that porting IR optimization techniques is (still) not a promising option due to the additional administrative overhead. Two new mathematical models are introduced to eliminate this overhead: a model that predicts selectivity, which is a crucial factor in the execution costs, and a model that predicts the quality of the top-N result.
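The batch-processing compromise can be sketched as follows. The early-termination rule and the assumption of an impact-ordered score list are illustrative choices for this sketch, not the author's actual models:

```python
import heapq

# Toy sketch of batch-wise top-N processing: scores arrive in batches from an
# impact-ordered index (an assumption of this sketch), so once the best
# remaining score cannot displace the current N-th score, we can stop early.

def top_n_batched(scores_desc, n, batch_size):
    """scores_desc: (doc_id, score) pairs sorted by descending score.
    Returns the top-N doc ids and the number of documents processed."""
    heap = []  # min-heap holding the current top-N as (score, doc_id) pairs
    processed = 0
    for start in range(0, len(scores_desc), batch_size):
        batch = scores_desc[start:start + batch_size]
        for doc_id, score in batch:
            if len(heap) < n:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
        processed += len(batch)
        nxt = start + batch_size
        # Early termination: the next batch's best score cannot enter the top-N.
        if nxt < len(scores_desc) and len(heap) == n \
                and scores_desc[nxt][1] <= heap[0][0]:
            break
    top = sorted(heap, reverse=True)
    return [doc_id for score, doc_id in top], processed

docs = [("d%d" % i, s) for i, s in enumerate([9, 8, 7, 6, 5, 4, 3, 2, 1])]
top, seen = top_n_batched(docs, 3, 3)
```

The trade-off the abstract describes shows up directly here: larger batches amortize per-batch administrative overhead, while the selectivity and quality models would predict how many batches are actually needed before stopping.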
Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms
for processing a text database: either we can scan, or "crawl," the text database or, alternatively,
we can exploit search engine indexes and retrieve the documents of interest via carefully crafted
queries constructed in task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output "completeness" (e.g.,
in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to make the choice of execution
plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to
analyze query- and crawl-based plans in terms of both execution time and output completeness.
We adapt results from random-graph theory and statistics to develop a rigorous cost model for
the execution plans. Our cost model reflects the fact that the performance of the plans depends
on fundamental task-specific properties of the underlying text databases. We identify these
properties and present efficient techniques for estimating the associated parameters of the cost
model. We also present two optimization approaches for text-centric tasks that rely on the cost-model
parameters and select efficient execution plans. Overall, our optimization approaches
help build efficient execution plans for a task, resulting in significant efficiency and output
completeness benefits. We complement our results with a large-scale experimental evaluation
for three important text-centric tasks and over multiple real-life data sets.
Information Systems Working Papers Series
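A drastically simplified version of such a cost-based plan choice can be sketched as follows. The article's actual model rests on random-graph theory; the coupon-collector-style coverage estimate and every parameter below are illustrative assumptions:

```python
import math

# Toy cost-based choice between a crawl plan and a query plan for reaching a
# target recall. The coupon-collector-style approximation and all parameters
# are illustrative assumptions, not the article's random-graph cost model.

def crawl_cost(db_size, time_per_doc, recall):
    # Scanning documents in random order reaches recall r after
    # processing roughly r * |D| documents.
    return db_size * recall * time_per_doc

def query_cost(relevant, per_query_hits, time_per_query, recall):
    # Model each query as drawing per_query_hits relevant documents uniformly
    # at random from the relevant set; distinct coverage after q queries is
    # then approximately 1 - (1 - k/R)^q, so solve for q.
    q = math.ceil(math.log(1.0 - recall)
                  / math.log(1.0 - per_query_hits / relevant))
    return q * time_per_query

def choose_plan(db_size, relevant, per_query_hits,
                time_per_doc, time_per_query, recall):
    c = crawl_cost(db_size, time_per_doc, recall)
    q = query_cost(relevant, per_query_hits, time_per_query, recall)
    return ("query", q) if q < c else ("crawl", c)

# Cheap queries over a small relevant set favor the query plan...
plan_a, _ = choose_plan(100000, 1000, 50, 0.001, 1.0, 0.9)
# ...while expensive queries tip the balance back to crawling.
plan_b, _ = choose_plan(100000, 1000, 50, 0.001, 10.0, 0.9)
```

Even this toy exhibits the article's central point: the best plan depends on task-specific database properties (here, the relevant-set size and per-query yield), so estimating those parameters is what makes an informed choice possible.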
Cost-Based Optimization of Integration Flows
Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many application areas, such as real-time ETL and data synchronization between operational systems. Due to the increasing amount of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-date query results, many instances of integration flows are executed over time. Because of this high load, and because synchronous source systems block during flow execution, the performance of the central integration platform is crucial for an IT infrastructure. To meet these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization into on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays, during which we miss optimization opportunities. This approach ensures low optimization overhead and fast workload adaptation.
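The shift from periodical to on-demand re-optimization can be illustrated with a toy monitor. The exponential moving average and the 30% drift threshold below are assumptions made for the sketch, not the paper's actual technique:

```python
# Toy on-demand re-optimization monitor: maintain an incrementally updated
# cost statistic (an exponential moving average) per flow and trigger
# re-optimization only when observed costs drift far enough from the costs
# the current plan was optimized for. EMA and threshold are illustrative.

class OnDemandReoptimizer:
    def __init__(self, alpha=0.3, drift_threshold=0.3):
        self.alpha = alpha
        self.drift_threshold = drift_threshold
        self.ema = None        # incrementally maintained cost statistic
        self.baseline = None   # cost the current plan was optimized for
        self.reoptimizations = 0

    def observe(self, cost):
        """Record one flow instance's cost; return True if we re-optimized."""
        if self.ema is None:
            self.ema = self.baseline = cost
            return False
        self.ema = (1 - self.alpha) * self.ema + self.alpha * cost
        drift = abs(self.ema - self.baseline) / self.baseline
        if drift > self.drift_threshold:
            # Workload shift detected: re-optimize, then re-baseline so
            # stable workloads do not trigger repeated re-optimizations.
            self.reoptimizations += 1
            self.baseline = self.ema
            return True
        return False
```

A stable workload never crosses the threshold, so no re-optimization steps are wasted; a genuine shift crosses it within a few instances, which is the low-overhead, fast-adaptation behavior the abstract claims for the on-demand variant.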
Clustering-Initialized Adaptive Histograms and Probabilistic Cost Estimation for Query Optimization
An assumption with self-tuning histograms has been that they can "learn" the dataset if given enough training queries. We show that this is not the case with current approaches: the quality of the histogram depends on its initial configuration. Starting with a few good buckets can improve the efficiency of learning; without them, the histogram is likely to stagnate, i.e., converge to a bad configuration and stop learning. We also present a probabilistic cost estimation model.
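The initialization idea can be sketched in a toy one-dimensional form. The 1-D k-means seeding and the multiplicative feedback rule below are illustrative assumptions, not the paper's exact method:

```python
# Toy clustering-initialized self-tuning histogram: bucket boundaries are
# seeded by 1-D k-means over a data sample, then bucket frequencies are
# refined from observed query selectivities. The k-means seeding and the
# multiplicative feedback rule are illustrative assumptions.

def kmeans_1d(values, k, iters=20):
    centers = [min(values) + (max(values) - min(values)) * (i + 0.5) / k
               for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[j].append(v)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

class AdaptiveHistogram:
    def __init__(self, sample, k):
        centers = kmeans_1d(sample, k)
        # Bucket boundaries at midpoints between adjacent cluster centers.
        mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        self.bounds = [min(sample)] + mids + [max(sample)]
        n = len(sample)
        # Seed frequencies from the sample (rough: endpoints may be shared).
        self.freq = [sum(lo <= v <= hi for v in sample) / n
                     for lo, hi in zip(self.bounds, self.bounds[1:])]

    def _overlap(self, lo, hi, blo, bhi):
        w = bhi - blo
        return max(0.0, min(hi, bhi) - max(lo, blo)) / w if w > 0 else 0.0

    def estimate(self, lo, hi):
        return sum(f * self._overlap(lo, hi, blo, bhi)
                   for f, (blo, bhi)
                   in zip(self.freq, zip(self.bounds, self.bounds[1:])))

    def refine(self, lo, hi, observed):
        """Scale overlapping buckets toward the observed selectivity
        (query-driven feedback)."""
        est = self.estimate(lo, hi)
        if est <= 0:
            return
        ratio = observed / est
        for i, (blo, bhi) in enumerate(zip(self.bounds, self.bounds[1:])):
            if self._overlap(lo, hi, blo, bhi) > 0:
                self.freq[i] *= ratio
```

The point the abstract makes is visible here: if the seeded boundaries already separate the data's clusters, feedback only needs to adjust frequencies, whereas a bad initial bucketing leaves no bucket structure for feedback to improve.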
- …