QuickSel: Quick Selectivity Learning with Mixture Models
Estimating the selectivity of a query is a key step in almost any cost-based
query optimizer. Most of today's databases rely on histograms or samples that
are periodically refreshed by re-scanning the data as the underlying data
changes. Since frequent scans are costly, these statistics are often stale and
lead to poor selectivity estimates. As an alternative to scans, query-driven
histograms have been proposed, which refine the histograms based on the actual
selectivities of the observed queries. Unfortunately, these approaches are
either too costly to use in practice---i.e., require an exponential number of
buckets---or quickly lose their advantage as they observe more queries.
In this paper, we propose a selectivity learning framework, called QuickSel,
which falls into the query-driven paradigm but does not use histograms.
Instead, it builds an internal model of the underlying data, which can be
refined significantly faster (e.g., only 1.9 milliseconds for 300 queries).
This fast refinement allows QuickSel to continuously learn from each query and
yield increasingly more accurate selectivity estimates over time. Unlike
query-driven histograms, QuickSel relies on a mixture model and a new
optimization algorithm for training its model. Our extensive experiments on two
real-world datasets confirm that, given the same target accuracy, QuickSel is
34.0x-179.4x faster than state-of-the-art query-driven histograms, including
ISOMER and STHoles. Further, given the same space budget, QuickSel is
26.8%-91.8% more accurate than periodically-updated histograms and samples,
respectively.
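The mixture-model idea can be illustrated with a minimal one-dimensional sketch. Note this is only a toy under simplifying assumptions, not QuickSel's actual algorithm (the paper uses multidimensional uniform mixtures and a quadratic-programming formulation): each past query's range becomes a candidate subpopulation, and nonnegative mixture weights are fit so the model reproduces the observed selectivities.

```python
# Toy 1-D sketch of query-driven selectivity learning with a uniform-mixture
# model. Illustrative only; QuickSel itself uses multidimensional boxes and a
# quadratic-program solver rather than this gradient-descent loop.

def overlap_fraction(box, query):
    """Fraction of a uniform box that falls inside the query interval."""
    lo = max(box[0], query[0])
    hi = min(box[1], query[1])
    return max(0.0, hi - lo) / (box[1] - box[0])

def fit_weights(boxes, queries, selectivities, lr=0.05, iters=5000):
    """Projected gradient descent on the squared selectivity error,
    keeping the mixture weights nonnegative."""
    w = [1.0 / len(boxes)] * len(boxes)
    A = [[overlap_fraction(b, q) for b in boxes] for q in queries]
    for _ in range(iters):
        grad = [0.0] * len(boxes)
        for row, s in zip(A, selectivities):
            err = sum(wj * aj for wj, aj in zip(w, row)) - s
            for j, aj in enumerate(row):
                grad[j] += 2.0 * err * aj
        w = [max(0.0, wj - lr * g) for wj, g in zip(w, grad)]
    return w

def estimate(w, boxes, query):
    """Model selectivity: weighted overlap with each subpopulation."""
    return sum(wj * overlap_fraction(b, query) for wj, b in zip(w, boxes))

# Observed (query interval, true selectivity) pairs over uniform data on [0, 1].
queries = [(0.0, 0.5), (0.25, 0.75), (0.0, 1.0)]
sels = [0.5, 0.5, 1.0]
boxes = queries  # reuse the query ranges as candidate subpopulations
w = fit_weights(boxes, queries, sels)
```

Because refinement only re-solves a small optimization over the observed queries rather than re-scanning the data, each new query feedback can be folded in cheaply, which is the property the abstract's millisecond-scale refinement numbers rest on.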
Database Optimization Aspects for Information Retrieval
There is a growing need for systems that can process queries combining structured data and text. One way to provide such functionality is to integrate information retrieval (IR) techniques into a database management system (DBMS). However, IR and database research have been separate research fields for decades, resulting in different - even conflicting - approaches to data management.
Each DBMS has a component called a "query optimizer", which plays a crucial role in the efficiency and flexibility of the system. So, for successful integration, the IR techniques and data structures, as well as the DBMS query optimizer, should be adapted to enable mutual cooperation.
The author concentrates on top-N queries - a common class of IR queries. An IR top-N query asks for the N best documents given a set of keywords. The author proposes processing the data in batches as a compromise between IR and DBMS query processing. Experiments with this technique show that porting IR optimization techniques is (still) not a promising option due to the additional administrative overhead. Two new mathematical models are introduced to eliminate this overhead: a model that predicts selectivity, which is a crucial factor in the execution costs, and a model that predicts the quality of the top-N result.
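The batch-processing compromise can be sketched as follows. The early-termination rule and the assumption of an impact-ordered score list are illustrative choices for this sketch, not the author's actual models:

```python
import heapq

# Toy sketch of batch-wise top-N processing: scores arrive in batches from an
# impact-ordered index (an assumption of this sketch), so once the best
# remaining score cannot displace the current N-th score, we can stop early.

def top_n_batched(scores_desc, n, batch_size):
    """scores_desc: (doc_id, score) pairs sorted by descending score.
    Returns the top-N doc ids and the number of documents processed."""
    heap = []  # min-heap holding the current top-N as (score, doc_id) pairs
    processed = 0
    for start in range(0, len(scores_desc), batch_size):
        batch = scores_desc[start:start + batch_size]
        for doc_id, score in batch:
            if len(heap) < n:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
        processed += len(batch)
        nxt = start + batch_size
        # Early termination: the next batch's best score cannot enter the top-N.
        if nxt < len(scores_desc) and len(heap) == n \
                and scores_desc[nxt][1] <= heap[0][0]:
            break
    top = sorted(heap, reverse=True)
    return [doc_id for score, doc_id in top], processed

docs = [("d%d" % i, s) for i, s in enumerate([9, 8, 7, 6, 5, 4, 3, 2, 1])]
top, seen = top_n_batched(docs, 3, 3)
```

The trade-off the abstract describes shows up directly here: larger batches amortize per-batch administrative overhead, while the selectivity and quality models would predict how many batches are actually needed before stopping.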
Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms
for processing a text database: either we can scan, or "crawl," the text database or, alternatively,
we can exploit search engine indexes and retrieve the documents of interest via carefully crafted
queries constructed in task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output "completeness" (e.g.,
in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to make the choice of execution
plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to
analyze query- and crawl-based plans in terms of both execution time and output completeness.
We adapt results from random-graph theory and statistics to develop a rigorous cost model for
the execution plans. Our cost model reflects the fact that the performance of the plans depends
on fundamental task-specific properties of the underlying text databases. We identify these
properties and present efficient techniques for estimating the associated parameters of the cost
model. We also present two optimization approaches for text-centric tasks that rely on the cost-model
parameters and select efficient execution plans. Overall, our optimization approaches
help build efficient execution plans for a task, resulting in significant efficiency and output
completeness benefits. We complement our results with a large-scale experimental evaluation
for three important text-centric tasks and over multiple real-life data sets.
Information Systems Working Papers Series
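A drastically simplified version of such a cost-based plan choice can be sketched as follows. The article's actual model rests on random-graph theory; the coupon-collector-style coverage estimate and every parameter below are illustrative assumptions:

```python
import math

# Toy cost-based choice between a crawl plan and a query plan for reaching a
# target recall. The coupon-collector-style approximation and all parameters
# are illustrative assumptions, not the article's random-graph cost model.

def crawl_cost(db_size, time_per_doc, recall):
    # Scanning documents in random order reaches recall r after
    # processing roughly r * |D| documents.
    return db_size * recall * time_per_doc

def query_cost(relevant, per_query_hits, time_per_query, recall):
    # Model each query as drawing per_query_hits relevant documents uniformly
    # at random from the relevant set; distinct coverage after q queries is
    # then approximately 1 - (1 - k/R)^q, so solve for q.
    q = math.ceil(math.log(1.0 - recall)
                  / math.log(1.0 - per_query_hits / relevant))
    return q * time_per_query

def choose_plan(db_size, relevant, per_query_hits,
                time_per_doc, time_per_query, recall):
    c = crawl_cost(db_size, time_per_doc, recall)
    q = query_cost(relevant, per_query_hits, time_per_query, recall)
    return ("query", q) if q < c else ("crawl", c)

# Cheap queries over a small relevant set favor the query plan...
plan_a, _ = choose_plan(100000, 1000, 50, 0.001, 1.0, 0.9)
# ...while expensive queries tip the balance back to crawling.
plan_b, _ = choose_plan(100000, 1000, 50, 0.001, 10.0, 0.9)
```

Even this toy exhibits the article's central point: the best plan depends on task-specific database properties (here, the relevant-set size and per-query yield), so estimating those parameters is what makes an informed choice possible.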
Cost-Based Optimization of Integration Flows
Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many application areas, such as real-time ETL and data synchronization between operational systems. Due to the increasing amount of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-date query results, many instances of integration flows are executed over time. Because of this high load, and because synchronous source systems block during flow execution, the performance of the central integration platform is crucial for an IT infrastructure. To meet these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization into on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays, during which we miss optimization opportunities. This approach ensures low optimization overhead and fast workload adaptation.
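The shift from periodical to on-demand re-optimization can be illustrated with a toy monitor. The exponential moving average and the 30% drift threshold below are assumptions made for the sketch, not the paper's actual technique:

```python
# Toy on-demand re-optimization monitor: maintain an incrementally updated
# cost statistic (an exponential moving average) per flow and trigger
# re-optimization only when observed costs drift far enough from the costs
# the current plan was optimized for. EMA and threshold are illustrative.

class OnDemandReoptimizer:
    def __init__(self, alpha=0.3, drift_threshold=0.3):
        self.alpha = alpha
        self.drift_threshold = drift_threshold
        self.ema = None        # incrementally maintained cost statistic
        self.baseline = None   # cost the current plan was optimized for
        self.reoptimizations = 0

    def observe(self, cost):
        """Record one flow instance's cost; return True if we re-optimized."""
        if self.ema is None:
            self.ema = self.baseline = cost
            return False
        self.ema = (1 - self.alpha) * self.ema + self.alpha * cost
        drift = abs(self.ema - self.baseline) / self.baseline
        if drift > self.drift_threshold:
            # Workload shift detected: re-optimize, then re-baseline so
            # stable workloads do not trigger repeated re-optimizations.
            self.reoptimizations += 1
            self.baseline = self.ema
            return True
        return False
```

A stable workload never crosses the threshold, so no re-optimization steps are wasted; a genuine shift crosses it within a few instances, which is the low-overhead, fast-adaptation behavior the abstract claims for the on-demand variant.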
Clustering-Initialized Adaptive Histograms and Probabilistic Cost Estimation for Query Optimization
An assumption with self-tuning histograms has been that they can "learn" the dataset if given enough training queries. We show that this is not the case with current approaches: the quality of the histogram depends on its initial configuration. Starting with a few good buckets can improve the efficiency of learning; without them, the histogram is likely to stagnate, i.e., converge to a bad configuration and stop learning. We also present a probabilistic cost estimation model.
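The initialization idea can be sketched in a toy one-dimensional form. The 1-D k-means seeding and the multiplicative feedback rule below are illustrative assumptions, not the paper's exact method:

```python
# Toy clustering-initialized self-tuning histogram: bucket boundaries are
# seeded by 1-D k-means over a data sample, then bucket frequencies are
# refined from observed query selectivities. The k-means seeding and the
# multiplicative feedback rule are illustrative assumptions.

def kmeans_1d(values, k, iters=20):
    centers = [min(values) + (max(values) - min(values)) * (i + 0.5) / k
               for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[j].append(v)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

class AdaptiveHistogram:
    def __init__(self, sample, k):
        centers = kmeans_1d(sample, k)
        # Bucket boundaries at midpoints between adjacent cluster centers.
        mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        self.bounds = [min(sample)] + mids + [max(sample)]
        n = len(sample)
        # Seed frequencies from the sample (rough: endpoints may be shared).
        self.freq = [sum(lo <= v <= hi for v in sample) / n
                     for lo, hi in zip(self.bounds, self.bounds[1:])]

    def _overlap(self, lo, hi, blo, bhi):
        w = bhi - blo
        return max(0.0, min(hi, bhi) - max(lo, blo)) / w if w > 0 else 0.0

    def estimate(self, lo, hi):
        return sum(f * self._overlap(lo, hi, blo, bhi)
                   for f, (blo, bhi)
                   in zip(self.freq, zip(self.bounds, self.bounds[1:])))

    def refine(self, lo, hi, observed):
        """Scale overlapping buckets toward the observed selectivity
        (query-driven feedback)."""
        est = self.estimate(lo, hi)
        if est <= 0:
            return
        ratio = observed / est
        for i, (blo, bhi) in enumerate(zip(self.bounds, self.bounds[1:])):
            if self._overlap(lo, hi, blo, bhi) > 0:
                self.freq[i] *= ratio
```

The point the abstract makes is visible here: if the seeded boundaries already separate the data's clusters, feedback only needs to adjust frequencies, whereas a bad initial bucketing leaves no bucket structure for feedback to improve.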
- …