1,616 research outputs found
QuickSel: Quick Selectivity Learning with Mixture Models
Estimating the selectivity of a query is a key step in almost any cost-based
query optimizer. Most of today's databases rely on histograms or samples that
are periodically refreshed by re-scanning the data as the underlying data
changes. Since frequent scans are costly, these statistics are often stale and
lead to poor selectivity estimates. As an alternative to scans, query-driven
histograms have been proposed, which refine the histograms based on the actual
selectivities of the observed queries. Unfortunately, these approaches are
either too costly to use in practice---i.e., require an exponential number of
buckets---or quickly lose their advantage as they observe more queries.
In this paper, we propose a selectivity learning framework, called QuickSel,
which falls into the query-driven paradigm but does not use histograms.
Instead, it builds an internal model of the underlying data, which can be
refined significantly faster (e.g., only 1.9 milliseconds for 300 queries).
This fast refinement allows QuickSel to continuously learn from each query and
yield increasingly more accurate selectivity estimates over time. Unlike
query-driven histograms, QuickSel relies on a mixture model and a new
optimization algorithm for training its model. Our extensive experiments on two
real-world datasets confirm that, given the same target accuracy, QuickSel is
34.0x-179.4x faster than state-of-the-art query-driven histograms, including
ISOMER and STHoles. Further, given the same space budget, QuickSel is
26.8%-91.8% more accurate than periodically-updated histograms and samples,
respectively
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is a problem of pursuing the data
items whose distances to a query item are the smallest from a large database.
Various methods have been developed to address this problem, and recently a lot
of efforts have been devoted to approximate search. In this paper, we present a
survey on one of the main solutions, hashing, which has been widely studied
since the pioneering work locality sensitive hashing. We divide the hashing
algorithms two main categories: locality sensitive hashing, which designs hash
functions without exploring the data distribution and learning to hash, which
learns hash functions according the data distribution, and review them from
various aspects, including hash function design and distance measure and search
scheme in the hash coding space
Recommended from our members
Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty
Data management is becoming increasingly important in many applications, in particular, in large scientific databases where (1) data can be naturally modeled by continuous random variables, and (2) queries can involve complex predicates and/or be difficult for users to express explicitly. My thesis work aims to provide efficient support to both the data uncertainty and the query uncertainty .
When data is uncertain, an important class of queries requires query answers to be returned if their existence probabilities pass a threshold. I start with optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections by reducing dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple based approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques and the dynamic query planning has over 50% performance gains in most cases over a state-of-the-art threshold query optimizer and is very close to the optimal planning in all cases.
Next I address uncertain data management in the array model, which has gained popu- larity for scientific data processing recently due to performance benefits. I define the formal semantics of array operations on uncertain data involving both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes, and propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines up to 1 to 2 orders of magnitude while incurring only small storage overhead.
Finally, to bridge the increasing gap between the fast growth of data and the limited human ability to comprehend data and help the user retrieve high-value content from data more effectively, I propose to build interactive data exploration as a new database service, using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on the following three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting the final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (18x accuracy improvement for 4-dimensional workloads) while achieving desired efficiency for interactive exploration (2 to 5 seconds per iteration)
- …