5,129 research outputs found
Structure-Aware Sampling: Flexible and Accurate Summarization
In processing large quantities of data, a fundamental problem is to obtain a
summary which supports approximate query answering. Random sampling yields
flexible summaries which naturally support subset-sum queries with unbiased
estimators and well-understood confidence bounds.
Classic sample-based summaries, however, are designed for arbitrary subset
queries and are oblivious to the structure in the set of keys. The particular
structure, such as hierarchy, order, or product space (multi-dimensional),
makes range queries much more relevant for most analysis of the data.
Dedicated summarization algorithms for range-sum queries have also been
extensively studied. They can outperform existing sampling schemes in terms of
accuracy on range queries per summary size. Their accuracy, however, rapidly
degrades when, as is often the case, the query spans multiple ranges. They are
also less flexible - being targeted for range sum queries alone - and are often
quite costly to build and use.
In this paper we propose and evaluate variance optimal sampling schemes that
are structure-aware. These summaries improve over the accuracy of existing
structure-oblivious sampling schemes on range queries while retaining the
benefits of sample-based summaries: flexible summaries, with high accuracy on
both range queries and arbitrary subset queries
PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation
Online aggregation provides estimates to the final result of a computation
during the actual processing. The user can stop the computation as soon as the
estimate is accurate enough, typically early in the execution. This allows for
the interactive data exploration of the largest datasets. In this paper we
introduce the first framework for parallel online aggregation in which the
estimation virtually does not incur any overhead on top of the actual
execution. We define a generic interface to express any estimation model that
abstracts completely the execution details. We design a novel estimator
specifically targeted at parallel online aggregation. When executed by the
framework over a massive TPC-H instance, the estimator provides
accurate confidence bounds early in the execution even when the cardinality of
the final result is seven orders of magnitude smaller than the dataset size and
without incurring overhead.Comment: 36 page
Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information
Random sampling is an essential tool in the processing and transmission of
data. It is used to summarize data too large to store or manipulate and meet
resource constraints on bandwidth or battery power. Estimators that are applied
to the sample facilitate fast approximate processing of queries posed over the
original data and the value of the sample hinges on the quality of these
estimators.
Our work targets data sets such as request and traffic logs and sensor
measurements, where data is repeatedly collected over multiple {\em instances}:
time periods, locations, or snapshots.
We are interested in queries that span multiple instances, such as distinct
counts and distance measures over selected records. These queries are used for
applications ranging from planning to anomaly and change detection.
Unbiased low-variance estimators are particularly effective as the relative
error decreases with the number of selected record keys.
The Horvitz-Thompson estimator, known to minimize variance for sampling with
"all or nothing" outcomes (which reveals exacts value or no information on
estimated quantity), is not optimal for multi-instance operations for which an
outcome may provide partial information.
We present a general principled methodology for the derivation of (Pareto)
optimal unbiased estimators over sampled instances and aim to understand its
potential. We demonstrate significant improvement in estimate accuracy of
fundamental queries for common sampling schemes.Comment: This is a full version of a PODS 2011 pape
Estimating Cardinalities with Deep Sketches
We introduce Deep Sketches, which are compact models of databases that allow
us to estimate the result sizes of SQL queries. Deep Sketches are powered by a
new deep learning approach to cardinality estimation that can capture
correlations between columns, even across tables. Our demonstration allows
users to define such sketches on the TPC-H and IMDb datasets, monitor the
training process, and run ad-hoc queries against trained sketches. We also
estimate query cardinalities with HyPer and PostgreSQL to visualize the gains
over traditional cardinality estimators.Comment: To appear in SIGMOD'1
Distributed top-k aggregation queries at large
Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network
- …