Approximate query processing in a data warehouse using random sampling
Data analysis consumes a large volume of data on a routine basis. With the fast increase in both the volume of the data and the complexity of the analytic tasks, data processing becomes more complicated and expensive. Cost efficiency is a key factor in the design and deployment of data warehouse systems. Among the different methods for making big data processing more efficient, approximate query processing is a well-known approach to handling massive data, in which a small sample is used to answer the query. For many applications, a small error is justifiable in exchange for savings in the resources consumed to answer the query, as well as reduced latency.
We focus on approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective uses of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling that is optimized for population aggregate queries. Next, as the query evolves, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce sampling over the pipeline model of query processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, unlike in a single query, approximate results are fed into downstream queries. Thus, we see both aggregate computations from sampled input and approximate input.
We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and calculates the confidence interval for produced approximate results. The key insight of the algorithm calls for enriching the output of queries with additional information. This enables the algorithm to piggyback on the modular structure of the pipeline without having to perform any global rewrites, i.e. no extra query or table is added into the pipeline. Compared to the bootstrap method, the approach described in this paper provides the confidence interval while computing aggregation estimates only once and avoids the need for maintaining intermediary aggregation distributions.
Our empirical study on public and private datasets shows that, for optimal samples for population aggregate queries, our sampling algorithm can have significantly (1.4 to 50.0 times) smaller variance than the Neyman algorithm. Our experimental results for group-by queries show that our sampling algorithm outperforms the current state-of-the-art on sample quality and estimation accuracy. The optimal sample yields relative errors that are 5x smaller than competing approaches, under the same budget. The experiments for approximate pipeline processing show the high accuracy of the computed estimation, with an average error as low as 2%, using only a 1% sample. They also show the usefulness of the confidence interval. At the confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI boundary from 70.49% to 95.15% of the time.
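To make the baseline in this abstract concrete, here is a minimal sketch of textbook stratified sampling with Neyman allocation, an unbiased estimate of a population SUM, and a normal-approximation confidence interval. It is not the thesis's proposed algorithm. The function names, the floor of two rows per stratum, and the use of exact per-stratum standard deviations (which a real system would have to estimate) are all assumptions made for illustration.

```python
import math
import random
import statistics


def neyman_allocation(strata, budget):
    """Split a sample budget across strata in proportion to N_h * sigma_h.

    Assumption: every stratum has at least 2 rows and its standard
    deviation is known exactly; the floor of 2 rows per stratum means the
    total can slightly exceed the budget.
    """
    weights = [len(s) * statistics.pstdev(s) for s in strata]
    total = sum(weights) or 1.0  # all-constant data: fall back to the floor
    return [
        max(2, min(len(s), round(budget * w / total)))
        for s, w in zip(strata, weights)
    ]


def stratified_sum_estimate(strata, sizes, rng, z=1.96):
    """Unbiased estimate of the population SUM with an approximate 95% CI."""
    estimate = 0.0
    variance = 0.0
    for rows, n in zip(strata, sizes):
        big_n = len(rows)
        sample = rng.sample(rows, n)  # simple random sample, no replacement
        estimate += big_n * statistics.fmean(sample)
        # Variance of N_h * (sample mean), with finite-population correction.
        variance += big_n ** 2 * (1 - n / big_n) * statistics.variance(sample) / n
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


if __name__ == "__main__":
    rng = random.Random(7)
    strata = [
        [rng.gauss(10, 1) for _ in range(5000)],    # low-variance stratum
        [rng.gauss(100, 40) for _ in range(5000)],  # high-variance stratum
    ]
    sizes = neyman_allocation(strata, budget=100)
    est, ci = stratified_sum_estimate(strata, sizes, rng)
    print(sizes, est, ci, sum(map(sum, strata)))
```

The allocation spends most of the budget on the high-variance stratum, which is the property the abstract's variance comparison is measured against.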
Database Learning: Toward a Database that Becomes Smarter Every Time
In today's databases, previous query answers rarely benefit answering future
queries. For the first time, to the best of our knowledge, we change this
paradigm in an approximate query processing (AQP) context. We make the
following observation: the answer to each query reveals some degree of
knowledge about the answer to another query because their answers stem from the
same underlying distribution that has produced the entire dataset. Exploiting
and refining this knowledge should allow us to answer queries more
analytically, rather than by reading enormous amounts of raw data. Also,
processing more queries should continuously enhance our knowledge of the
underlying distribution, and hence lead to increasingly faster response times
for future queries.
We call this novel idea---learning from past query answers---Database
Learning. We exploit the principle of maximum entropy to produce answers, which
are in expectation guaranteed to be more accurate than existing sample-based
approximations. Empowered by this idea, we build a query engine on top of Spark
SQL, called Verdict. We conduct extensive experiments on real-world query
traces from a large customer of a major database vendor. Our results
demonstrate that Verdict supports 73.7% of these queries, speeding them up by
up to 23.0x for the same accuracy level compared to existing AQP systems.
Comment: This manuscript is an extended report of the work published in ACM
SIGMOD conference 201
Rapid Sampling for Visualizations with Ordering Guarantees
Visualizations are frequently used as a means to understand trends and gather
insights from datasets, but often take a long time to generate. In this paper,
we focus on the problem of rapidly generating approximate visualizations while
preserving crucial visual properties of interest to analysts. Our primary
focus will be on sampling algorithms that preserve the visual property of
ordering; our techniques will also apply to some other visual properties. For
instance, our algorithms can be used to generate an approximate visualization
of a bar chart very rapidly, where the comparisons between any two bars are
correct. We formally show that our sampling algorithms are generally applicable
and provably optimal in theory, in that they do not take more samples than
necessary to generate the visualizations with ordering guarantees. They also
work well in practice, correctly ordering output groups while taking orders of
magnitude fewer samples and much less time than conventional sampling schemes.
Comment: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No.
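The idea of sampling only until the bar order is settled can be illustrated with a simplified sketch. It draws one value per round from every group whose confidence interval still overlaps another group's interval, and stops sampling a group once its interval is separated from all others. This is an illustration in the spirit of the abstract, not the paper's algorithm: the function name `order_groups`, the Hoeffding-style half-width, and the union bound over groups only (not over rounds) are assumptions, so the sketch does not carry the paper's formal guarantee.

```python
import math
import random


def order_groups(groups, value_range, delta=0.05, max_rounds=100_000, rng=None):
    """Return (group indices sorted by estimated mean, samples drawn per group).

    Assumption: every value lies in an interval of width `value_range`.
    """
    rng = rng or random.Random(0)
    k = len(groups)
    sums = [0.0] * k
    counts = [0] * k
    half = [math.inf] * k  # current half-width of each group's interval
    active = set(range(k))

    def mean(g):
        return sums[g] / counts[g]

    for _ in range(max_rounds):
        if not active:
            break
        for g in active:
            sums[g] += rng.choice(groups[g])  # one draw with replacement
            counts[g] += 1
            half[g] = value_range * math.sqrt(
                math.log(2 * k / delta) / (2 * counts[g])
            )
        # A group stays active while its interval overlaps any other interval;
        # intervals of inactive groups are frozen at their last width.
        active = {
            g for g in active
            if any(
                abs(mean(g) - mean(o)) <= half[g] + half[o]
                for o in range(k) if o != g
            )
        }

    order = sorted(range(k), key=lambda g: mean(g) if counts[g] else 0.0)
    return order, counts


if __name__ == "__main__":
    bars = [[0.85, 0.9, 0.95], [0.05, 0.1, 0.15], [0.45, 0.5, 0.55]]
    print(order_groups(bars, value_range=1.0))
```

Groups whose means are far apart stop consuming samples early, while close groups keep being sampled, which is why the sample count adapts to how hard the ordering is.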
DROP: Dimensionality Reduction Optimization for Time Series
Dimensionality reduction is a critical step in scaling machine learning
pipelines. Principal component analysis (PCA) is a standard tool for
dimensionality reduction, but performing PCA over a full dataset can be
prohibitively expensive. As a result, theoretical work has studied the
effectiveness of iterative, stochastic PCA methods that operate over data
samples. However, termination conditions for stochastic PCA either execute for
a predetermined number of iterations, or until convergence of the solution,
frequently sampling too many or too few datapoints for end-to-end runtime
improvements. We show how accounting for downstream analytics operations during
DR via PCA allows stochastic methods to efficiently terminate after operating
over small (e.g., 1%) subsamples of input data, reducing whole workload
runtime. Leveraging this, we propose DROP, a DR optimizer that enables speedups
of up to 5x over Singular-Value-Decomposition-based PCA techniques, and exceeds
conventional approaches like FFT and PAA by up to 16x in end-to-end workloads.
Unbiased Comparative Evaluation of Ranking Functions
Eliciting relevance judgments for ranking evaluation is labor-intensive and
costly, motivating careful selection of which documents to judge. Unlike
traditional approaches that make this selection deterministically,
probabilistic sampling has shown intriguing promise since it enables the design
of estimators that are provably unbiased even when reusing data with missing
judgments. In this paper, we first unify and extend these sampling approaches
by viewing the evaluation problem as a Monte Carlo estimation task that applies
to a large number of common IR metrics. Drawing on the theoretical clarity that
this view offers, we tackle three practical evaluation scenarios: comparing two
systems, comparing systems against a baseline, and ranking systems. For
each scenario, we derive an estimator and a variance-optimizing sampling
distribution while retaining the strengths of sampling-based evaluation,
including unbiasedness, reusability despite missing data, and ease of use in
practice. In addition to the theoretical contribution, we empirically evaluate
our methods against previously used sampling heuristics and find that they
generally cut the number of required relevance judgments at least in half.
Comment: Under review; 10 page
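Viewing evaluation as Monte Carlo estimation can be sketched as follows: a metric of the form sum_i w_i * r_i (rank weight times relevance) is estimated by sampling documents from a distribution p and averaging w_i * r_i / p_i, which is unbiased whenever p_i > 0 for every document with w_i > 0. The DCG-style weights, the choice p_i proportional to w_i, and the `judge` callback standing in for a human assessor are assumptions for illustration; the paper derives its own variance-optimizing sampling distributions, which are not reproduced here.

```python
import math
import random


def dcg_weights(num_docs):
    """DCG-style rank discounts: 1 / log2(rank + 1) for ranks 1..num_docs."""
    return [1.0 / math.log2(rank + 2) for rank in range(num_docs)]


def estimate_metric(weights, judge, num_samples, rng=None):
    """Importance-sampling estimate of sum_i weights[i] * judge(i).

    `judge(i)` is a hypothetical callback returning the relevance of the
    document at rank position i; it is called once per distinct sampled doc.
    Returns (estimate, number of documents actually judged).
    """
    rng = rng or random.Random(0)
    total = sum(weights)
    probs = [w / total for w in weights]  # assumed sampling distribution
    drawn = rng.choices(range(len(weights)), weights=probs, k=num_samples)
    judged = {}
    acc = 0.0
    for i in drawn:
        if i not in judged:
            judged[i] = judge(i)  # the expensive step: a relevance judgment
        acc += weights[i] * judged[i] / probs[i]
    return acc / num_samples, len(judged)


if __name__ == "__main__":
    relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0] * 10  # toy ground truth
    w = dcg_weights(len(relevance))
    exact = sum(wi * ri for wi, ri in zip(w, relevance))
    est, n_judged = estimate_metric(w, lambda i: relevance[i], 2000)
    print(exact, est, n_judged)
```

Because judgments are cached per document, the same sampled judgments can be reused when estimating the metric for another ranking, provided the sampling probabilities are recorded.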
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to accurately and timely stop the execution. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time
A Bandit Approach to Maximum Inner Product Search
There has been substantial research on sub-linear time approximate algorithms
for Maximum Inner Product Search (MIPS). To achieve fast query time,
state-of-the-art techniques require significant preprocessing, which can be a
burden when the number of subsequent queries is not sufficiently large to
amortize the cost. Furthermore, existing methods do not have the ability to
directly control the suboptimality of their approximate results with
theoretical guarantees. In this paper, we propose the first approximate
algorithm for MIPS that does not require any preprocessing, and allows users to
control and bound the suboptimality of the results. We cast MIPS as a Best Arm
Identification problem, and introduce a new bandit setting that can fully
exploit the special structure of MIPS. Our approach outperforms
state-of-the-art methods on both synthetic and real-world datasets.
Comment: AAAI 201
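Casting MIPS as best-arm identification can be sketched with a generic successive-elimination loop: each candidate vector is an arm, a "pull" reveals the product of one coordinate of the query and the candidate, and candidates whose upper confidence bound falls below the best lower bound are dropped without reading their remaining coordinates. This is a generic sketch, not the paper's bandit setting or algorithm. The function name `bandit_mips`, the shared coordinate order, the batch size, and the Hoeffding-style bound are assumptions; it also assumes at least one candidate vector.

```python
import math
import random


def bandit_mips(query, vectors, delta=0.05, batch=8, rng=None):
    """Return the index of the vector with (probably) the largest inner product.

    No preprocessing of `vectors` is required; `delta` controls how
    aggressively candidates are eliminated.
    """
    rng = rng or random.Random(0)
    d = len(query)
    coords = list(range(d))
    rng.shuffle(coords)  # coordinates are visited in random order
    # Every per-coordinate product lies in [-bound, bound].
    bound = max(abs(q) for q in query) * max(abs(x) for v in vectors for x in v)
    partial = [0.0] * len(vectors)  # partial inner products of alive arms
    alive = set(range(len(vectors)))
    seen = 0
    while len(alive) > 1 and seen < d:
        chunk = coords[seen:seen + batch]
        for a in alive:
            partial[a] += sum(query[j] * vectors[a][j] for j in chunk)
        seen += len(chunk)
        # Half-width on the mean per-coordinate product; exact once seen == d.
        half = 0.0 if seen == d else bound * math.sqrt(
            2 * math.log(2 * len(vectors) / delta) / seen
        )
        means = {a: partial[a] / seen for a in alive}
        best_lower = max(means.values()) - half
        alive = {a for a in alive if means[a] + half >= best_lower}
    return max(alive, key=lambda a: partial[a])


if __name__ == "__main__":
    q = [1.0] * 64
    candidates = [[1.0] * 64, [-1.0] * 64, [0.2] * 64]
    print(bandit_mips(q, candidates))
```

A larger `delta` eliminates candidates sooner at the cost of a higher chance of discarding the true maximizer, which mirrors the abstract's point about letting users control the suboptimality of the result.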