30,655 research outputs found
PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation
Online aggregation provides estimates to the final result of a computation
during the actual processing. The user can stop the computation as soon as the
estimate is accurate enough, typically early in the execution. This allows for
the interactive data exploration of the largest datasets. In this paper we
introduce the first framework for parallel online aggregation in which the
estimation virtually does not incur any overhead on top of the actual
execution. We define a generic interface to express any estimation model that
abstracts completely the execution details. We design a novel estimator
specifically targeted at parallel online aggregation. When executed by the
framework over a massive TPC-H instance, the estimator provides
accurate confidence bounds early in the execution even when the cardinality of
the final result is seven orders of magnitude smaller than the dataset size and
without incurring overhead.Comment: 36 page
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to accurately and timely stop the execution. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time
An Iterative Scheme for Leverage-based Approximate Aggregation
The current data explosion poses great challenges to the approximate
aggregation with an efficiency and accuracy. To address this problem, we
propose a novel approach to calculate the aggregation answers with a high
accuracy using only a small portion of the data. We introduce leverages to
reflect individual differences in the samples from a statistical perspective.
Two kinds of estimators, the leverage-based estimator, and the sketch estimator
(a "rough picture" of the aggregation answer), are in constraint relations and
iteratively improved according to the actual conditions until their difference
is below a threshold. Due to the iteration mechanism and the leverages, our
approach achieves a high accuracy. Moreover, some features, such as not
requiring recording the sampled data and easy to extend to various execution
modes (e.g., the online mode), make our approach well suited to deal with big
data. Experiments show that our approach has an extraordinary performance, and
when compared with the uniform sampling, our approach can achieve high-quality
answers with only 1/3 of the same sample size.Comment: 17 pages, 9 figure
Early Accurate Results for Advanced Analytics on MapReduce
Approximate results based on samples often provide the only way in which
advanced analytical applications on very massive data sets can satisfy their
time and resource constraints. Unfortunately, methods and tools for the
computation of accurate early results are currently not supported in
MapReduce-oriented systems although these are intended for `big data'.
Therefore, we proposed and implemented a non-parametric extension of Hadoop
which allows the incremental computation of early results for arbitrary
work-flows, along with reliable on-line estimates of the degree of accuracy
achieved so far in the computation. These estimates are based on a technique
called bootstrapping that has been widely employed in statistics and can be
applied to arbitrary functions and data distributions. In this paper, we
describe our Early Accurate Result Library (EARL) for Hadoop that was designed
to minimize the changes required to the MapReduce framework. Various tests of
EARL of Hadoop are presented to characterize the frequent situations where EARL
can provide major speed-ups over the current version of Hadoop.Comment: VLDB201
A reusable iterative optimization software library to solve combinatorial problems with approximate reasoning
Real world combinatorial optimization problems such as scheduling are
typically too complex to solve with exact methods. Additionally, the problems
often have to observe vaguely specified constraints of different importance,
the available data may be uncertain, and compromises between antagonistic
criteria may be necessary. We present a combination of approximate reasoning
based constraints and iterative optimization based heuristics that help to
model and solve such problems in a framework of C++ software libraries called
StarFLIP++. While initially developed to schedule continuous caster units in
steel plants, we present in this paper results from reusing the library
components in a shift scheduling system for the workforce of an industrial
production plant.Comment: 33 pages, 9 figures; for a project overview see
http://www.dbai.tuwien.ac.at/proj/StarFLIP
- …