Spectra: Robust Estimation of Distribution Functions in Networks
Distributed aggregation allows the derivation of a given global aggregate
property from many individual local values in nodes of an interconnected
network system. Simple aggregates such as minima/maxima, counts, sums and
averages have been thoroughly studied in the past and are important tools for
distributed algorithms and network coordination. Nonetheless, such aggregates
may not be comprehensive enough to characterize skewed data distributions or
data containing outliers, making the case for richer
estimates of the values on the network. This work presents Spectra, a
distributed algorithm for the estimation of distribution functions over large
scale networks. The estimate is available at all nodes, and the technique
exhibits important properties, namely: robustness to high levels of message
loss, fast convergence, and fine precision in the estimate. It can
also dynamically cope with changes of the sampled local property, not requiring
algorithm restarts, and is highly resilient to node churn. The proposed
approach is experimentally evaluated and contrasted with a competing
state-of-the-art distribution aggregation technique.
Comment: Full version of the paper published at the 12th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS), Stockholm (Sweden), June 201
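The abstract above does not detail Spectra's algorithm, so the following is only a generic sketch of gossip-style distribution aggregation: each node holds a per-bin indicator vector for its local value and repeatedly averages it with random peers, so every node converges to the global bin frequencies, from which a CDF estimate can be read off. The function name, binning scheme, and pairwise-averaging protocol are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: shows gossip-based aggregation of a distribution,
# not Spectra's actual algorithm.
import random

def gossip_cdf(values, bins, rounds=200, seed=0):
    """Estimate the empirical CDF of `values` at every node via pairwise gossip."""
    rng = random.Random(seed)
    n = len(values)
    # Each node starts with an indicator vector marking the bin of its local value.
    state = []
    for v in values:
        vec = [0.0] * len(bins)
        idx = next((i for i, b in enumerate(bins) if v <= b), len(bins) - 1)
        vec[idx] = 1.0
        state.append(vec)
    for _ in range(rounds):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        # Pairwise averaging preserves the global mean of every bin fraction.
        avg = [(a + b) / 2.0 for a, b in zip(state[i], state[j])]
        state[i], state[j] = avg[:], avg[:]
    # Any node's vector now approximates the global bin frequencies;
    # a running sum turns it into a CDF estimate.
    cdf, acc = [], 0.0
    for p in state[0]:
        acc += p
        cdf.append(acc)
    return cdf

# Six "nodes", each holding one local value; every node converges to the same CDF.
print(gossip_cdf([1, 2, 2, 3, 7, 9], bins=[2, 4, 6, 8, 10]))
```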
Fast and Robust Rank Aggregation against Model Misspecification
In rank aggregation, preferences from different users are summarized into a
total order under the assumption that the data are homogeneous. When this
assumption fails, model misspecification arises, and existing rank aggregation
methods address it by assuming particular noise models. However, they all rely
on specific noise-model assumptions and cannot handle agnostic noise in the
real world. In this paper, we propose CoarsenRank, which
rectifies the underlying data distribution directly and aligns it to the
homogeneous data assumption without involving any noise model. To this end, we
define a neighborhood of the data distribution over which Bayesian inference of
CoarsenRank is performed, and therefore the resultant posterior enjoys
robustness against model misspecification. Further, we derive a tractable
closed-form solution for CoarsenRank making it computationally efficient.
Experiments on real-world datasets show that CoarsenRank is fast and robust,
achieving consistent improvements over baseline methods.
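The abstract only names the ingredients of CoarsenRank (a neighborhood of the data distribution, Bayesian inference, a closed-form solution); the sketch below is a loose illustration of the coarsening idea using a power (tempered) likelihood on a Plackett-Luce model fitted by gradient ascent. The model choice, the temperature zeta, and all other details are assumptions for illustration and are not the paper's closed-form estimator.

```python
# Illustrative sketch of the "coarsening" idea: a power (tempered) likelihood
# down-weights every observation, approximating inference over a neighborhood
# of the empirical distribution. Not CoarsenRank's actual closed-form solution.
import math

def tempered_pl_scores(rankings, n_items, zeta=0.5, steps=500, lr=0.05):
    """MAP-style fit of Plackett-Luce utilities under a likelihood raised to power zeta."""
    theta = [0.0] * n_items            # log-worth of each item
    for _ in range(steps):
        grad = [0.0] * n_items
        for ranking in rankings:       # ranking = items ordered best to worst
            rest = list(ranking)
            while len(rest) > 1:
                winner = rest[0]
                denom = sum(math.exp(theta[i]) for i in rest)
                for i in rest:
                    p = math.exp(theta[i]) / denom
                    grad[i] += zeta * ((1.0 if i == winner else 0.0) - p)
                rest = rest[1:]
        # Gradient ascent with a weak Gaussian prior pulling scores toward 0.
        theta = [t + lr * (g - 0.01 * t) for t, g in zip(theta, grad)]
    return theta

# Three noisy rankings of 4 items; item 0 is usually ranked best.
print(tempered_pl_scores([[0, 1, 2, 3], [0, 2, 1, 3], [1, 0, 3, 2]], n_items=4))
```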
An Iterative Scheme for Leverage-based Approximate Aggregation
The current data explosion poses great challenges to approximate aggregation
in terms of both efficiency and accuracy. To address this problem, we
propose a novel approach that calculates aggregation answers with high
accuracy using only a small portion of the data. We introduce leverages to
reflect individual differences in the samples from a statistical perspective.
Two kinds of estimators, the leverage-based estimator and the sketch estimator
(a "rough picture" of the aggregation answer), constrain each other and are
iteratively refined until their difference falls below a threshold. Due to the
iteration mechanism and the leverages, our approach achieves high accuracy.
Moreover, features such as not needing to record the sampled data and being
easy to extend to various execution modes (e.g., an online mode) make our
approach well suited to big data. Experiments show that our approach performs
very well: compared with uniform sampling, it can achieve answers of comparable
quality with only 1/3 of the sample size.
Comment: 17 pages, 9 figure
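A minimal sketch of the iterative scheme the abstract describes, under assumed details: a crude sketch estimate of a SUM and a leverage-weighted sample estimate are alternated until they agree within a tolerance. The particular leverage definition (deviation from the current mean guess, with a floor) and the Horvitz-Thompson-style estimator are illustrative assumptions, not the paper's estimators.

```python
# Illustrative sketch of an iterative, leverage-weighted approximate SUM.
import random

def iterative_sum_estimate(data, sample_size=500, tol=0.02, max_iter=20, seed=1):
    """Iteratively refine an estimate of sum(data) from small leverage-weighted samples."""
    rng = random.Random(seed)
    n = len(data)
    pilot = rng.sample(data, min(20, n))
    sketch = n * sum(pilot) / len(pilot)            # crude "rough picture" of the answer
    for _ in range(max_iter):
        mean_guess = sketch / n
        # Leverage: items far from the current mean guess get larger sampling weight,
        # with a floor so no inclusion probability becomes vanishingly small.
        dev = [abs(x - mean_guess) for x in data]
        floor = sum(dev) / n + 1e-12
        lev = [d + floor for d in dev]
        total = sum(lev)
        probs = [l / total for l in lev]
        idx = rng.choices(range(n), weights=probs, k=sample_size)
        # Importance-weighted (Horvitz-Thompson style) estimate of the sum.
        est = sum(data[i] / (probs[i] * sample_size) for i in idx)
        if abs(est - sketch) <= tol * abs(sketch):
            return est
        sketch = (sketch + est) / 2.0               # pull the sketch toward the new estimate
    return sketch

rng = random.Random(7)
data = [rng.gauss(50, 10) for _ in range(10_000)]
print(iterative_sum_estimate(data), sum(data))
```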
Early Accurate Results for Advanced Analytics on MapReduce
Approximate results based on samples often provide the only way in which
advanced analytical applications on very massive data sets can satisfy their
time and resource constraints. Unfortunately, methods and tools for the
computation of accurate early results are currently not supported in
MapReduce-oriented systems, although these are intended for 'big data'.
Therefore, we propose and implement a non-parametric extension of Hadoop
which allows the incremental computation of early results for arbitrary
work-flows, along with reliable on-line estimates of the degree of accuracy
achieved so far in the computation. These estimates are based on a technique
called bootstrapping that has been widely employed in statistics and can be
applied to arbitrary functions and data distributions. In this paper, we
describe our Early Accurate Result Library (EARL) for Hadoop that was designed
to minimize the changes required to the MapReduce framework. Various tests of
EARL for Hadoop are presented to characterize the frequent situations where
EARL can provide major speed-ups over the current version of Hadoop.
Comment: VLDB201
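The abstract names bootstrapping as the accuracy-estimation technique; the sketch below shows that idea in its generic form (not EARL's Hadoop implementation): the aggregate is recomputed on resamples of the current sample, and the spread of the replicates serves as an online error estimate for the early result.

```python
# Generic bootstrap error estimation for an aggregate computed on a sample.
import random
import statistics

def bootstrap_error(sample, aggregate, n_resamples=500, seed=0):
    """Return (point estimate, bootstrap standard error) for `aggregate` on `sample`."""
    rng = random.Random(seed)
    point = aggregate(sample)
    replicates = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in sample]   # resample with replacement
        replicates.append(aggregate(resample))
    return point, statistics.stdev(replicates)

# Early estimate of a mean from a small "sample" of the data, with its error bar.
rng = random.Random(42)
sample = [rng.gauss(100, 15) for _ in range(200)]
est, err = bootstrap_error(sample, statistics.mean)
print(f"mean ≈ {est:.2f} ± {err:.2f}")
```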
Global parameter identification of stochastic reaction networks from single trajectories
We consider the problem of inferring the unknown parameters of a stochastic
biochemical network model from a single measured time-course of the
concentration of some of the involved species. Such measurements are available,
e.g., from live-cell fluorescence microscopy in image-based systems biology. In
addition, fluctuation time-courses from, e.g., fluorescence correlation
spectroscopy provide additional information about the system dynamics that can
be used to more robustly infer parameters than when considering only mean
concentrations. Estimating model parameters from a single experimental
trajectory enables single-cell measurements and quantification of cell--cell
variability. We propose a novel combination of an adaptive Monte Carlo sampler,
called Gaussian Adaptation, and efficient exact stochastic simulation
algorithms that allows parameter identification from single stochastic
trajectories. We benchmark the proposed method on a linear and a non-linear
reaction network at steady state and during transient phases. In addition, we
demonstrate that the present method also provides an ellipsoidal volume
estimate of the viable part of parameter space and is able to estimate the
physical volume of the compartment in which the observed reactions take place.
Comment: Article in print as a book chapter in Springer's "Advances in Systems Biology"
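The abstract refers to efficient exact stochastic simulation algorithms; a standard example is Gillespie's direct method, sketched below on a simple birth-death network (production and degradation of a single species). The network and rate constants are illustrative assumptions, and the paper's Gaussian Adaptation inference loop is not reproduced.

```python
# Gillespie's direct method (an exact SSA) on a simple birth-death network.
import math
import random

def gillespie_birth_death(k_prod, k_deg, x0, t_end, seed=0):
    """Simulate dX: 0 -> X at rate k_prod, X -> 0 at rate k_deg * X."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while t < t_end:
        a1 = k_prod                                # propensity of production
        a2 = k_deg * x                             # propensity of degradation
        a0 = a1 + a2
        if a0 == 0.0:
            break
        t += -math.log(1.0 - rng.random()) / a0    # exponential waiting time
        if rng.random() * a0 < a1:                 # choose which reaction fires
            x += 1
        else:
            x -= 1
        trajectory.append((t, x))
    return trajectory

traj = gillespie_birth_death(k_prod=10.0, k_deg=0.1, x0=0, t_end=50.0)
print(f"{len(traj)} events, final copy number: {traj[-1][1]}")  # steady state near 100
```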
Stream Aggregation Through Order Sampling
This paper introduces a new single-pass reservoir weighted-sampling stream
aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling
is a powerful and efficient method for weighted sampling from a stream of
uniquely keyed items, there is no current algorithm that realizes the benefits
of order sampling in the context of stream aggregation over non-unique keys. A
naive approach that order-samples regardless of key and then aggregates the
results is hopelessly inefficient. In contrast, our proposed algorithm uses a single
persistent random variable across the lifetime of each key in the cache, and
maintains unbiased estimates of the key aggregates that can be queried at any
point in the stream. The basic approach can be supplemented with a Sample and
Hold pre-sampling stage with a sampling rate adaptation controlled by PBA. This
approach represents a considerable reduction in computational complexity
compared with the state of the art in adapting Sample and Hold to operate with
a fixed cache size. Concerning statistical properties, we prove that PBA
provides unbiased estimates of the true aggregates. We analyze the
computational complexity of PBA and its variants, and provide a detailed
evaluation of its accuracy on synthetic and trace data. Weighted relative error
is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive
Sample and Hold; there is also substantial improvement for rank queries.
Comment: 10 page
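For context, the sketch below shows the order-sampling building block the abstract builds on: classic priority sampling of uniquely keyed weighted items (Duffield, Lund, Thorup), where each item receives priority w/u with u uniform in (0, 1], the k largest priorities are retained, and each kept weight is estimated as max(w, tau) with tau the (k+1)-st largest priority. PBA's treatment of non-unique keys via a persistent per-key random variable is not reproduced here.

```python
# Priority sampling over a stream of uniquely keyed, positively weighted items.
import heapq
import random

def priority_sample(stream, k, seed=0):
    """stream: iterable of (key, weight). Returns {key: unbiased weight estimate}."""
    rng = random.Random(seed)
    heap = []                                  # min-heap of (priority, key, weight)
    threshold = 0.0                            # largest priority not retained so far
    for key, weight in stream:
        u = 1.0 - rng.random()                 # uniform in (0, 1]
        priority = weight / u
        if len(heap) < k:
            heapq.heappush(heap, (priority, key, weight))
        elif priority > heap[0][0]:
            threshold = max(threshold, heap[0][0])   # evicted priority
            heapq.heapreplace(heap, (priority, key, weight))
        else:
            threshold = max(threshold, priority)     # rejected priority
    # Each retained key's weight estimate is max(weight, tau); non-retained keys count 0.
    return {key: max(weight, threshold) for _, key, weight in heap}

items = [(f"flow{i}", w) for i, w in enumerate([1, 2, 50, 3, 100, 2, 4, 80, 1, 5])]
print(priority_sample(items, k=4))
```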