Variability in data streams
We consider the problem of tracking with small relative error an integer
function defined by a distributed update stream. Existing streaming
algorithms with worst-case guarantees for this problem assume the function to
be monotone; there are very large lower bounds on the space requirements for
summarizing a distributed non-monotonic stream, often linear in the size of
the stream.
Input streams that give rise to large space requirements are highly variable,
making relatively large jumps from one timestep to the next. However, streams
often vary slowly in practice. What has heretofore been lacking is a framework
for non-monotonic streams that admits algorithms whose worst-case performance
is as good as existing algorithms for monotone streams and degrades gracefully
for non-monotonic streams as those streams vary more quickly.
In this paper we propose such a framework. We introduce a new stream
parameter, the "variability", deriving its definition in a way that shows
it to be a natural parameter to consider for non-monotonic streams. It is also
a useful parameter. From a theoretical perspective, we can adapt existing
algorithms for monotone streams to work for non-monotonic streams, with only
minor modifications, in such a way that they reduce to the monotone case when
the stream happens to be monotone, and in such a way that the worst-case
communication bounds are refined in terms of the variability. From a
practical perspective, we demonstrate that the variability can be small in
practice by proving that it is small for monotone streams and for streams
that are "nearly" monotone or that are generated by random walks. We expect
it to be small for many other interesting input classes as well.
Comment: submitted to ICALP 2015 (here, fullpage formatting)
Time lower bounds for nonadaptive turnstile streaming algorithms
We say a turnstile streaming algorithm is "non-adaptive" if, during updates,
the memory cells written and read depend only on the index being updated and
random coins tossed at the beginning of the stream (and not on the memory
contents of the algorithm). Memory cells read during queries may be decided
upon adaptively. All known turnstile streaming algorithms in the literature are
non-adaptive.
We prove the first non-trivial update time lower bounds for both randomized
and deterministic turnstile streaming algorithms, which hold when the
algorithms are non-adaptive. While there has been abundant success in proving
space lower bounds, there have been no non-trivial update time lower bounds in
the turnstile model. Our lower bounds hold against classically studied problems
such as heavy hitters, point query, entropy estimation, and moment estimation.
In some cases of deterministic algorithms, our lower bounds nearly match known
upper bounds.
A High-Performance Algorithm for Identifying Frequent Items in Data Streams
Estimating frequencies of items over data streams is a common building block
in streaming data measurement and analysis. Misra and Gries introduced their
seminal algorithm for the problem in 1982, and the problem has since been
revisited many times due to its practicality and applicability. We describe a
highly optimized version of Misra and Gries' algorithm that is suitable for
deployment in industrial settings. Our code is made public via an open source
library called DataSketches that is already used by several companies and
production systems.
Our algorithm improves on two theoretical and practical aspects of prior
work. First, it handles weighted updates in amortized constant time, a common
requirement in practice. Second, it uses a simple and fast method for merging
summaries that asymptotically improves on prior work even for unweighted
streams. We describe experiments confirming that our algorithms are more
efficient than prior proposals.
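The Misra-Gries summary underlying this line of work is simple enough to sketch. Below is an illustrative Python version with a naive (not amortized-constant-time) weighted update and the standard merge that offsets counters by the (k+1)-st largest value; the names and structure are ours, and the DataSketches implementation differs in its data structures and optimizations:

```python
from collections import Counter

def misra_gries_update(counters, item, weight, k):
    """One weighted update to a Misra-Gries summary holding at most k counters.
    Naive weighted variant for illustration only."""
    if item in counters or len(counters) < k:
        counters[item] += weight
        return
    # Summary is full and item is untracked: charge the weight against the
    # existing counters (the classic "decrement" step, generalized to weights).
    while weight > 0 and item not in counters:
        m = min(counters.values())
        if weight < m:
            for key in counters:
                counters[key] -= weight
            weight = 0
        else:
            weight -= m
            for key in list(counters):
                counters[key] -= m
                if counters[key] == 0:
                    del counters[key]  # at least one counter frees up
            if weight > 0:
                counters[item] += weight  # place item in a freed slot

def merge(a, b, k):
    """Merge two summaries: add counters, then subtract the (k+1)-st largest
    value and keep only the positive remainders (at most k counters)."""
    merged = a + b  # Counter addition
    if len(merged) <= k:
        return merged
    cut = sorted(merged.values(), reverse=True)[k]  # (k+1)-st largest
    return Counter({x: c - cut for x, c in merged.items() if c > cut})
```

Each counter underestimates the true weight by at most the total decremented mass divided by k + 1, which is the usual Misra-Gries guarantee.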
An algebraic approach to complexity of data stream computations
We consider a basic problem in the general data streaming model, namely, to
estimate a vector f \in \mathbb{Z}^n that is arbitrarily updated (i.e.,
incremented or decremented) coordinate-wise. The estimate \hat{f} must satisfy
\|\hat{f}-f\|_\infty \le \epsilon \|f\|_1, that is, \forall i\;
(|\hat{f}_i - f_i| \le \epsilon \|f\|_1). The problem is known to have a
randomized space upper bound of \tilde{O}(\epsilon^{-1}) bits \cite{cm:jalgo},
a space lower bound of \tilde{\Omega}(\epsilon^{-1}) bits
\cite{bkmt:sirocco03}, and a deterministic space upper bound of
\tilde{O}(\epsilon^{-2}) bits.\footnote{The \tilde{O} and \tilde{\Omega}
notations suppress poly-logarithmic factors in n, \log \epsilon^{-1},
\|f\|_\infty and \delta^{-1}, where \delta is the error probability (for
randomized algorithms).} We show that any deterministic algorithm for this
problem requires space \Omega(\epsilon^{-2} \log \|f\|_1) bits.
Comment: Revised version
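For context, the randomized upper bound for this point-query guarantee is achieved by sketches in the Count-Min family. A minimal illustrative Python version follows (the class name, hashing scheme, and parameters are ours, not from any particular paper); for insertion-only streams it never underestimates and overestimates by at most \epsilon \|f\|_1 with good probability when w \approx e/\epsilon and d \approx \ln(1/\delta):

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters; each row hashes an index to
    one bucket, updates add to all d buckets, queries take the minimum."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rng.randrange(1 << 30) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _buckets(self, i):
        # One bucket per row, derived from a per-row salt.
        for r, s in enumerate(self.salts):
            yield r, hash((s, i)) % self.w

    def update(self, i, c=1):
        for r, b in self._buckets(i):
            self.table[r][b] += c

    def query(self, i):
        # Min over rows: each bucket overestimates, so min is tightest.
        return min(self.table[r][b] for r, b in self._buckets(i))
```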
Coresets and Sketches
Geometric data summarization has become an essential tool in both geometric
approximation algorithms and where geometry intersects with big data problems.
In linear or near-linear time large data sets can be compressed into a summary,
and then more intricate algorithms can be run on the summaries whose results
approximate those of the full data set. Coresets and sketches are the two most
important classes of these summaries. We survey five types of coresets and
sketches: shape-fitting, density estimation, high-dimensional vectors,
high-dimensional point sets / matrices, and clustering.
Comment: Near-final version of Chapter 49 in Handbook of Discrete and
Computational Geometry, 3rd edition
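As a concrete taste of the shape-fitting category, here is a hedged Python sketch of the classic Badoiu-Clarkson iteration for approximate minimum enclosing balls, a textbook coreset example rather than code from the survey; the function name and interface are ours:

```python
import numpy as np

def approx_meb(points, eps):
    """Badoiu-Clarkson style iteration for an approximate minimum enclosing
    ball (MEB): about 1/eps^2 passes, each nudging the center toward the
    current farthest point. The farthest points touched along the way act as
    a small coreset whose MEB approximates that of the whole set.
    Illustrative sketch, not an optimized implementation."""
    pts = np.asarray(points, dtype=float)
    c = pts[0].copy()
    coreset = {0}
    iters = int(np.ceil(1.0 / eps ** 2))
    for t in range(1, iters + 1):
        far = int(np.argmax(np.linalg.norm(pts - c, axis=1)))
        coreset.add(far)
        c = c + (pts[far] - c) / (t + 1)  # shrinking step size 1/(t+1)
    radius = float(np.max(np.linalg.norm(pts - c, axis=1)))
    return c, radius, sorted(coreset)
```

After 1/eps^2 iterations the center is within roughly eps times the optimal radius of the optimal center, giving a (1 + eps)-approximate ball.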
Straggler Identification in Round-Trip Data Streams via Newton's Identities and Invertible Bloom Filters
We introduce the straggler identification problem, in which an algorithm must
determine the identities of the remaining members of a set after it has had a
large number of insertion and deletion operations performed on it, and now has
relatively few remaining members. The goal is to do this in o(n) space, where n
is the total number of identities. The straggler identification problem has
applications, for example, in determining the set of unacknowledged packets in
a high-bandwidth multicast data stream. We provide a deterministic solution to
the straggler identification problem that uses only O(d log n) bits and is
based on a novel application of Newton's identities for symmetric polynomials.
This solution can identify any subset of d stragglers from a set of n O(log
n)-bit identifiers, assuming that there are no false deletions of identities
not already in the set. Indeed, we give a lower bound argument that shows that
any small-space deterministic solution to the straggler identification problem
cannot be guaranteed to handle false deletions. Nevertheless, we show that
there is a simple randomized solution using O(d log n log(1/epsilon)) bits that
can maintain a multiset and solve the straggler identification problem,
tolerating false deletions, where epsilon>0 is a user-defined parameter
bounding the probability of an incorrect response. This randomized solution is
based on a new type of Bloom filter, which we call the invertible Bloom filter.
Comment: Fuller version of paper appearing in 10th Worksh. Algorithms and Data
Structures, Halifax, Nova Scotia, 2007
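The flavor of the Newton's-identities approach can be seen in a toy Python example for at most two stragglers: maintain power sums of the current multiset under insertions and deletions, then recover the remaining ids as roots of the implied polynomial. This is our own d = 2 illustration; the paper generalizes to any d stragglers in O(d log n) bits:

```python
import math

class PowerSumStraggler:
    """Tracks the count, sum, and sum of squares of the current multiset of
    integer ids. If at most two distinct ids remain (each once), they can
    be recovered exactly. Toy case of the Newton's-identities technique."""
    def __init__(self):
        self.count = 0
        self.s1 = 0  # sum of ids
        self.s2 = 0  # sum of squared ids

    def insert(self, x):
        self.count += 1
        self.s1 += x
        self.s2 += x * x

    def delete(self, x):
        self.count -= 1
        self.s1 -= x
        self.s2 -= x * x

    def stragglers(self):
        if self.count == 0:
            return []
        if self.count == 1:
            return [self.s1]
        if self.count == 2:
            # a + b = s1 and a^2 + b^2 = s2 imply ab = (s1^2 - s2) / 2,
            # so a and b are the roots of x^2 - s1*x + ab = 0.
            prod = (self.s1 * self.s1 - self.s2) // 2
            disc = self.s1 * self.s1 - 4 * prod
            root = math.isqrt(disc)
            a = (self.s1 + root) // 2
            return [a, self.s1 - a]
        raise ValueError("more than two stragglers")
```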
Sliding Bloom Filters
A Bloom filter is a method for reducing the space (memory) required for
representing a set by allowing a small error probability. In this paper we
consider a \emph{Sliding Bloom Filter}: a data structure that, given a stream
of elements, supports membership queries of the set of the last n elements (a
sliding window), while allowing a small error probability. We formally define
the data structure and its relevant parameters and analyze the time and memory
requirements needed to achieve them. We give a low space construction that runs
in O(1) time per update with high probability (that is, for all sequences with
high probability all operations take constant time) and provide an almost
matching lower bound on the space that shows that our construction has the best
possible space consumption, up to an additive lower-order term.
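For intuition, a common folklore baseline (not the paper's construction) approximates a sliding window with two alternating Bloom filter generations: the current filter holds the newest elements, and once it reaches n elements the previous generation is retired. The sketch below, with our own names and parameters, never misses any of the last n elements but may also report elements up to 2n old:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bit positions, k hashes per element."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _idx(self, x):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, x):
        for j in self._idx(x):
            self.bits[j] = 1

    def __contains__(self, x):
        return all(self.bits[j] for j in self._idx(x))

class SlidingBloom:
    """Approximate membership over the last n stream elements using two
    Bloom filter generations (folklore scheme, for illustration)."""
    def __init__(self, n, m=1 << 14, k=5):
        self.n, self.m, self.k = n, m, k
        self.cur = BloomFilter(m, k)
        self.prev = BloomFilter(m, k)
        self.filled = 0

    def add(self, x):
        if self.filled == self.n:
            # Retire the old generation; start a fresh current filter.
            self.prev, self.cur = self.cur, BloomFilter(self.m, self.k)
            self.filled = 0
        self.cur.add(x)
        self.filled += 1

    def __contains__(self, x):
        return x in self.cur or x in self.prev
```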
Achieving Near MAP Performance with an Excited Markov Chain Monte Carlo MIMO Detector
We introduce a revised derivation of the bitwise Markov Chain Monte Carlo
(MCMC) multiple-input multiple-output (MIMO) detector. The new approach
resolves the previously reported high SNR stalling problem of MCMC without the
need for hybridization with another detector method or adding heuristic
temperature scaling factors. Another common problem with MCMC algorithms is the
unknown convergence time making predictable fixed-length implementations
problematic. When an insufficient number of iterations is used on a slowly
converging example, the output LLRs can be unstable and overconfident.
Therefore, we develop a method to identify rare slowly converging runs and
mitigate their degrading effects on the soft-output information. This improves
forward-error-correcting code performance and removes a symptomatic error floor
in BER plots. Next, pseudo-convergence is identified with a novel way to
visualize the internal behavior of the Gibbs sampler. An effective and
efficient pseudo-convergence detection and escape strategy is suggested.
Finally, the new excited MCMC (X-MCMC) detector is shown to have near
maximum-a-posteriori (MAP) performance even with challenging, realistic,
highly-correlated channels at the maximum MIMO sizes and modulation rates
supported by the 802.11ac WiFi specification, 8x8 MIMO with 256-QAM.
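To make the bitwise Gibbs sampler at the heart of MCMC detection concrete, here is a toy Python sketch for BPSK symbols. This is our own illustration of the basic sampler only; it omits the paper's excitation, stalling and pseudo-convergence safeguards, and soft LLR output:

```python
import numpy as np

def gibbs_mimo_detect(y, H, sigma2, n_iter=200, seed=1):
    """Bitwise Gibbs sampling for y = H s + noise with s in {-1,+1}^k:
    resample one symbol at a time from its conditional under
    p(s | y) proportional to exp(-||y - H s||^2 / (2 sigma2)),
    and return the most frequently visited vector as a hard decision."""
    rng = np.random.default_rng(seed)
    k = H.shape[1]
    s = rng.choice([-1.0, 1.0], size=k)
    counts = {}
    for _ in range(n_iter):
        for i in range(k):
            # Energies with s_i set to +1 and to -1, other symbols fixed.
            s[i] = 1.0
            e_plus = float(np.sum((y - H @ s) ** 2))
            s[i] = -1.0
            e_minus = float(np.sum((y - H @ s) ** 2))
            # Conditional probability of s_i = +1 given the rest.
            p_plus = 1.0 / (1.0 + np.exp((e_plus - e_minus) / (2 * sigma2)))
            s[i] = 1.0 if rng.random() < p_plus else -1.0
        key = tuple(s)
        counts[key] = counts.get(key, 0) + 1
    return np.array(max(counts, key=counts.get))
```

At high SNR the conditionals become nearly deterministic, which is exactly the regime where the stalling behavior discussed in the abstract arises for the basic sampler.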
Online Algorithms for Factorization-Based Structure from Motion
We present a family of online algorithms for real-time factorization-based
structure from motion, leveraging a relationship between incremental singular
value decomposition and recently proposed methods for online matrix completion.
Our methods are orders of magnitude faster than previous state of the art, can
handle missing data and a variable number of feature points, and are robust to
noise and sparse outliers. We demonstrate our methods on both real and
synthetic sequences and show that they perform well in both online and batch
settings. We also provide an implementation which is able to produce 3D models
in real time using a laptop with a webcam.
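The incremental SVD ingredient can be sketched compactly. Below is an illustrative Brand-style column update in Python/NumPy, written by us as a simplification: it grows an exact factorization one column at a time and truncates to rank r, whereas the paper's online machinery additionally handles missing data and sparse outliers:

```python
import numpy as np

def incremental_svd(A, r):
    """Build a rank-r SVD of A by processing one column at a time.
    Returns U (m x r'), s (r',), Vt (r' x n) with U @ diag(s) @ Vt ~ A,
    exact when r is at least the rank of A."""
    m, n = A.shape
    # Initialize from the first column.
    c0 = A[:, :1]
    s0 = np.linalg.norm(c0)
    U = c0 / s0
    s = np.array([s0])
    Vt = np.ones((1, 1))
    for j in range(1, n):
        c = A[:, j]
        p = U.T @ c               # component inside the current subspace
        resid = c - U @ p         # component orthogonal to it
        rho = np.linalg.norm(resid)
        k = len(s)
        # Small core matrix whose SVD updates the factorization:
        # [A_prev, c] = [U, q] @ K @ blockdiag(Vt, 1).
        K = np.zeros((k + 1, k + 1))
        K[:k, :k] = np.diag(s)
        K[:k, k] = p
        K[k, k] = rho
        Uk, sk, Vtk = np.linalg.svd(K)
        q = resid / rho if rho > 1e-12 else np.zeros_like(c)
        U = np.hstack([U, q.reshape(-1, 1)]) @ Uk
        Vt_ext = np.zeros((k + 1, j + 1))
        Vt_ext[:k, :j] = Vt
        Vt_ext[k, j] = 1.0
        Vt = Vtk @ Vt_ext
        # Truncate back to rank r.
        U, s, Vt = U[:, :r], sk[:r], Vt[:r, :]
    return U, s, Vt
```

Each update costs an SVD of a small (k+1) x (k+1) core matrix rather than of the full data, which is what makes online, real-time use plausible.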
Perfect L_p Sampling in a Data Stream
In this paper, we resolve the one-pass space complexity of L_p sampling for
p in (0,2). Given a stream of updates (insertions and deletions) to the
coordinates of an underlying vector f in R^n, a perfect L_p sampler must
output an index i with probability |f_i|^p / ||f||_p^p, and is allowed to
fail with some probability delta. So far, for p > 0, no algorithm has been
shown to solve the problem exactly using poly(log n)-bits of space. In 2010,
Monemizadeh and Woodruff introduced an approximate L_p sampler, which outputs
i with probability (1 +/- epsilon) |f_i|^p / ||f||_p^p, using space polynomial
in epsilon^{-1} and log n. The space complexity was later reduced by Jowhari,
Sa\u{g}lam, and Tardos to roughly O(epsilon^{-max(1,p)} log^2 n log
delta^{-1}) for p in (0,2), which tightly matches the Omega(log^2 n log
delta^{-1}) lower bound in terms of n and delta, but is loose in terms of
epsilon.
Given these nearly tight bounds, it is perhaps surprising that no lower bound
exists in terms of epsilon: not even a bound of Omega(epsilon^{-1}) is known.
In this paper, we explain this phenomenon by demonstrating the existence of an
O(log^2 n log delta^{-1})-bit perfect L_p sampler for p in (0,2). This shows
that epsilon need not factor into the space of an L_p sampler, which closes
the complexity of the problem for this range of p. For p = 2, our bound is
O(log^3 n log delta^{-1}) bits, which matches the prior best known upper bound
in terms of n and delta, but has no dependence on epsilon. For p < 2, our
bound holds in the random oracle model, matching the lower bounds in that
model. Moreover, we show that our algorithm can be derandomized with only an
O((log log n)^2) blow-up in the space (and no blow-up for p = 2). Our
derandomization technique is general, and can be used to derandomize a large
class of linear sketches.
Comment: An earlier version of this work appeared in FOCS 2018, but contained
an error in the derandomization. In this version, we correct this issue,
albeit with an O((log log n)^2)-factor increase in the space required to
derandomize the algorithm for p < 2. Our results in the random oracle model
and for p = 2 are unaffected. We also give alternative algorithms and
additional applications.
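For intuition about what an L_p sampler outputs, consider the much easier insertion-only case with p = 1, where classic weighted reservoir sampling already gives a perfect sampler in one pass. The sketch below is ours and only illustrates the sampling goal; the paper's contribution is achieving this in the turnstile model, with deletions, in small space:

```python
import random

def l1_sample(stream, seed=0):
    """Perfect L1 sampling for an insertion-only stream of (index, weight)
    updates with positive weights: returns index i with probability exactly
    f_i / ||f||_1, via weighted reservoir sampling."""
    rng = random.Random(seed)
    total = 0.0
    choice = None
    for i, w in stream:
        total += w
        # Replace the held sample with probability w / total. By induction,
        # each unit of weight is equally likely to own the final sample, so
        # index i wins with probability f_i / ||f||_1 overall.
        if rng.random() < w / total:
            choice = i
    return choice
```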