
    Variability in data streams

    We consider the problem of tracking with small relative error an integer function f(n) defined by a distributed update stream f'(n). Existing streaming algorithms with worst-case guarantees for this problem assume f(n) to be monotone; there are very large lower bounds on the space requirements for summarizing a distributed non-monotonic stream, often linear in the size n of the stream. Input streams that give rise to large space requirements are highly variable, making relatively large jumps from one timestep to the next. However, streams often vary slowly in practice. What has heretofore been lacking is a framework for non-monotonic streams that admits algorithms whose worst-case performance is as good as existing algorithms for monotone streams and degrades gracefully for non-monotonic streams as those streams vary more quickly. In this paper we propose such a framework. We introduce a new stream parameter, the "variability" v, deriving its definition in a way that shows it to be a natural parameter to consider for non-monotonic streams. It is also a useful parameter. From a theoretical perspective, we can adapt existing algorithms for monotone streams to work for non-monotonic streams, with only minor modifications, in such a way that they reduce to the monotone case when the stream happens to be monotone, and in such a way that we can refine the worst-case communication bounds from Θ(n) to Õ(v). From a practical perspective, we demonstrate that v can be small in practice by proving that v is O(log f(n)) for monotone streams and o(n) for streams that are "nearly" monotone or that are generated by random walks. We expect v to be o(n) for many other interesting input classes as well. Comment: submitted to ICALP 2015 (here, full-page formatting)
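
    The monotone-stream technique the paper adapts can be illustrated with a toy tracker that communicates only when its counter has grown by a (1 + eps) factor, giving relative error eps with roughly log base (1 + eps) of f(n) messages. This is a minimal sketch of the standard monotone trick, not the paper's protocol; class and variable names are our own.

```python
class MonotoneTracker:
    """Track a monotonically increasing counter with relative error eps,
    sending a message only when the value grows by a (1 + eps) factor."""

    def __init__(self, eps):
        self.eps = eps
        self.f = 0            # true value at the site
        self.last_sent = 0    # value last communicated to the coordinator
        self.messages = 0     # number of messages sent

    def increment(self):
        self.f += 1
        # Send only when the coordinator's copy would violate the
        # relative-error guarantee.
        if self.f > (1 + self.eps) * self.last_sent:
            self.last_sent = self.f
            self.messages += 1

    def estimate(self):
        return self.last_sent  # coordinator-side estimate


tracker = MonotoneTracker(eps=0.1)
for _ in range(100_000):
    tracker.increment()
```

    After 100,000 increments the coordinator's estimate is within a (1 + eps) factor of the true count, yet only on the order of a hundred messages were sent rather than 100,000.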

    Time lower bounds for nonadaptive turnstile streaming algorithms

    We say a turnstile streaming algorithm is "non-adaptive" if, during updates, the memory cells written and read depend only on the index being updated and on random coins tossed at the beginning of the stream (and not on the memory contents of the algorithm). Memory cells read during queries may be decided upon adaptively. All known turnstile streaming algorithms in the literature are non-adaptive. We prove the first non-trivial update-time lower bounds for both randomized and deterministic turnstile streaming algorithms, which hold when the algorithms are non-adaptive. While there has been abundant success in proving space lower bounds, there have been no non-trivial update-time lower bounds in the turnstile model. Our lower bounds hold against classically studied problems such as heavy hitters, point query, entropy estimation, and moment estimation. In some cases of deterministic algorithms, our lower bounds nearly match known upper bounds.

    A High-Performance Algorithm for Identifying Frequent Items in Data Streams

    Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited many times due to its practicality and applicability. We describe a highly optimized version of Misra and Gries' algorithm that is suitable for deployment in industrial settings. Our code is made public via an open-source library called DataSketches that is already used by several companies and production systems. Our algorithm improves on two theoretical and practical aspects of prior work. First, it handles weighted updates in amortized constant time, a common requirement in practice. Second, it uses a simple and fast method for merging summaries that asymptotically improves on prior work even for unweighted streams. We describe experiments confirming that our algorithms are more efficient than prior proposals. Comment: Typo correction
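
    For reference, a textbook (unweighted, non-amortized) version of the 1982 Misra-Gries summary looks like this; the optimized DataSketches implementation the paper describes is substantially faster, but the guarantee is the same: with k counters, every item occurring more than N/(k+1) times in a stream of length N survives, and each counter undercounts its item's true frequency by at most N/(k+1).

```python
def misra_gries(stream, k):
    """Textbook Misra-Gries summary with at most k counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Stream item matched no counter and all k are in use:
            # decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# 96-element stream: "a" x 50, "b" x 30, then 16 rare items.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 2
summary = misra_gries(stream, k=2)
```

    Here "a" occurs 50 > 96/3 = 32 times, so it is guaranteed to survive; in this particular stream "b" survives as well, with counters undercounting by exactly the 16 decrements.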

    An algebraic approach to complexity of data stream computations

    We consider a basic problem in the general data streaming model, namely, to estimate a vector f ∈ ℤ^n that is arbitrarily updated (i.e., incremented or decremented) coordinate-wise. The estimate f̂ ∈ ℤ^n must satisfy ‖f̂ − f‖_∞ ≤ ϵ‖f‖_1, that is, for all i, |f̂_i − f_i| ≤ ϵ‖f‖_1. The problem is known to have a randomized space upper bound of Õ(ϵ^{-1}) [cm:jalgo], a space lower bound of Ω(ϵ^{-1} log(ϵn)) [bkmt:sirocco03], and a deterministic space upper bound of Õ(ϵ^{-2}) bits. (The Õ and Ω̃ notations suppress poly-logarithmic factors in n, log ϵ^{-1}, ‖f‖_∞, and log δ^{-1}, where δ is the error probability of the randomized algorithm.) We show that any deterministic algorithm for this problem requires Ω(ϵ^{-2} log ‖f‖_1) bits of space. Comment: Revised version
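
    The randomized Õ(ϵ^{-1}) upper bound is achieved by sketches in the Count-Min family. A minimal Count-Min sketch for streams with non-negative updates is shown below to illustrate the ‖f̂ − f‖_∞ ≤ ϵ‖f‖_1 guarantee; the salted-hash construction is a simplification of the pairwise-independent hash families the analysis assumes, and all names here are illustrative.

```python
import random


class CountMin:
    """Minimal Count-Min sketch. With width w = ceil(e/eps) and depth
    d = ceil(ln(1/delta)), each estimate satisfies
    f_i <= est_i <= f_i + eps * ||f||_1 with probability >= 1 - delta,
    for streams with non-negative updates."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # One salt per row stands in for a pairwise-independent hash family.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _cell(self, row, x):
        return hash((self.salts[row], x)) % self.width

    def update(self, x, c=1):
        for r in range(self.depth):
            self.table[r][self._cell(r, x)] += c

    def query(self, x):
        # Every row overestimates (updates are non-negative), so the
        # minimum over rows is the tightest overestimate.
        return min(self.table[r][self._cell(r, x)] for r in range(self.depth))


cm = CountMin(width=100, depth=5)
for i in range(1000):
    cm.update(i % 10)          # each of 10 items appears 100 times
```

    For turnstile streams with negative updates (as in this paper's model), the Count-Sketch/Count-Median variant is used instead, since the "minimum over rows" step relies on non-negativity.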

    Coresets and Sketches

    Geometric data summarization has become an essential tool both in geometric approximation algorithms and where geometry intersects with big data problems. In linear or near-linear time, large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries, whose results approximate those of the full data set. Coresets and sketches are the two most important classes of these summaries. We survey five types of coresets and sketches: shape-fitting, density estimation, high-dimensional vectors, high-dimensional point sets / matrices, and clustering. Comment: Near-final version of Chapter 49 in Handbook on Discrete and Computational Geometry, 3rd edition

    Straggler Identification in Round-Trip Data Streams via Newton's Identities and Invertible Bloom Filters

    We introduce the straggler identification problem, in which an algorithm must determine the identities of the remaining members of a set after it has had a large number of insertion and deletion operations performed on it, and now has relatively few remaining members. The goal is to do this in o(n) space, where n is the total number of identities. The straggler identification problem has applications, for example, in determining the set of unacknowledged packets in a high-bandwidth multicast data stream. We provide a deterministic solution to the straggler identification problem that uses only O(d log n) bits and is based on a novel application of Newton's identities for symmetric polynomials. This solution can identify any subset of d stragglers from a set of n O(log n)-bit identifiers, assuming that there are no false deletions of identities not already in the set. Indeed, we give a lower bound argument that shows that any small-space deterministic solution to the straggler identification problem cannot be guaranteed to handle false deletions. Nevertheless, we show that there is a simple randomized solution using O(d log n log(1/epsilon)) bits that can maintain a multiset and solve the straggler identification problem, tolerating false deletions, where epsilon>0 is a user-defined parameter bounding the probability of an incorrect response. This randomized solution is based on a new type of Bloom filter, which we call the invertible Bloom filter.Comment: Fuller version of paper appearing in 10th Worksh. Algorithms and Data Structures, Halifax, Nova Scotia, 200
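
    The peeling idea behind the invertible Bloom filter can be sketched as follows: each cell keeps a count, a sum of the identifiers hashed to it, and a checksum; inserts add into k cells and deletes subtract, so after the stream only the d stragglers remain, and "pure" cells (count 1 with a consistent checksum) can be repeatedly peeled to recover them. This is an illustrative toy, not the paper's exact construction, and all names are our own.

```python
import hashlib
import random


def _checksum(x):
    # Fingerprint used to recognize a "pure" cell (exactly one item in it).
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % (1 << 32)


class InvertibleBloomFilter:
    def __init__(self, m, k=3, seed=0):
        self.m, self.k, self.seed = m, k, seed
        self.cells = [[0, 0, 0] for _ in range(m)]   # [count, id_sum, check_sum]

    def _cells_for(self, x):
        rng = random.Random(hash((self.seed, x)))
        return {rng.randrange(self.m) for _ in range(self.k)}

    def _apply(self, x, sign):
        for i in self._cells_for(x):
            c = self.cells[i]
            c[0] += sign
            c[1] += sign * x
            c[2] += sign * _checksum(x)

    def insert(self, x): self._apply(x, +1)
    def delete(self, x): self._apply(x, -1)

    def list_stragglers(self):
        out, progress = [], True
        while progress:
            progress = False
            for c in self.cells:
                if c[0] == 1 and c[2] == _checksum(c[1]):
                    x = c[1]
                    out.append(x)
                    self.delete(x)       # peel x out of all its cells
                    progress = True
        return out


ibf = InvertibleBloomFilter(m=50)
for x in range(1000):
    ibf.insert(x)
for x in range(1000):
    if x not in (17, 42, 99):
        ibf.delete(x)
```

    With 1000 insertions and 997 deletions, the 50-cell structure (far smaller than the set) lists exactly the three stragglers, assuming no false deletions, mirroring the randomized solution's behavior.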

    Sliding Bloom Filters

    A Bloom filter is a method for reducing the space (memory) required to represent a set by allowing a small error probability. In this paper we consider a "Sliding Bloom Filter": a data structure that, given a stream of elements, supports membership queries over the set of the last n elements (a sliding window), while allowing a small error probability. We formally define the data structure and its relevant parameters and analyze the time and memory requirements needed to achieve them. We give a low-space construction that runs in O(1) time per update with high probability (that is, for all sequences, with high probability all operations take constant time) and provide an almost matching lower bound on the space, showing that our construction has the best possible space consumption up to an additive lower-order term.
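
    For intuition, a folklore approximation of a sliding Bloom filter keeps two plain Bloom filters as alternating "generations", retiring the older one every n/2 insertions; a query checks both. This two-generation scheme is much cruder than the paper's near-optimal construction (it covers between n/2 and n recent elements rather than exactly n), and all names here are illustrative.

```python
class SlidingBloom:
    """Approximate sliding-window membership via two alternating Bloom filters."""

    def __init__(self, n, m_bits=1 << 14, k=5, seed=0):
        self.half = max(1, n // 2)      # insertions per generation
        self.m, self.k, self.seed = m_bits, k, seed
        self.cur = [False] * m_bits     # current generation
        self.prev = [False] * m_bits    # previous generation
        self.count = 0                  # insertions into current generation

    def _bits(self, x):
        return [hash((self.seed, i, x)) % self.m for i in range(self.k)]

    def insert(self, x):
        if self.count == self.half:
            # Retire the older generation; its elements age out of the window.
            self.prev, self.cur = self.cur, [False] * self.m
            self.count = 0
        for b in self._bits(x):
            self.cur[b] = True
        self.count += 1

    def query(self, x):
        bits = self._bits(x)
        return all(self.cur[b] for b in bits) or all(self.prev[b] for b in bits)


sb = SlidingBloom(n=100)
for x in range(1000):
    sb.insert(x)
```

    A recently inserted element is always reported present, while an element far outside the window is reported absent up to the Bloom filters' false-positive probability.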

    Achieving Near MAP Performance with an Excited Markov Chain Monte Carlo MIMO Detector

    We introduce a revised derivation of the bitwise Markov Chain Monte Carlo (MCMC) multiple-input multiple-output (MIMO) detector. The new approach resolves the previously reported high-SNR stalling problem of MCMC without the need for hybridization with another detector method or adding heuristic temperature scaling factors. Another common problem with MCMC algorithms is their unknown convergence time, which makes predictable fixed-length implementations problematic. When an insufficient number of iterations is used on a slowly converging example, the output LLRs can be unstable and overconfident. Therefore, we develop a method to identify rare slowly converging runs and mitigate their degrading effects on the soft-output information. This improves forward-error-correcting code performance and removes a symptomatic error floor in BER plots. Next, pseudo-convergence is identified with a novel way to visualize the internal behavior of the Gibbs sampler. An effective and efficient pseudo-convergence detection and escape strategy is suggested. Finally, the new excited MCMC (X-MCMC) detector is shown to have near maximum-a-posteriori (MAP) performance even with challenging, realistic, highly correlated channels at the maximum MIMO size and modulation rate supported by the 802.11ac WiFi specification, 8x8 MIMO with 256 QAM.
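
    The structure the paper starts from, plain bitwise Gibbs sampling over the transmitted symbols, can be sketched for a toy real-valued BPSK system y = H s + noise. This omits the excitation and pseudo-convergence fixes the paper contributes, and the dimensions, SNR, and names are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system: y = H s + noise, BPSK symbols s in {-1, +1}^nt.
nt, nr, sigma = 4, 4, 0.5
H = rng.normal(size=(nr, nt))
s_true = rng.choice([-1.0, 1.0], size=nt)
y = H @ s_true + sigma * rng.normal(size=nr)


def gibbs_detect(y, H, sigma, iters=200):
    """Resample each symbol from its conditional distribution given the
    others, tracking the lowest-cost vector seen (hard-decision output)."""
    nt = H.shape[1]
    s = rng.choice([-1.0, 1.0], size=nt)                 # random start
    best, best_cost = s.copy(), float(np.sum((y - H @ s) ** 2))
    for _ in range(iters):
        for j in range(nt):
            # Residual cost for each choice of symbol j, others fixed.
            cost = {}
            for b in (-1.0, 1.0):
                s[j] = b
                cost[b] = float(np.sum((y - H @ s) ** 2))
            # P(s_j = +1 | rest) under Gaussian noise of variance sigma^2;
            # clip the exponent to avoid overflow.
            d = np.clip((cost[1.0] - cost[-1.0]) / (2 * sigma**2), -700, 700)
            p_plus = 1.0 / (1.0 + np.exp(d))
            s[j] = 1.0 if rng.random() < p_plus else -1.0
            if cost[s[j]] < best_cost:
                best, best_cost = s.copy(), cost[s[j]]
    return best


s_hat = gibbs_detect(y, H, sigma)
```

    At high SNR the exponent grows large and the sampler becomes nearly deterministic, which is the stalling behavior the paper's excited X-MCMC derivation addresses.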

    Online Algorithms for Factorization-Based Structure from Motion

    We present a family of online algorithms for real-time factorization-based structure from motion, leveraging a relationship between incremental singular value decomposition and recently proposed methods for online matrix completion. Our methods are orders of magnitude faster than the previous state of the art, can handle missing data and a variable number of feature points, and are robust to noise and sparse outliers. We demonstrate our methods on both real and synthetic sequences and show that they perform well in both online and batch settings. We also provide an implementation that can produce 3D models in real time using a laptop with a webcam.
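
    The incremental-SVD building block can be sketched as a rank-one column update: project the new column onto the current subspace, form a small (k+1) x (k+1) core matrix, and take its SVD. This is the standard recipe in simplified form (no truncation, no missing-data handling, and not the authors' exact method).

```python
import numpy as np


def svd_append_column(U, s, Vt, c):
    """Update a thin SVD A = U @ diag(s) @ Vt when column c is appended."""
    p = U.T @ c                  # component of c inside the current subspace
    r = c - U @ p                # residual orthogonal to the subspace
    rho = np.linalg.norm(r)      # assumed > 0 here (c not in the subspace)
    j = r / rho
    k = len(s)
    # Small core matrix whose SVD yields the updated factors.
    K = np.block([[np.diag(s), p[:, None]],
                  [np.zeros((1, k)), np.array([[rho]])]])
    Uk, sk, Vtk = np.linalg.svd(K)
    U_new = np.hstack([U, j[:, None]]) @ Uk
    # Extend V with a row/column for the new data column.
    W = np.block([[Vt, np.zeros((k, 1))],
                  [np.zeros((1, Vt.shape[1])), np.ones((1, 1))]])
    return U_new, sk, Vtk @ W


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
c = rng.normal(size=6)
U2, s2, Vt2 = svd_append_column(U, s, Vt, c)
A2 = np.hstack([A, c[:, None]])
```

    The update costs only the SVD of the small core matrix rather than a full recomputation, which is what makes real-time factorization feasible as new frames arrive.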

    Perfect L_p Sampling in a Data Stream

    In this paper, we resolve the one-pass space complexity of L_p sampling for p ∈ (0,2). Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector f ∈ ℝ^n, a perfect L_p sampler must output an index i with probability |f_i|^p/‖f‖_p^p, and is allowed to fail with some probability δ. So far, for p > 0, no algorithm has been shown to solve the problem exactly using poly(log n) bits of space. In 2010, Monemizadeh and Woodruff introduced an approximate L_p sampler, which outputs i with probability (1 ± ν)|f_i|^p/‖f‖_p^p, using space polynomial in ν^{-1} and log n. The space complexity was later reduced by Jowhari, Sağlam, and Tardos to roughly O(ν^{-p} log² n log δ^{-1}) for p ∈ (0,2), which tightly matches the Ω(log² n log δ^{-1}) lower bound in terms of n and δ, but is loose in terms of ν. Given these nearly tight bounds, it is perhaps surprising that no lower bound exists in terms of ν; not even a bound of Ω(ν^{-1}) is known. In this paper, we explain this phenomenon by demonstrating the existence of an O(log² n log δ^{-1})-bit perfect L_p sampler for p ∈ (0,2). This shows that ν need not factor into the space of an L_p sampler, which closes the complexity of the problem for this range of p. For p = 2, our bound is O(log³ n log δ^{-1}) bits, which matches the prior best known upper bound in terms of n and δ, but has no dependence on ν. For p < 2, our bound holds in the random oracle model, matching the lower bounds in that model. Moreover, we show that our algorithm can be derandomized with only an O((log log n)²) blow-up in the space (and no blow-up for p = 2). Our derandomization technique is general, and can be used to derandomize a large class of linear sketches. Comment: An earlier version of this work appeared in FOCS 2018, but contained an error in the derandomization. In this version, we correct this issue, albeit with a (log log n)²-factor increase in the space required to derandomize the algorithm for p < 2. Our results in the random oracle model and for p = 2 are unaffected. We also give alternative algorithms and additional applications.
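
    The distribution a perfect L_p sampler must produce is easy to state in code given full (non-streaming) access to f: draw index i with probability |f_i|^p/‖f‖_p^p. The naive sampler below uses linear space and exists purely to illustrate the target distribution the streaming algorithm must match in polylogarithmic space.

```python
import numpy as np


def lp_sample(f, p, rng):
    """Draw an index i with probability |f_i|^p / ||f||_p^p (naive, linear space)."""
    w = np.abs(f) ** p
    return rng.choice(len(f), p=w / w.sum())


rng = np.random.default_rng(0)
f = np.array([3.0, -1.0, 0.0, 2.0])
counts = np.bincount([lp_sample(f, p=1.0, rng=rng) for _ in range(60_000)],
                     minlength=4)
```

    For p = 1, index 0 should be drawn with probability |3|/6 = 1/2, and the zero coordinate should never be drawn; the empirical counts reflect exactly that.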