    On the Power of Adaptivity in Sparse Recovery

    The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x^*$ of a vector $x$ from linear measurements of $x$. Specifically, the goal is to recover $x^*$ such that $\|x-x^*\|_p \le C \min_{k\text{-sparse } x'} \|x-x'\|_q$ for some constant $C$ and norm parameters $p$ and $q$. It is known that, for $p=q=1$ or $p=q=2$, this task can be accomplished using $m=O(k \log (n/k))$ non-adaptive measurements [CRT06] and that this bound is tight [DIPW10,FPRU10,PW11]. In this paper we show that if one is allowed to perform measurements that are adaptive, then the number of measurements can be considerably reduced. Specifically, for $C=1+\epsilon$ and $p=q=2$ we show:
    - A scheme with $m=O(\frac{1}{\epsilon} k \log\log(n\epsilon/k))$ measurements that uses $O(\log^* k \, \log\log(n\epsilon/k))$ rounds. This is a significant improvement over the best possible non-adaptive bound.
    - A scheme with $m=O(\frac{1}{\epsilon} k \log(k/\epsilon) + k \log(n/k))$ measurements that uses /two/ rounds. This improves over the best possible non-adaptive bound.
    To the best of our knowledge, these are the first results of this type. As an independent application, we show how to solve the problem of finding a duplicate in a data stream of $n$ items drawn from $\{1, 2, \ldots, n-1\}$ using $O(\log n)$ bits of space and $O(\log\log n)$ passes, improving over the best possible space complexity achievable using a single pass. Comment: 18 pages; appearing at FOCS 201
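
The duplicate-finding problem mentioned at the end of the abstract can be illustrated with a much simpler multi-pass baseline. The sketch below is not the paper's scheme: it is the classical pigeonhole binary search, which keeps a constant number of $O(\log n)$-bit counters but needs $O(\log n)$ passes rather than $O(\log\log n)$. The name `find_duplicate` and the `stream_factory` callback (which models "start a new pass") are assumptions made for this illustration.

```python
from typing import Callable, Iterable


def find_duplicate(stream_factory: Callable[[], Iterable[int]], n: int) -> int:
    """Return a repeated value among n items drawn from {1, ..., n-1}.

    stream_factory() must return a fresh iterator over the same items;
    each call models one sequential pass over the stream.
    """
    # Invariant: strictly more than (hi - lo + 1) items have a value in [lo, hi].
    lo, hi = 1, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # One pass: count the items whose value lies in the lower half.
        count = sum(1 for v in stream_factory() if lo <= v <= mid)
        if count > mid - lo + 1:
            hi = mid       # pigeonhole: the lower half contains a duplicate
        else:
            lo = mid + 1   # otherwise the upper half must be overfull
    return lo


if __name__ == "__main__":
    items = [3, 1, 4, 2, 3]  # n = 5 items from {1, ..., 4}
    print(find_duplicate(lambda: iter(items), len(items)))
```

Only `lo`, `hi`, `mid`, and `count` are stored between passes, which is what keeps the working space logarithmic in this baseline.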

    Lower Bounds for Sparse Recovery

    We consider the following $k$-sparse recovery problem: design an $m \times n$ matrix $A$, such that for any signal $x$, given $Ax$ we can efficiently recover $x'$ satisfying $\|x-x'\|_1 \le C \min_{k\text{-sparse } x''} \|x-x''\|_1$. It is known that there exist matrices $A$ with this property that have only $O(k \log (n/k))$ rows. In this paper we show that this bound is tight. Our bound holds even for the more general /randomized/ version of the problem, where $A$ is a random variable and the recovery algorithm is required to work for any fixed $x$ with constant probability (over $A$). Comment: 11 pages. Appeared at SODA 201
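
To make the recovery guarantee concrete: the right-hand side $\min_{k\text{-sparse } x''} \|x-x''\|_1$ is attained by keeping the $k$ largest-magnitude entries of $x$, so it equals the sum of the remaining magnitudes. The sketch below only checks the guarantee for a given candidate; it does not perform recovery from $Ax$. The function names are hypothetical.

```python
def best_k_sparse_l1_error(x, k):
    """l1 error of the best k-sparse approximation of x.

    Keeping the k largest-magnitude entries is optimal, so the error is
    the sum of all remaining magnitudes.
    """
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(magnitudes[k:])


def satisfies_l1_guarantee(x, x_rec, k, C):
    """Check whether ||x - x_rec||_1 <= C * (best k-sparse l1 error)."""
    err = sum(abs(a - b) for a, b in zip(x, x_rec))
    return err <= C * best_k_sparse_l1_error(x, k)


if __name__ == "__main__":
    x = [5, -3, 0.5, 0.25, 0]
    print(best_k_sparse_l1_error(x, 2))
    print(satisfies_l1_guarantee(x, [5, -3, 0, 0, 0], k=2, C=1))
```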

    Stream Sampling for Frequency Cap Statistics

    Unaggregated data, in streamed or distributed form, is prevalent and comes from diverse application domains, which include interactions of users with web services and IP traffic. Data elements have {\em keys} (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are {\em distinct}, which is the number of active keys in a specified segment, and {\em sum}, which is the sum of the frequencies of keys in the segment. Both are special cases of {\em cap} statistics, defined as the sum of frequencies {\em capped} by a parameter $T$, which are popular in online advertising platforms. Aggregation by key, however, is costly, requiring state proportional to the number of distinct keys, and therefore we are interested in estimating these statistics or, more generally, sampling the data, without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design provides the first effective solution for general frequency cap statistics. Our $\ell$-capped samples provide estimates with tight statistical guarantees for cap statistics with $T=\Theta(\ell)$ and nonnegative unbiased estimates of {\em any} monotone non-decreasing frequency statistics. An added benefit of our unified design is facilitating {\em multi-objective samples}, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample. Comment: 21 pages, 4 figures, preliminary version will appear in KDD 201
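
The definition of a cap statistic can be written out directly. The sketch below computes it exactly by aggregating per key, which is precisely the costly step the abstract aims to avoid, so it serves only as a reference definition and not as the sampling framework. The function name `cap_statistic` is an assumption for this illustration.

```python
from collections import Counter


def cap_statistic(keys, T):
    """Exact cap statistic: the sum over keys of min(T, frequency of key)."""
    freqs = Counter(keys)  # full aggregation: state grows with distinct keys
    return sum(min(T, f) for f in freqs.values())


if __name__ == "__main__":
    elements = ["a", "b", "a", "c", "a", "b"]
    print(cap_statistic(elements, 1))             # T = 1 gives distinct
    print(cap_statistic(elements, 2))             # frequencies capped at 2
    print(cap_statistic(elements, float("inf")))  # no cap gives sum
```

With $T=1$ every active key contributes exactly 1 (the distinct count), and with $T$ at least the largest frequency no key is capped (the sum).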

    External inverse pattern matching

    We consider the {\sl external inverse pattern matching} problem: given a text $t$ of length $n$ over an ordered alphabet $\Sigma$ with $|\Sigma|=\sigma$, and a number $m\le n$, find a pattern $p_e\in \Sigma^m$ which is not a subword of $t$ and which maximizes the sum of Hamming distances between $p_e$ and all subwords of $t$ of length $m$. We present an optimal $O(n\log\sigma)$-time algorithm for the external inverse pattern matching problem, which substantially improves on the only known polynomial-time algorithm, the $O(nm\log\sigma)$-time algorithm introduced by Amir, Apostolico and Lewenstein. Moreover, we discuss a fast parallel implementation of our algorithm on the CREW PRAM model.
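
The problem statement can be made concrete with an exhaustive search. The sketch below is not the $O(n\log\sigma)$-time algorithm from the abstract: it tries all $\sigma^m$ candidate patterns, so it is only usable on tiny inputs. The function name and the choice to return `(None, -1)` when every pattern occurs in the text are assumptions.

```python
from itertools import product


def external_inverse_pattern_bruteforce(text, m, alphabet):
    """Exhaustively find a length-m pattern absent from text that maximizes
    the total Hamming distance to all length-m subwords of text."""
    windows = [text[i:i + m] for i in range(len(text) - m + 1)]
    present = set(windows)
    best, best_score = None, -1
    for cand in product(sorted(alphabet), repeat=m):
        pattern = "".join(cand)
        if pattern in present:
            continue  # "external": the pattern must not occur in the text
        score = sum(sum(a != b for a, b in zip(pattern, w)) for w in windows)
        if score > best_score:
            best, best_score = pattern, score
    return best, best_score


if __name__ == "__main__":
    print(external_inverse_pattern_bruteforce("aab", 2, "ab"))
```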

    Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions

    We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of $m$ elements from a universe of size $n$, its profile is a vector $\phi$ whose $i$-th entry $\phi_i$ represents the number of distinct elements that appear in the stream exactly $i$ times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any entry $\phi_i$ up to an additive error of $\pm \epsilon D$ using $O(1/\epsilon^2 (\log n + \log m))$ bits of space, where $D$ is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector $\phi$ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile $\hat\phi$ with the following guarantees in terms of space and estimation error:
    - For any constant $\tau$, with $O(1 / \epsilon^2 + \log n)$ bits of space, $\sum_{i=1}^\tau |\phi_i - \hat\phi_i| \leq \epsilon D$.
    - With $O(1/ \epsilon^2\log (1/\epsilon) + \log n + \log \log m)$ bits of space, $\sum_{i=1}^m |\phi_i - \hat\phi_i| \leq \epsilon m$.
    In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on $1/\epsilon$ from those that depend on $n$ and $m$. We prove matching lower bounds on space in both regimes. Applying our profile estimation algorithm gives estimates within error $\pm \epsilon D$ of several symmetric functions of frequencies in $O(1/\epsilon^2 + \log n)$ bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems, including estimating the Huber and Tukey losses as well as frequency cap statistics. Comment: To appear in ITCS 202
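
For reference, the exact profile and the error measure used in the guarantees can be computed as follows. This sketch stores every distinct element, so it uses far more space than the streaming bounds in the abstract; it only illustrates the definitions of $\phi$, $D$, and $\sum_i |\phi_i - \hat\phi_i|$. The function names are assumptions.

```python
from collections import Counter


def profile(stream):
    """Exact profile: phi[i] = number of distinct elements seen exactly i times."""
    freqs = Counter(stream)        # element -> frequency
    return Counter(freqs.values()) # frequency i -> number of such elements


def profile_l1_error(phi, phi_hat):
    """Sum over i of |phi_i - phi_hat_i|; missing entries count as 0."""
    return sum(abs(phi[i] - phi_hat[i]) for i in set(phi) | set(phi_hat))


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "d", "d", "a"]
    phi = profile(stream)
    print(dict(phi))
    print(sum(phi.values()))                   # D, the number of distinct elements
    print(sum(i * c for i, c in phi.items()))  # m, the stream length
```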

    Cross-Sender Bit-Mixing Coding

    Scheduling to avoid packet collisions is a long-standing challenge in networking, and has become even trickier in wireless networks with multiple senders and multiple receivers. In fact, researchers have proved that even {\em perfect} scheduling can only achieve $\mathbf{R} = O(\frac{1}{\ln N})$. Here $N$ is the number of nodes in the network, and $\mathbf{R}$ is the {\em medium utilization rate}. Ideally, one would hope to achieve $\mathbf{R} = \Theta(1)$, while avoiding all the complexities in scheduling. To this end, this paper proposes {\em cross-sender bit-mixing coding} ({\em BMC}), which does not rely on scheduling. Instead, users transmit simultaneously on suitably-chosen slots, and the amount of overlap in different users' slots is controlled via coding. We prove that in all possible network topologies, using BMC enables us to achieve $\mathbf{R}=\Theta(1)$. We also prove that the space and time complexities of BMC encoding/decoding are all low-order polynomials. Comment: Published in the International Conference on Information Processing in Sensor Networks (IPSN), 201

    Deterministic Sampling and Range Counting in Geometric Data Streams

    We present memory-efficient deterministic algorithms for constructing epsilon-nets and epsilon-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used to answer approximate online iceberg geometric queries on data streams. We use these techniques to approximate several robust statistics of geometric data streams, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Our algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries. Comment: 12 pages, 1 figure

    Interval Selection in the Streaming Model

    A set of intervals is independent when the intervals are pairwise disjoint. In the interval selection problem we are given a set $\mathbb{I}$ of intervals and we want to find an independent subset of intervals of largest cardinality. Let $\alpha(\mathbb{I})$ denote the cardinality of an optimal solution. We discuss the estimation of $\alpha(\mathbb{I})$ in the streaming model, where we only have one-time, sequential access to the input intervals, the endpoints of the intervals lie in $\{1,...,n \}$, and the amount of memory is constrained. For intervals of different sizes, we provide an algorithm in the data stream model that computes an estimate $\hat\alpha$ of $\alpha(\mathbb{I})$ that, with probability at least $2/3$, satisfies $\tfrac 12(1-\varepsilon) \alpha(\mathbb{I}) \le \hat\alpha \le \alpha(\mathbb{I})$. For same-length intervals, we provide another algorithm in the data stream model that computes an estimate $\hat\alpha$ of $\alpha(\mathbb{I})$ that, with probability at least $2/3$, satisfies $\tfrac 23(1-\varepsilon) \alpha(\mathbb{I}) \le \hat\alpha \le \alpha(\mathbb{I})$. The space used by our algorithms is bounded by a polynomial in $\varepsilon^{-1}$ and $\log n$. We also show that no better estimations can be achieved using $o(n)$ bits of storage. We also develop new, approximate solutions to the interval selection problem, where we want to report a feasible solution, that use $O(\alpha(\mathbb{I}))$ space. Our algorithms for the interval selection problem match the optimal results by Emek, Halldórsson and Rosén [Space-Constrained Interval Selection, ICALP 2012], but are much simpler. Comment: Minor correction
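
As a point of reference for $\alpha(\mathbb{I})$, the offline problem is solved exactly by the classical greedy rule that repeatedly takes the interval with the smallest right endpoint among those disjoint from the ones already chosen. The sketch below is that offline baseline, not the streaming algorithms of the abstract: it sorts and therefore stores the whole input. It assumes closed intervals given as `(left, right)` pairs, so two intervals sharing an endpoint are treated as intersecting.

```python
def max_independent_intervals(intervals):
    """Offline greedy: a largest set of pairwise disjoint closed intervals."""
    chosen = []
    last_end = None
    # Scan by increasing right endpoint; keep an interval whenever it starts
    # strictly after the last chosen interval ends.
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or left > last_end:
            chosen.append((left, right))
            last_end = right
    return chosen


if __name__ == "__main__":
    I = [(1, 3), (2, 5), (4, 7), (6, 8), (8, 10)]
    selection = max_independent_intervals(I)
    print(selection, len(selection))  # len(selection) is alpha(I)
```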

    Lower bounds for sparse recovery

    We consider the following $k$-sparse recovery problem: design an $m \times n$ matrix $A$, such that for any signal $x$, given $Ax$ we can efficiently recover $\hat{x}$ satisfying $\|x - \hat{x}\|_1 \le C \min_{k\text{-sparse } x'} \|x - x'\|_1$. It is known that there exist matrices $A$ with this property that have only $O(k \log(n/k))$ rows. In this paper we show that this bound is tight. Our bound holds even for the more general randomized version of the problem, where $A$ is a random variable, and the recovery algorithm is required to work for any fixed $x$ with constant probability (over $A$).