
    On the Power of Adaptivity in Sparse Recovery

    The goal of (stable) sparse recovery is to recover a k-sparse approximation x* of a vector x from linear measurements of x. Specifically, the goal is to recover x* such that ||x - x*||_p <= C min_{k-sparse x'} ||x - x'||_q for some constant C and norm parameters p and q. It is known that, for p = q = 1 or p = q = 2, this task can be accomplished using m = O(k log(n/k)) non-adaptive measurements [CRT06] and that this bound is tight [DIPW10, FPRU10, PW11]. In this paper we show that if one is allowed to perform measurements that are adaptive, then the number of measurements can be considerably reduced. Specifically, for C = 1 + eps and p = q = 2 we show:
    - A scheme with m = O((1/eps) k log log(n eps/k)) measurements that uses O(log* k log log(n eps/k)) rounds. This is a significant improvement over the best possible non-adaptive bound.
    - A scheme with m = O((1/eps) k log(k/eps) + k log(n/k)) measurements that uses two rounds. This improves over the best possible non-adaptive bound.
    To the best of our knowledge, these are the first results of this type. As an independent application, we show how to solve the problem of finding a duplicate in a data stream of n items drawn from {1, 2, ..., n-1} using O(log n) bits of space and O(log log n) passes, improving over the best possible space complexity achievable using a single pass. Comment: 18 pages; appearing at FOCS 2011
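
    For intuition about the l2/l2 guarantee above: the benchmark min_{k-sparse x'} ||x - x'||_2 is attained by keeping the k largest-magnitude entries of x. A minimal Python sketch of that benchmark and the (1 + eps) recovery criterion (helper names are illustrative; this is not the paper's measurement scheme):

```python
import numpy as np

def best_k_sparse(x: np.ndarray, k: int) -> np.ndarray:
    """The k-sparse vector minimizing ||x - x'||_2: keep the k largest entries."""
    out = np.zeros_like(x)
    top = np.argsort(np.abs(x))[-k:]   # indices of the k largest magnitudes
    out[top] = x[top]
    return out

def meets_l2_guarantee(x, x_hat, k, eps):
    """Check ||x - x_hat||_2 <= (1 + eps) * min_{k-sparse x'} ||x - x'||_2."""
    opt = np.linalg.norm(x - best_k_sparse(x, k))
    return np.linalg.norm(x - x_hat) <= (1 + eps) * opt
```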

    Lower Bounds for Sparse Recovery

    We consider the following k-sparse recovery problem: design an m x n matrix A, such that for any signal x, given Ax we can efficiently recover x' satisfying ||x - x'||_1 <= C min_{k-sparse x''} ||x - x''||_1. It is known that there exist matrices A with this property that have only O(k log(n/k)) rows. In this paper we show that this bound is tight. Our bound holds even for the more general randomized version of the problem, where A is a random variable and the recovery algorithm is required to work for any fixed x with constant probability (over A). Comment: 11 pages. Appeared at SODA 2010

    Stream Sampling for Frequency Cap Statistics

    Unaggregated data, in streamed or distributed form, is prevalent and comes from diverse application domains, including interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are distinct, the number of active keys in a specified segment, and sum, the sum of the frequencies of keys in the segment. Both are special cases of cap statistics, defined as the sum of frequencies capped by a parameter T, which are popular in online advertising platforms. Aggregation by key, however, is costly, requiring state proportional to the number of distinct keys; we are therefore interested in estimating these statistics, or more generally sampling the data, without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design provides the first effective solution for general frequency cap statistics. Our ℓ-capped samples provide estimates with tight statistical guarantees for cap statistics with T = Θ(ℓ) and nonnegative unbiased estimates of any monotone non-decreasing frequency statistic. An added benefit of our unified design is facilitating multi-objective samples, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample. Comment: 21 pages, 4 figures; preliminary version will appear in KDD 2015
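
    To make cap statistics concrete, here is a short Python sketch that computes them by full aggregation (shown only to pin down the definition; the paper's point is estimating such statistics without keeping per-key state):

```python
from collections import Counter

def cap_statistic(elements, T):
    """Sum over keys of min(T, frequency of key)."""
    freq = Counter(elements)   # full aggregation: state per distinct key,
                               # exactly the cost the paper's samples avoid
    return sum(min(T, f) for f in freq.values())

stream = ["u1", "u2", "u1", "u3", "u1", "u2"]   # frequencies: u1=3, u2=2, u3=1
print(cap_statistic(stream, T=1))   # distinct: 3 active keys
print(cap_statistic(stream, T=6))   # cap above max frequency, equals sum: 6
print(cap_statistic(stream, T=2))   # capped at 2: 2 + 2 + 1 = 5
```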

    External inverse pattern matching

    We consider the external inverse pattern matching problem: given a text t of length n over an ordered alphabet Σ with |Σ| = σ, and a number m ≤ n, find a pattern p ∈ Σ^m which is not a subword of t and which maximizes the sum of Hamming distances between p and all subwords of t of length m. We present an optimal O(n log σ)-time algorithm for the external inverse pattern matching problem, which substantially improves the only known polynomial O(nm log σ)-time algorithm, introduced by Amir, Apostolico and Lewenstein. Moreover, we discuss a fast parallel implementation of our algorithm on the CREW PRAM model.
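
    As a point of reference for the objective being maximized, here is a brute-force Python sketch that scores a single candidate pattern in O(nm) time (illustrative only; the paper's algorithm performs the full search in O(n log σ) time):

```python
def hamming_sum(text: str, p: str) -> int:
    """Sum of Hamming distances between p and every length-|p| subword of text."""
    m = len(p)
    return sum(
        sum(a != b for a, b in zip(text[i:i + m], p))
        for i in range(len(text) - m + 1)
    )

def is_subword(text: str, p: str) -> bool:
    return p in text

# Example: a candidate must not occur in the text and should score high.
text, p = "abracadabra", "zzz"
print(is_subword(text, p), hamming_sum(text, p))   # False 27
```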

    Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions

    We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of m elements from a universe of size n, its profile is a vector φ whose i-th entry φ_i represents the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any entry φ_i up to an additive error of ±εD using O((1/ε²)(log n + log m)) bits of space, where D is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector φ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile φ̂ with the following guarantees in terms of space and estimation error:
    - For any constant τ, with O(1/ε² + log n) bits of space, Σ_{i=1}^{τ} |φ_i - φ̂_i| ≤ εD.
    - With O((1/ε²) log(1/ε) + log n + log log m) bits of space, Σ_{i=1}^{m} |φ_i - φ̂_i| ≤ εm.
    In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on 1/ε from those that depend on n and m. We prove matching lower bounds on space in both regimes. Applying our profile estimation algorithm gives estimates within error ±εD of several symmetric functions of frequencies in O(1/ε² + log n) bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems, including estimating the Huber and Tukey losses as well as frequency cap statistics. Comment: To appear in ITCS 2024
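
    For reference, the exact profile is easy to compute with full aggregation; this sketch only fixes the definition, whereas the paper's contribution is achieving it in small space:

```python
from collections import Counter

def profile(stream):
    """phi[i] = number of distinct elements appearing exactly i times."""
    freq = Counter(stream)        # element -> frequency (linear space)
    phi = Counter(freq.values())  # frequency -> number of elements with it
    return dict(phi)

stream = [1, 2, 2, 3, 3, 3]
print(profile(stream))                 # {1: 1, 2: 1, 3: 1}
print(sum(profile(stream).values()))   # D, the number of distinct elements: 3
```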

    Cross-Sender Bit-Mixing Coding

    Scheduling to avoid packet collisions is a long-standing challenge in networking, and has become even trickier in wireless networks with multiple senders and multiple receivers. In fact, researchers have proved that even perfect scheduling can only achieve R = O(1/ln N). Here N is the number of nodes in the network, and R is the medium utilization rate. Ideally, one would hope to achieve R = Θ(1), while avoiding all the complexities of scheduling. To this end, this paper proposes cross-sender bit-mixing coding (BMC), which does not rely on scheduling. Instead, users transmit simultaneously on suitably chosen slots, and the amount of overlap in different users' slots is controlled via coding. We prove that in all possible network topologies, using BMC enables us to achieve R = Θ(1). We also prove that the space and time complexities of BMC encoding/decoding are all low-order polynomials. Comment: Published in the International Conference on Information Processing in Sensor Networks (IPSN), 2019

    Deterministic Sampling and Range Counting in Geometric Data Streams

    We present memory-efficient deterministic algorithms for constructing epsilon-nets and epsilon-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used to answer approximate online iceberg geometric queries on data streams. We use these techniques to approximate several robust statistics of geometric data streams, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Our algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries. Comment: 12 pages, 1 figure
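
    As background on what an epsilon-approximation guarantees, consider the simplest one-dimensional case, points on a line with interval ranges: taking every ⌈εn⌉-th point in sorted order yields a sample whose estimate of any interval's mass is within roughly ε of the truth. A hedged offline sketch of that textbook construction (the paper's contribution is building such samples deterministically over streams):

```python
import math

def eps_approximation_1d(points, eps):
    """Every ceil(eps * n)-th point in sorted order: an epsilon-approximation
    for interval ranges on the line."""
    pts = sorted(points)
    k = max(1, math.ceil(eps * len(pts)))
    return pts[k - 1 :: k]

def range_fraction(pts, lo, hi):
    """Fraction of pts falling in the interval [lo, hi]."""
    return sum(lo <= p <= hi for p in pts) / len(pts)

points = list(range(1000))
sample = eps_approximation_1d(points, eps=0.1)   # 10 points instead of 1000
# The sample's estimate of an interval's mass is close to the true mass:
print(range_fraction(points, 100, 450), range_fraction(sample, 100, 450))
```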

    Interval Selection in the Streaming Model

    A set of intervals is independent when the intervals are pairwise disjoint. In the interval selection problem we are given a set I of intervals and we want to find an independent subset of intervals of largest cardinality. Let α(I) denote the cardinality of an optimal solution. We discuss the estimation of α(I) in the streaming model, where we only have one-time, sequential access to the input intervals, the endpoints of the intervals lie in {1, ..., n}, and the amount of memory is constrained. For intervals of different sizes, we provide an algorithm in the data stream model that computes an estimate α̂ of α(I) that, with probability at least 2/3, satisfies (1/2)(1 - ε) α(I) ≤ α̂ ≤ α(I). For same-length intervals, we provide another algorithm in the data stream model that computes an estimate α̂ of α(I) that, with probability at least 2/3, satisfies (2/3)(1 - ε) α(I) ≤ α̂ ≤ α(I). The space used by our algorithms is bounded by a polynomial in 1/ε and log n. We also show that no better estimations can be achieved using o(n) bits of storage. We also develop new, approximate solutions to the interval selection problem, where we want to report a feasible solution, that use O(α(I)) space. Our algorithms for the interval selection problem match the optimal results by Emek, Halldórsson and Rosén [Space-Constrained Interval Selection, ICALP 2012], but are much simpler. Comment: Minor corrections
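
    For context, the offline version of interval selection has a classic exact greedy solution: sort by right endpoint and take each interval disjoint from the last one chosen. A sketch of that baseline (closed intervals assumed; the streaming algorithms above approximate α(I) without storing all intervals):

```python
def interval_selection(intervals):
    """Exact offline maximum independent set of closed intervals:
    greedy by earliest right endpoint."""
    chosen = []
    last_end = float("-inf")
    for lo, hi in sorted(intervals, key=lambda iv: iv[1]):
        if lo > last_end:            # disjoint from everything chosen so far
            chosen.append((lo, hi))
            last_end = hi
    return chosen

ivs = [(1, 3), (2, 5), (4, 7), (6, 9), (8, 10)]
print(interval_selection(ivs))   # alpha(I) = 3: [(1, 3), (4, 7), (8, 10)]
```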
