On the Power of Adaptivity in Sparse Recovery
The goal of (stable) sparse recovery is to recover a k-sparse approximation
x* of a vector x from linear measurements of x. Specifically, the goal is
to recover x* such that ||x-x*||_p <= C min_{k-sparse x'} ||x-x'||_q for some
constant C and norm parameters p and q. It is known that, for p = q = 1 or
p = q = 2, this task can be accomplished using m = O(k log(n/k)) non-adaptive
measurements [CRT06] and that this bound is tight [DIPW10, FPRU10, PW11].
In this paper we show that if one is allowed to perform measurements that are
adaptive, then the number of measurements can be considerably reduced.
Specifically, for C = 1 + eps and p = q = 2 we show:
- A scheme with O((1/eps) k log log(n eps/k)) measurements that uses
O(log* k * log log(n eps/k)) rounds. This is a significant improvement over the
best possible non-adaptive bound.
- A scheme with O((1/eps) k log(k/eps) + k log(n/k)) measurements
that uses /two/ rounds. This improves over the best possible non-adaptive
bound. To the best of our knowledge, these are the first results of this type.
As an independent application, we show how to solve the problem of finding a
duplicate in a data stream of n items drawn from {1, ..., n-1} using O(log n)
bits of space and O(log log n) passes, improving over the best
possible space complexity achievable using a single pass. Comment: 18 pages; appearing at FOCS 201
Lower Bounds for Sparse Recovery
We consider the following k-sparse recovery problem: design an m x n matrix
A, such that for any signal x, given Ax we can efficiently recover x'
satisfying
||x - x'||_1 <= C min_{k-sparse x''} ||x - x''||_1.
It is known that there exist matrices A with this property that have only O(k
log (n/k)) rows.
In this paper we show that this bound is tight. Our bound holds even for the
more general /randomized/ version of the problem, where A is a random variable
and the recovery algorithm is required to work for any fixed x with constant
probability (over A). Comment: 11 pages. Appeared at SODA 201
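The guarantee above can be checked numerically: the benchmark min over k-sparse x'' of ||x - x''||_1 is simply the sum of all but the k largest magnitudes of x. A minimal sketch (function names and the toy vectors are ours):

```python
def best_k_sparse_error(x, k):
    # min over k-sparse x'' of ||x - x''||_1: the best k-sparse approximation
    # keeps the k largest-magnitude entries, so the error is the tail sum.
    return sum(sorted((abs(v) for v in x), reverse=True)[k:])

def satisfies_guarantee(x, x_hat, k, C):
    err = sum(abs(a - b) for a, b in zip(x, x_hat))
    return err <= C * best_k_sparse_error(x, k)

x = [10.0, -7.0, 0.3, -0.2, 0.1]
x_hat = [10.1, -6.9, 0.0, 0.0, 0.0]   # a near-optimal 2-sparse estimate
print(satisfies_guarantee(x, x_hat, k=2, C=2.0))
```

Here the tail error is 0.3 + 0.2 + 0.1 = 0.6, so any estimate within C * 0.6 in ℓ1 distance passes.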
Stream Sampling for Frequency Cap Statistics
Unaggregated data, in streamed or distributed form, is prevalent and comes
from diverse application domains which include interactions of users with web
services and IP traffic. Data elements have {\em keys} (cookies, users,
queries) and elements with different keys interleave. Analytics on such data
typically utilizes statistics stated in terms of the frequencies of keys. The
two most common statistics are {\em distinct}, which is the number of active
keys in a specified segment, and {\em sum}, which is the sum of the frequencies
of keys in the segment. Both are special cases of {\em cap} statistics, defined
as the sum of frequencies {\em capped} by a parameter T, which are popular in
online advertising platforms. Aggregation by key, however, is costly, requiring
state proportional to the number of distinct keys, and therefore we are
interested in estimating these statistics or more generally, sampling the data,
without aggregation. We present a sampling framework for unaggregated data that
uses a single pass (for streams) or two passes (for distributed data) and state
proportional to the desired sample size. Our design provides the first
effective solution for general frequency cap statistics. Our capped
samples provide estimates with tight statistical guarantees for cap statistics
and nonnegative unbiased estimates of {\em any} monotone
non-decreasing frequency statistics. An added benefit of our unified design is
facilitating {\em multi-objective samples}, which provide estimates with
statistical guarantees for a specified set of different statistics, using a
single, smaller sample. Comment: 21 pages, 4 figures, preliminary version will appear in KDD 201
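To make the cap statistic concrete, here is the costly aggregation-by-key baseline that the paper's samplers avoid: sum over keys of min(frequency, T). The function name and toy stream are ours; note that T = 1 recovers the distinct count and a large T recovers the plain sum:

```python
from collections import Counter

def cap_statistic(stream, T):
    # Exact cap_T statistic by full aggregation: state proportional to the
    # number of distinct keys, which is what the paper's one-pass,
    # sample-size-state framework is designed to avoid.
    freq = Counter(stream)
    return sum(min(f, T) for f in freq.values())

keys = ["a", "b", "a", "c", "a", "b"]
print(cap_statistic(keys, T=2))   # a: min(3,2) + b: min(2,2) + c: min(1,2) = 5
```
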
External inverse pattern matching
We consider the {\sl external inverse pattern matching} problem. Given a text t of length n over an ordered alphabet Sigma, and a number m <= n, the problem is to find a pattern p in Sigma^m which is not a subword of t and which maximizes the sum of Hamming distances between p and all subwords of t of length m. We present an optimal algorithm for the external inverse pattern matching problem which substantially improves the only known polynomial-time algorithm introduced by Amir, Apostolico and Lewenstein. Moreover we discuss a fast parallel implementation of our algorithm on the CREW PRAM model.
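The objective can be made concrete with an exhaustive baseline: enumerate Sigma^m, discard patterns occurring in t, and maximize the summed Hamming distance. This brute force is exponential in m and is only meant to pin down the definition, not to reflect the optimal algorithm above; names and the toy inputs are ours:

```python
from itertools import product

def total_hamming(p, t):
    # Sum of Hamming distances between pattern p and every length-|p| subword of t.
    m = len(p)
    return sum(sum(a != b for a, b in zip(p, t[i:i + m]))
               for i in range(len(t) - m + 1))

def external_inverse_match(t, m, alphabet):
    # Brute force over Sigma^m, restricted to patterns absent from t.
    subwords = {t[i:i + m] for i in range(len(t) - m + 1)}
    candidates = ("".join(c) for c in product(alphabet, repeat=m))
    best = max((p for p in candidates if p not in subwords),
               key=lambda p: total_hamming(p, t))
    return best, total_hamming(best, t)

print(external_inverse_match("abab", 2, "ab"))
```

For t = "abab" the subwords are {"ab", "ba"}, so the candidates are "aa" and "bb", each at total Hamming distance 3.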
Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions
We revisit the problem of estimating the profile (also known as the rarity)
in the data stream model. Given a sequence of elements from a universe of
size n, its profile is a vector phi whose i-th entry phi_i represents
the number of distinct elements that appear in the stream exactly i times. A
classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which
estimates any entry phi_i up to an additive error of eps D, where D is the
number of distinct elements in the stream. In this paper, we considerably
improve on this
result by designing an algorithm which simultaneously estimates many
coordinates of the profile vector up to small overall error. We give an
algorithm which, with constant probability, produces an estimated profile
whose space and estimation error trade off in two regimes.
In addition to bounding the error across multiple coordinates, our space
bounds separate the terms that depend on the universe size from those that
depend on the remaining problem parameters. We prove matching lower bounds
on space in both regimes.
Applying our profile estimation algorithm gives small-error estimates of
several symmetric functions of frequencies. This generalizes space-optimal
algorithms for the distinct elements problem to other problems, including
estimating the Huber and Tukey losses as well as frequency cap
statistics. Comment: To appear in ITCS 202
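The profile vector, and why it captures symmetric functions of frequencies, can be shown in a few lines. This exact, fully-aggregated computation is ours for illustration (the paper's point is estimating phi in small space); any statistic of the form sum_i g(i) * phi_i, such as the distinct count or a frequency cap statistic, is read off directly:

```python
from collections import Counter

def profile(stream):
    # phi[i] = number of distinct elements appearing exactly i times.
    freq = Counter(stream)
    return dict(Counter(freq.values()))

stream = ["x", "y", "x", "z", "x", "y"]       # x: 3 times, y: 2, z: 1
phi = profile(stream)                          # {3: 1, 2: 1, 1: 1}
distinct = sum(phi.values())                   # distinct elements: sum_i phi_i
capped = sum(min(i, 2) * c for i, c in phi.items())   # cap statistic with T = 2
print(phi, distinct, capped)
```
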
Cross-Sender Bit-Mixing Coding
Scheduling to avoid packet collisions is a long-standing challenge in
networking, and has become even trickier in wireless networks with multiple
senders and multiple receivers. In fact, researchers have proved that even {\em
perfect} scheduling can only achieve R = O(1/ln N). Here N
is the number of nodes in the network, and R is the {\em medium
utilization rate}. Ideally, one would hope to achieve R = Theta(1),
while avoiding all the complexities in scheduling. To this end, this paper
proposes {\em cross-sender bit-mixing coding} ({\em BMC}), which does not rely
on scheduling. Instead, users transmit simultaneously on suitably-chosen slots,
and the amount of overlap in different users' slots is controlled via coding.
We prove that in all possible network topologies, using BMC enables us to
achieve R = Theta(1). We also prove that the space and time
complexities of BMC encoding/decoding are all low-order polynomials. Comment: Published in the International Conference on Information Processing
in Sensor Networks (IPSN), 201
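To see why uncoordinated transmission wastes capacity without coding, consider a toy slotted random-access model (this is plain slotted ALOHA, not BMC, and all names and parameters are ours): each sender transmits in each slot with probability p, and a slot is useful only if exactly one sender transmits. With p = 1/N the useful fraction tends to 1/e, a constant loss that BMC avoids by decoding overlapping transmissions rather than discarding them:

```python
import random

def simulate_utilization(n_senders, n_slots, p, trials=2000, seed=0):
    # Fraction of slots with exactly one transmitter (collision-free slots).
    rng = random.Random(seed)
    useful = 0
    total = 0
    for _ in range(trials):
        for _ in range(n_slots):
            transmitters = sum(rng.random() < p for _ in range(n_senders))
            useful += (transmitters == 1)
            total += 1
    return useful / total

print(simulate_utilization(n_senders=20, n_slots=10, p=1 / 20))
```
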
Deterministic Sampling and Range Counting in Geometric Data Streams
We present memory-efficient deterministic algorithms for constructing
epsilon-nets and epsilon-approximations of streams of geometric data. Unlike
probabilistic approaches, these deterministic samples provide guaranteed bounds
on their approximation factors. We show how our deterministic samples can be
used to answer approximate online iceberg geometric queries on data streams. We
use these techniques to approximate several robust statistics of geometric data
streams, including Tukey depth, simplicial depth, regression depth, the
Theil-Sen estimator, and the least median of squares. Our algorithms use only a
polylogarithmic amount of memory, provided the desired approximation factors
are inverse-polylogarithmic. We also include a lower bound for non-iceberg
geometric queries. Comment: 12 pages, 1 figure
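In one dimension, a deterministic eps-approximation for half-line ranges is just a quantile sketch: keep every floor(eps * n)-th point of the sorted order, and every range count is reproduced within additive eps. This minimal sketch (names ours) illustrates the guarantee the paper extends to richer geometric range spaces in streaming:

```python
def eps_approx_halflines(points, eps):
    # Deterministic eps-approximation for ranges (-inf, x] in 1-D:
    # keep every floor(eps * n)-th point of the sorted order.
    pts = sorted(points)
    step = max(1, int(eps * len(pts)))
    return pts[step - 1::step]

def range_error(points, sample, x):
    # Discrepancy between full set and sample on the range (-inf, x].
    full = sum(p <= x for p in points) / len(points)
    est = sum(s <= x for s in sample) / len(sample)
    return abs(full - est)

points = list(range(100))
sample = eps_approx_halflines(points, 0.1)    # 10 of 100 points kept
print(len(sample), max(range_error(points, sample, x) for x in points))
```

Unlike a random sample, this guarantee holds with certainty, which is exactly the "guaranteed bounds" property the abstract emphasizes.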
Interval Selection in the Streaming Model
A set of intervals is independent when the intervals are pairwise disjoint.
In the interval selection problem we are given a set I of intervals
and we want to find an independent subset of intervals of largest cardinality.
Let alpha(I) denote the cardinality of an optimal solution. We
discuss the estimation of alpha(I) in the streaming model, where we
only have one-time, sequential access to the input intervals, the endpoints of
the intervals lie in {1, ..., n}, and the amount of memory is
constrained.
For intervals of different sizes, we provide an algorithm in the data stream
model that computes an estimate alpha' of alpha(I) that, with
probability at least 2/3, satisfies (1/2)(1-eps) alpha(I) <= alpha' <= alpha(I).
For same-length intervals, we provide another algorithm in the data stream
model that computes an estimate alpha' of alpha(I) that, with probability at
least 2/3, satisfies (2/3)(1-eps) alpha(I) <= alpha' <= alpha(I). The space
used by our algorithms is bounded by a polynomial in 1/eps and log n. We also
show that no better estimations can be achieved using o(n) bits of storage.
We also develop new, approximate solutions to the interval selection problem,
where we want to report a feasible solution, that use O(alpha(I))
space. Our algorithms for the interval selection problem match the optimal
results by Emek, Halld{\'o}rsson and Ros{\'e}n [Space-Constrained Interval
Selection, ICALP 2012], but are much simpler. Comment: Minor correction
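Offline, alpha(I) itself is computed exactly by the classic greedy rule: sort by right endpoint and take each interval disjoint from the last one chosen. This sketch (ours, with closed intervals) is the ground truth that the streaming estimators above approximate under memory constraints:

```python
def interval_selection(intervals):
    # Greedy by earliest right endpoint; optimal for maximum independent
    # (pairwise-disjoint) subsets of intervals on a line.
    chosen = []
    last_end = float("-inf")
    for lo, hi in sorted(intervals, key=lambda iv: iv[1]):
        if lo > last_end:          # closed intervals: disjoint means no shared point
            chosen.append((lo, hi))
            last_end = hi
    return chosen

ivs = [(1, 3), (2, 5), (4, 7), (6, 9), (8, 10)]
print(len(interval_selection(ivs)))   # alpha = 3: (1,3), (4,7), (8,10)
```

The greedy pass needs the whole input sorted, which is exactly what the one-pass, bounded-memory streaming model disallows.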