92 research outputs found
Towards Optimal Moment Estimation in Streaming and Distributed Models
One of the oldest problems in the data stream model is to approximate the p-th moment ||X||_p^p = sum_{i=1}^n X_i^p of an underlying non-negative vector X in R^n, which is presented as a sequence of poly(n) updates to its coordinates. Of particular interest is when p in (0,2]. Although a tight space bound of Theta(epsilon^-2 log n) bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is O(epsilon^-2 log n) bits, while the lower bound is only Omega(epsilon^-2 + log n) bits. Recently, an upper bound of O~(epsilon^-2 + log n) bits was obtained under the assumption that the updates arrive in a random order.
We show that for p in (0, 1], the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of O~(epsilon^-2 + log n) bits for estimating |X |_p^p. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for p in (1,2], in the natural coordinator and blackboard distributed communication topologies, there is an O~(epsilon^-2) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an O~(epsilon^2 log d) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an Omega(epsilon^-2 log n) bit lower bound for p in (1,2] for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter
An Optimal Algorithm for Triangle Counting in the Stream
We present a new algorithm for approximating the number of triangles in a graph G whose edges arrive as an arbitrary order stream. If m is the number of edges in G, T the number of triangles, ?_E the maximum number of triangles which share a single edge, and ?_V the maximum number of triangles which share a single vertex, then our algorithm requires space:
O?(m/T?(?_E + ?{?_V}))
Taken with the ?((m ?_E)/T) lower bound of Braverman, Ostrovsky, and Vilenchik (ICALP 2013), and the ?((m ?{?_V})/T) lower bound of Kallaugher and Price (SODA 2017), our algorithm is optimal up to log factors, resolving the complexity of a classic problem in graph streaming
Approximating Language Edit Distance Beyond Fast Matrix Multiplication: Ultralinear Grammars Are Where Parsing Becomes Hard!
In 1975, a breakthrough result of L. Valiant showed that parsing context free grammars can be reduced to Boolean matrix multiplication, resulting in a running time of O(n^omega) for parsing where omega <= 2.373 is the exponent of fast matrix multiplication, and n is the string length. Recently, Abboud, Backurs and V. Williams (FOCS 2015) demonstrated that this is likely optimal; moreover, a combinatorial o(n^3) algorithm is unlikely to exist for the general parsing problem. The language edit distance problem is a significant generalization of the parsing problem, which computes the minimum edit distance of a given string (using insertions, deletions, and substitutions) to any valid string in the language, and has received significant attention both in theory and practice since the seminal work of Aho and Peterson in 1972. Clearly, the lower bound for parsing rules out any algorithm running in o(n^omega) time that can return a nontrivial multiplicative approximation of the language edit distance problem. Furthermore, combinatorial algorithms with cubic running time or algorithms that use fast matrix multiplication are often not desirable in practice.
To break this n^omega hardness barrier, in this paper we study additive approximation algorithms for language edit distance. We provide two explicit combinatorial algorithms to obtain a string with minimum edit distance with performance dependencies on either the number of non-linear productions, k^*, or the number of nested non-linear production, k, used in the optimal derivation. Explicitly, we give an additive O(k^*gamma) approximation in time O(|G|(n^2 + (n/gamma)^3)) and an additive O(k gamma) approximation in time O(|G|(n^2 + (n^3/gamma^2))), where |G| is the grammar size and n is the string length. In particular, we obtain tight approximations for an important subclass of context free grammars known as ultralinear grammars, for which k and k^* are naturally bounded. Interestingly, we show that the same conditional lower bound for parsing context free grammars holds for the class of ultralinear grammars as well, clearly marking the boundary where parsing becomes hard
Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed
stream, where the probability of inclusion of an item in the sample is
proportional to a weight associated with the item. The unweighted version,
where all weights are equal, is well studied, and admits tight upper and lower
bounds on message complexity. For weighted sampling with replacement, there is
a simple reduction to unweighted sampling with replacement. However, in many
applications the stream has only a few heavy items which may dominate a random
sample when chosen with replacement. Weighted sampling \textit{without
replacement} (weighted SWOR) eludes this issue, since such heavy items can be
sampled at most once.
In this work, we present the first message-optimal algorithm for weighted
SWOR from a distributed stream. Our algorithm also has optimal space and time
complexity. As an application of our algorithm for weighted SWOR, we derive the
first distributed streaming algorithms for tracking \textit{heavy hitters with
residual error}. Here the goal is to identify stream items that contribute
significantly to the residual stream, once the heaviest items are removed.
Residual heavy hitters generalize the notion of heavy hitters and are
important in streams that have a skewed distribution of weights. In addition to
the upper bound, we also provide a lower bound on the message complexity that
is nearly tight up to a factor. Finally, we use our weighted
sampling algorithm to improve the message complexity of distributed
tracking, also known as count tracking, which is a widely studied problem in
distributed streaming. We also derive a tight message lower bound, which closes
the message complexity of this fundamental problem.Comment: To appear in PODS 201
Metric Clustering and MST with Strong and Weak Distance Oracles
We study optimization problems in a metric space where we
can compute distances in two ways: via a ''strong'' oracle that returns exact
distances , and a ''weak'' oracle that returns distances
which may be arbitrarily corrupted with some probability. This
model captures the increasingly common trade-off between employing both an
expensive similarity model (e.g. a large-scale embedding model), and a less
accurate but cheaper model. Hence, the goal is to make as few queries to the
strong oracle as possible. We consider both so-called ''point queries'', where
the strong oracle is queried on a set of points and
returns for all , and ''edge queries'' where it is queried
for individual distances .
Our main contributions are optimal algorithms and lower bounds for clustering
and Minimum Spanning Tree (MST) in this model. For -centers, -median, and
-means, we give constant factor approximation algorithms with only
strong oracle point queries, and prove that queries
are required for any bounded approximation. For edge queries, our upper and
lower bounds are both . Surprisingly, for the MST problem
we give a approximation algorithm using no strong oracle
queries at all, and a matching lower bound. We
empirically evaluate our algorithms, and show that their quality is comparable
to that of the baseline algorithms that are given all true distances, but while
querying the strong oracle on only a small fraction () of points
- …