Semantic and influence aware k-representative queries over social streams
Ministry of Education, Singapore under its Academic Research Funding Tier
Tight Bounds on the Round Complexity of the Distributed Maximum Coverage Problem
We study the maximum $k$-set coverage problem in the following distributed
setting. A collection of sets $S_1, \ldots, S_m$ over a universe $[n]$ is
partitioned across $p$ machines and the goal is to find $k$ sets whose union
covers the largest number of elements. The computation proceeds in synchronous
rounds. In each round, all machines simultaneously send a message to a central
coordinator who then communicates back to all machines a summary to guide the
computation for the next round. At the end, the coordinator outputs the answer.
The main measures of efficiency in this setting are the approximation ratio of
the returned solution, the communication cost of each machine, and the number
of rounds of computation.
Our main result is an asymptotically tight bound on the tradeoff between
these measures for the distributed maximum coverage problem. We first show that
any $r$-round protocol for this problem either incurs a communication cost of
$k \cdot m^{\Omega(1/r)}$ or only achieves an approximation factor of
$k^{\Omega(1/r)}$. This implies that any protocol that simultaneously achieves
a good approximation ratio ($O(1)$ approximation) and a good communication cost
($\widetilde{O}(n)$ communication per machine) essentially requires a
logarithmic (in $k$) number of rounds. We complement our lower bound result by
showing that there exists an $r$-round protocol that achieves an
$\frac{e}{e-1}$-approximation (essentially best possible) with a communication
cost of $k \cdot m^{O(1/r)}$, as well as an $r$-round protocol that achieves a
$k^{O(1/r)}$-approximation with only $\widetilde{O}(n)$ communication per
machine (essentially best possible).
We further use our results in this distributed setting to obtain new bounds
for the maximum coverage problem in two other main models of computation for
massive datasets, namely, the dynamic streaming model and the MapReduce model.
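For orientation, the centralized baseline for maximum coverage is the classical greedy algorithm, which is known to cover at least a (1 - 1/e) fraction of what the best choice of k sets covers. The sketch below is that textbook greedy only, not the distributed protocol described in the abstract; the function name `greedy_max_coverage` and the toy instance are illustrative assumptions.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time taking the set that covers the most new elements."""
    covered = set()
    chosen = []
    for _ in range(k):
        best_idx, best_gain = None, 0
        for idx, s in enumerate(sets):
            if idx in chosen:
                continue
            gain = len(s - covered)  # number of not-yet-covered elements in s
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx is None:  # no remaining set adds a new element
            break
        chosen.append(best_idx)
        covered |= sets[best_idx]
    return chosen, covered


# Toy instance (hypothetical data, not from the paper).
sets = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6}, {1, 6}]
chosen, covered = greedy_max_coverage(sets, k=2)
```

In a distributed setting the difficulty is that no single machine sees all the sets, so the marginal gains computed in the inner loop are not available locally; that is where the trade-off between rounds, communication, and approximation arises.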
Efficient representative subset selection over sliding windows
Representative subset selection (RSS) is an important tool for users to draw
insights from massive datasets. Existing literature models RSS as the
submodular maximization problem to capture the "diminishing returns" property
of the representativeness of selected subsets, but often only has a single
constraint (e.g., cardinality), which limits its applications in many
real-world problems. To capture the data recency issue and support different
types of constraints, we formulate dynamic RSS in data streams as maximizing
submodular functions subject to general $d$-knapsack constraints (SMDK) over
sliding windows. We propose a \textsc{KnapWindow} framework (KW) for SMDK. KW
utilizes the \textsc{KnapStream} algorithm (KS) for SMDK in append-only streams
as a subroutine. It maintains a sequence of checkpoints and KS instances over
the sliding window. Theoretically, KW is
$\frac{1-\varepsilon}{1+d}$-approximate for SMDK. Furthermore, we propose a
\textsc{KnapWindowPlus} framework (KW$^{+}$) to improve upon KW. KW$^{+}$
builds an index \textsc{SubKnapChk} to manage the checkpoints and KS instances.
\textsc{SubKnapChk} deletes a checkpoint whenever it can be approximated by its
successors. By keeping far fewer checkpoints, KW$^{+}$ achieves higher
efficiency than KW while still guaranteeing a
$\frac{1-\varepsilon'}{2+2d}$-approximate solution for SMDK. Finally, we
evaluate the efficiency and solution quality of KW and KW$^{+}$ on real-world
datasets. The experimental results demonstrate that KW achieves more than two
orders of magnitude speedups over the batch baseline and preserves high-quality
solutions for SMDK over sliding windows. KW$^{+}$ further runs 5-10 times
faster than KW while providing solutions with equivalent or even better
utilities.
Comment: 26 pages, 9 figures, to appear in IEEE Transactions on Knowledge and
Data Engineering (TKDE). 201
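The two ingredients of the problem statement, a submodular utility with diminishing returns and a feasibility test with one budget per knapsack dimension, can be illustrated on a toy instance. This is only a sketch of the problem setup under assumed data; it is not the KnapWindow or KnapStream algorithm, and the helper names (`coverage`, `marginal_gain`, `is_feasible`) are hypothetical.

```python
def coverage(sets, picked):
    """Submodular utility: number of distinct elements covered by the picked sets."""
    covered = set()
    for i in picked:
        covered |= sets[i]
    return len(covered)


def marginal_gain(sets, picked, i):
    """Utility added by item i on top of the already picked items."""
    return coverage(sets, picked + [i]) - coverage(sets, picked)


def is_feasible(costs, picked, budgets):
    """Multi-knapsack check: in every dimension j, total cost stays within budgets[j]."""
    return all(
        sum(costs[i][j] for i in picked) <= budgets[j]
        for j in range(len(budgets))
    )


# Hypothetical instance with two knapsack dimensions.
sets = [{1, 2, 3}, {3, 4}, {4, 5}]
costs = [(2, 1), (1, 1), (1, 3)]
budgets = (3, 3)

gain_small = marginal_gain(sets, [], 1)    # adding item 1 to an empty selection
gain_large = marginal_gain(sets, [0], 1)   # adding item 1 when item 0 is already picked
```

Here `gain_large` is smaller than `gain_small` because item 0 already covers element 3, which is the diminishing-returns behaviour that the submodular model captures.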
Near Optimal Linear Algebra in the Online and Sliding Window Models
We initiate the study of numerical linear algebra in the sliding window
model, where only the most recent $W$ updates in a stream form the underlying
data set. We first introduce a unified row-sampling based framework that gives
randomized algorithms for spectral approximation, low-rank
approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in
the sliding window model, which often use nearly optimal space and achieve
nearly input sparsity runtime. Our algorithms are based on "reverse online"
versions of offline sampling distributions such as (ridge) leverage scores,
sensitivities, and Lewis weights to quantify both the importance and
the recency of a row. Our row-sampling framework rather surprisingly implies
connections to the well-studied online model; our structural results also give
the first sample optimal (up to lower order terms) online algorithm for
low-rank approximation/projection-cost preservation. Using this powerful
primitive, we give online algorithms for column/row subset selection and
principal component analysis that resolve the main open question of Bhaskara
et al. (FOCS 2019). We also give the first online algorithm for
$\ell_1$-subspace embeddings. We further formalize the connection between the
online model and the sliding window model by introducing an additional unified
framework for deterministic algorithms using a merge and reduce paradigm and
the concept of online coresets. Our sampling based algorithms in the
row-arrival online model yield online coresets, giving deterministic algorithms
for spectral approximation, low-rank approximation/projection-cost
preservation, and $\ell_1$-subspace embeddings in the sliding window model that
use nearly optimal space.
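As background for the sampling distributions mentioned above, the offline leverage score of a row a_i of a matrix A is a_i^T (A^T A)^{-1} a_i, and the scores of a full-column-rank matrix sum to its number of columns. The sketch below computes these offline scores for a two-column matrix in plain Python. It does not implement the online or "reverse online" variants from the abstract, which differ in which rows enter the Gram matrix; the function name `leverage_scores_2col` and the sample rows are assumptions for illustration.

```python
def leverage_scores_2col(rows):
    """Leverage scores a_i^T (A^T A)^{-1} a_i for a full-column-rank matrix with two columns."""
    # Entries of the symmetric 2x2 Gram matrix G = A^T A.
    g00 = sum(x * x for x, _ in rows)
    g01 = sum(x * y for x, y in rows)
    g11 = sum(y * y for _, y in rows)
    det = g00 * g11 - g01 * g01
    if det == 0:
        raise ValueError("A^T A is singular; this sketch assumes full column rank")
    # Explicit inverse of G.
    i00, i01, i11 = g11 / det, -g01 / det, g00 / det
    return [x * x * i00 + 2 * x * y * i01 + y * y * i11 for x, y in rows]


# Hypothetical 4 x 2 matrix; the last two rows are duplicates and share their importance.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
scores = leverage_scores_2col(rows)
```

Row sampling with probabilities proportional to such scores keeps rows that are hard to reconstruct from the others, which is why duplicated rows receive lower individual scores.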
Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators
In the adversarially robust streaming model, a stream of elements is
presented to an algorithm and is allowed to depend on the output of the
algorithm at earlier times during the stream. In the classic insertion-only
model of data streams, Ben-Eliezer et al. (PODS 2020, best paper award) show
how to convert a non-robust algorithm into a robust one with a roughly
$1/\varepsilon$ factor overhead. This was subsequently improved to a
$1/\sqrt{\varepsilon}$ factor overhead by Hassidim et al. (NeurIPS 2020, oral
presentation), suppressing logarithmic factors. For general functions the
latter is known to be best-possible, by a result of Kaplan et al. (CRYPTO
2021). We show how to bypass this impossibility result by developing data
stream algorithms for a large class of streaming problems, with no overhead in
the approximation factor. Our class of streaming problems includes the most
well-studied problems such as the $L_2$-heavy hitters problem, $F_p$-moment
estimation, as well as empirical entropy estimation. We substantially improve
upon all prior work on these problems, giving the first optimal dependence on
the approximation factor.
As in previous work, we obtain a general transformation that applies to any
non-robust streaming algorithm and depends on the so-called flip number.
However, the key technical innovation is that we apply the transformation to
what we call a difference estimator for the streaming problem, rather than an
estimator for the streaming problem itself. We then develop the first
difference estimators for a wide range of problems. Our difference estimator
methodology is not only applicable to the adversarially robust model, but to
other streaming models where temporal properties of the data play a central
role. (Abstract shortened to meet arXiv limit.)
Comment: FOCS 202
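The intuition behind the flip number can be illustrated by counting how often a reported value has to be refreshed so that it stays within a (1 ± eps) factor of the true value. The sketch below is a simplified illustration of that idea, not the formal definition used in the literature and not the difference-estimator framework of the abstract; the function name `count_output_changes` and the example stream are assumptions.

```python
def count_output_changes(values, eps):
    """Count how often a reported value must be refreshed to stay within a (1 +/- eps) factor."""
    if not values:
        return 0
    reported = values[0]
    changes = 0
    for v in values[1:]:
        # Refresh only when the true value leaves the band around the last reported value.
        if v > (1 + eps) * reported or v < (1 - eps) * reported:
            reported = v
            changes += 1
    return changes


# Stream length after each of 8 insertions in an insertion-only stream.
stream_lengths = list(range(1, 9))
changes = count_output_changes(stream_lengths, eps=0.5)
```

For a monotone quantity such as the stream length, the number of refreshes grows only logarithmically in the length of the stream, which is why robustification schemes that pay per output change can remain space-efficient in the insertion-only model.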