sets whose union covers the largest number of elements. The computation proceeds in synchronous rounds. In each round, all machines simultaneously send a message to a central coordinator, who then communicates back to all machines a summary to guide the computation for the next round. At the end, the coordinator outputs the answer. The main measures of efficiency in this setting are the approximation ratio of the returned solution, the communication cost of each machine, and the number of rounds of computation. Our main result is an asymptotically tight bound on the tradeoff between these measures for the distributed maximum coverage problem. We first show that any $r$-round protocol for this problem either incurs a communication cost of $k \cdot m^{\Omega(1/r)}$ or only achieves an approximation factor of $k^{\Omega(1/r)}$. This implies that any protocol that simultaneously achieves a good approximation ratio ($O(1)$ approximation) and good communication cost ($\widetilde{O}(n)$ communication per machine) essentially requires a logarithmic (in $k$) number of rounds. We complement our lower bound result by showing that there exists an $r$-round protocol that achieves an $\frac{e}{e-1}$-approximation (essentially best possible) with a communication cost of $k \cdot m^{O(1/r)}$, as well as an $r$-round protocol that achieves a $k^{O(1/r)}$-approximation with only $\widetilde{O}(n)$ communication per machine (essentially best possible). We further use our results in this distributed setting to obtain new bounds for the maximum coverage problem in two other main models of computation for massive datasets, namely, the dynamic streaming model and the MapReduce model.
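For orientation, the $\frac{e}{e-1}$ factor quoted above is the guarantee of the classical centralized greedy algorithm for maximum coverage. The sketch below shows that single-machine baseline only; it is not the distributed protocol from the abstract, and the function names `greedy_max_coverage` and `coverage` are illustrative choices made here.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(sets: Dict[Hashable, Set[int]], k: int) -> List[Hashable]:
    """Pick up to k sets, each time taking the set that covers the most new elements."""
    covered: Set[int] = set()
    chosen: List[Hashable] = []
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, elems in sets.items():
            if name in chosen:
                continue
            gain = len(elems - covered)  # elements this set would newly cover
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining set adds anything new
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen


def coverage(sets: Dict[Hashable, Set[int]], chosen: List[Hashable]) -> int:
    """Number of distinct elements covered by the chosen sets."""
    covered: Set[int] = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)
```

Ties are broken by dictionary insertion order, so the output is deterministic for a given input.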

arXiv.org e-Print Archive

Crossref

Efficient representative subset selection over sliding windows

Authors: LI Yuchen, TAN Kian-Lee, WANG Yanhao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2018

Representative subset selection (RSS) is an important tool for users to draw insights from massive datasets. Existing literature models RSS as a submodular maximization problem to capture the "diminishing returns" property of the representativeness of selected subsets, but often considers only a single constraint (e.g., cardinality), which limits its applicability to many real-world problems. To capture the data recency issue and support different types of constraints, we formulate dynamic RSS in data streams as maximizing submodular functions subject to general $d$-knapsack constraints (SMDK) over sliding windows. We propose a KnapWindow framework (KW) for SMDK. KW utilizes the KnapStream algorithm (KS) for SMDK in append-only streams as a subroutine. It maintains a sequence of checkpoints and KS instances over the sliding window. Theoretically, KW is $\frac{1-\varepsilon}{1+d}$-approximate for SMDK. Furthermore, we propose a KnapWindowPlus framework (KW$^{+}$) to improve upon KW. KW$^{+}$ builds an index, SubKnapChk, to manage the checkpoints and KS instances. SubKnapChk deletes a checkpoint whenever it can be approximated by its successors. By keeping far fewer checkpoints, KW$^{+}$ achieves higher efficiency than KW while still guaranteeing a $\frac{1-\varepsilon'}{2+2d}$-approximate solution for SMDK. Finally, we evaluate the efficiency and solution quality of KW and KW$^{+}$ on real-world datasets. The experimental results demonstrate that KW achieves speedups of more than two orders of magnitude over the batch baseline and preserves high-quality solutions for SMDK over sliding windows. KW$^{+}$ further runs 5-10 times faster than KW while providing solutions with equivalent or even better utilities.

Comment: 26 pages, 9 figures, to appear in IEEE Transactions on Knowledge and Data Engineering (TKDE). 201
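To make the problem statement concrete, the sketch below spells out the two ingredients of SMDK: a submodular utility with diminishing returns, and feasibility under $d$ knapsack constraints. It assumes set coverage as the utility and uses hypothetical names (`utility`, `marginal_gain`, `is_feasible`); it does not implement KnapStream, KnapWindow, or KnapWindowPlus.

```python
from typing import Dict, List, Sequence, Set


def utility(covers: Dict[str, Set[int]], chosen: Sequence[str]) -> int:
    """Coverage utility (an assumed example): distinct elements covered by the chosen items."""
    covered: Set[int] = set()
    for item in chosen:
        covered |= covers[item]
    return len(covered)


def marginal_gain(covers: Dict[str, Set[int]], chosen: Sequence[str], item: str) -> int:
    """Utility added by `item` on top of `chosen`; never grows as `chosen` grows."""
    return utility(covers, list(chosen) + [item]) - utility(covers, chosen)


def is_feasible(costs: Dict[str, List[float]], chosen: Sequence[str],
                budgets: List[float]) -> bool:
    """True if the total cost stays within budget in each of the d cost dimensions."""
    for j, budget in enumerate(budgets):
        if sum(costs[item][j] for item in chosen) > budget:
            return False
    return True
```

With `covers = {"x": {1, 2}, "y": {2, 3}, "z": {3, 4}}`, the gain of adding `"y"` drops from 2 (empty selection) to 1 (after `"x"`) to 0 (after `"x"` and `"z"`), which is the diminishing-returns property the abstract refers to.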

arXiv.org e-Print Archive

Institutional Knowledge at Singapore Management University

Near Optimal Linear Algebra in the Online and Sliding Window Models

Authors: Braverman Vladimir, Drineas Petros, Musco Cameron, Musco Christopher, Upadhyay Jalaj, Woodruff David P., Zhou Samson
Publication date: 19/04/2020

We initiate the study of numerical linear algebra in the sliding window model, where only the most recent $W$ updates in a stream form the underlying data set. We first introduce a unified row-sampling based framework that gives randomized algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input sparsity runtime. Our algorithms are based on "reverse online" versions of offline sampling distributions such as (ridge) leverage scores, $\ell_1$ sensitivities, and Lewis weights to quantify both the importance and the recency of a row. Our row-sampling framework rather surprisingly implies connections to the well-studied online model; our structural results also give the first sample optimal (up to lower order terms) online algorithm for low-rank approximation/projection-cost preservation. Using this powerful primitive, we give online algorithms for column/row subset selection and principal component analysis that resolve the main open question of Bhaskara et al. (FOCS 2019). We also give the first online algorithm for $\ell_1$-subspace embeddings. We further formalize the connection between the online model and the sliding window model by introducing an additional unified framework for deterministic algorithms using a merge and reduce paradigm and the concept of online coresets. Our sampling based algorithms in the row-arrival online model yield online coresets, giving deterministic algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model that use nearly optimal space.
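As a reminder of the quantities involved, the standard (offline) leverage score of a row and one common online variant can be written as below. The third line is an assumed reading of "reverse online", in which a row is scored against the rows that arrive at or after it; the abstract itself does not give the definition, and conventions differ on whether row $i$ is included in the prefix or suffix.

```latex
% Offline leverage score of row a_i of A \in \mathbb{R}^{n \times d}
\tau_i(A) = a_i^{\top} (A^{\top} A)^{+} a_i,
\qquad \sum_{i=1}^{n} \tau_i(A) = \operatorname{rank}(A).

% Online variant: only the prefix A_{1:i} (rows a_1, \dots, a_i) is available
\tau_i^{\mathrm{online}}(A) = a_i^{\top} \bigl(A_{1:i}^{\top} A_{1:i}\bigr)^{+} a_i.

% "Reverse online" variant (assumption): score against the suffix A_{i:n}
\tau_i^{\mathrm{rev}}(A) = a_i^{\top} \bigl(A_{i:n}^{\top} A_{i:n}\bigr)^{+} a_i.
```

Under this reading, a row scores highly when later rows do not already span its direction, which ties importance to recency.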

arXiv.org e-Print Archive

Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators

Authors: Woodruff David P., Zhou Samson
Publication date: 23/11/2021

In the adversarially robust streaming model, a stream of elements is presented to an algorithm and is allowed to depend on the output of the algorithm at earlier times during the stream. In the classic insertion-only model of data streams, Ben-Eliezer et al. (PODS 2020, best paper award) show how to convert a non-robust algorithm into a robust one with a roughly $1/\varepsilon$ factor overhead. This was subsequently improved to a $1/\sqrt{\varepsilon}$ factor overhead by Hassidim et al. (NeurIPS 2020, oral presentation), suppressing logarithmic factors. For general functions the latter is known to be best-possible, by a result of Kaplan et al. (CRYPTO 2021). We show how to bypass this impossibility result by developing data stream algorithms for a large class of streaming problems, with no overhead in the approximation factor. Our class of streaming problems includes the most well-studied problems such as the $L_2$-heavy hitters problem, $F_p$-moment estimation, as well as empirical entropy estimation. We substantially improve upon all prior work on these problems, giving the first optimal dependence on the approximation factor. As in previous work, we obtain a general transformation that applies to any non-robust streaming algorithm and depends on the so-called flip number. However, the key technical innovation is that we apply the transformation to what we call a difference estimator for the streaming problem, rather than an estimator for the streaming problem itself. We then develop the first difference estimators for a wide range of problems. Our difference estimator methodology is not only applicable to the adversarially robust model, but to other streaming models where temporal properties of the data play a central role. (Abstract shortened to meet arXiv limit.)

Comment: FOCS 202
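The abstract does not define a difference estimator. As an assumption for illustration, take it to target the change $F(v+u) - F(v)$ in a statistic when a suffix with frequency vector $u$ is appended to a prefix with frequency vector $v$. For $F_2$, that quantity expands to $\|u\|_2^2 + 2\langle u, v\rangle$. The sketch below computes it exactly rather than approximately, so it shows only the target quantity, not the paper's small-space estimator; the names `f2` and `f2_difference` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def f2(freq: Counter) -> int:
    """Second frequency moment: sum of squared item frequencies."""
    return sum(c * c for c in freq.values())


def f2_difference(prefix: Iterable[int], suffix: Iterable[int]) -> int:
    """Exact F_2(v + u) - F_2(v) for frequency vectors v (prefix) and u (suffix)."""
    v, u = Counter(prefix), Counter(suffix)
    # ||v + u||^2 - ||v||^2 = ||u||^2 + 2 <u, v>; Counter returns 0 for missing keys
    return f2(u) + 2 * sum(u[x] * v[x] for x in u)
```

For the prefix `[1, 1, 2, 3]` and suffix `[1, 4]`, $F_2$ goes from 6 to 12, and `f2_difference` returns that change directly from the two frequency vectors.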

arXiv.org e-Print Archive