Deterministic Sampling and Range Counting in Geometric Data Streams
We present memory-efficient deterministic algorithms for constructing
epsilon-nets and epsilon-approximations of streams of geometric data. Unlike
probabilistic approaches, these deterministic samples provide guaranteed bounds
on their approximation factors. We show how our deterministic samples can be
used to answer approximate online iceberg geometric queries on data streams. We
use these techniques to approximate several robust statistics of geometric data
streams, including Tukey depth, simplicial depth, regression depth, the
Theil-Sen estimator, and the least median of squares. Our algorithms use only a
polylogarithmic amount of memory, provided the desired approximation factors
are inverse-polylogarithmic. We also include a lower bound for non-iceberg
geometric queries. Comment: 12 pages, 1 figure
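As an illustration of what a deterministic sample with a guaranteed bound looks like, here is a minimal offline sketch for the simplest case: points on a line and interval range counting. It is not the streaming algorithm of the paper; the function names (`build_sample`, `estimate_count`, `exact_count`) are hypothetical, and the construction (keep every k-th point in sorted order, weight each kept point by k) is a textbook-style stand-in chosen because its error bound is easy to check.

```python
from bisect import bisect_left, bisect_right


def build_sample(points, eps):
    """Keep every k-th point in sorted order, with k = max(1, floor(eps * n)).

    Offline, one-dimensional illustration only (assumption: all points fit in memory).
    Returns the sample and the weight k that each sampled point stands for.
    """
    pts = sorted(points)
    k = max(1, int(eps * len(pts)))
    return pts[k - 1::k], k


def exact_count(sorted_points, lo, hi):
    """Number of points p in a sorted list with lo <= p <= hi."""
    return bisect_right(sorted_points, hi) - bisect_left(sorted_points, lo)


def estimate_count(sample, k, lo, hi):
    """Estimate the count in [lo, hi] by weighting each sampled point by k."""
    return k * exact_count(sample, lo, hi)
```

Because the points of any interval occupy a contiguous run of sorted positions, and the sample takes exactly one point per k consecutive positions, the estimate differs from the true count by less than k, which is at most eps * n whenever eps * n >= 1 (and is exact otherwise). The bound holds for every interval, with no failure probability.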
Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space
In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match.
Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|.
This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.
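To see why a cost proportional to |R|+|S|+|T| is plausible, consider only the easier task of checking a given linear classifier against the input (R, S, T). The sketch below is a sequential illustration under stated assumptions, not the paper's MPC algorithm: the function name `is_consistent`, the representation of T as a set of index pairs (i, j), and the classifier form f(r, s) = 1 iff w_r·r + w_s·s >= threshold are all hypothetical choices made for this example.

```python
from bisect import bisect_left


def dot(u, v):
    """Inner product of two equal-length coordinate tuples."""
    return sum(a * b for a, b in zip(u, v))


def is_consistent(R, S, T, w_r, w_s, threshold):
    """Check f(r, s) = [w_r.r + w_s.s >= threshold] against the matching set T.

    T holds index pairs (i, j) meaning R[i] matches S[j]; every other pair
    is a non-match. Runs in O((|R| + |S|) log |S| + |T|) time without
    enumerating R x S. Exact for integer coordinates; floating-point
    inputs may need a tolerance.
    """
    pairs = set(T)
    r_scores = [dot(w_r, r) for r in R]
    s_scores = [dot(w_s, s) for s in S]
    # Every matching pair must be classified as positive.
    for i, j in pairs:
        if r_scores[i] + s_scores[j] < threshold:
            return False
    # Count all positively classified pairs via sorting and binary search.
    s_sorted = sorted(s_scores)
    positives = sum(len(s_sorted) - bisect_left(s_sorted, threshold - a)
                    for a in r_scores)
    # If that count equals |T|, no non-matching pair was classified positive.
    return positives == len(pairs)
```

The key step is the counting argument: once all pairs in T are known to be positive, the classifier is consistent exactly when the total number of positive pairs equals |T|, and that total is a batched range-counting query over sorted scores.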
Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators
In the adversarially robust streaming model, a stream of elements is
presented to an algorithm and is allowed to depend on the output of the
algorithm at earlier times during the stream. In the classic insertion-only
model of data streams, Ben-Eliezer et al. (PODS 2020, best paper award) show
how to convert a non-robust algorithm into a robust one with a roughly
factor overhead. This was subsequently improved to a
factor overhead by Hassidim et al. (NeurIPS 2020, oral
presentation), suppressing logarithmic factors. For general functions the
latter is known to be best-possible, by a result of Kaplan et al. (CRYPTO
2021). We show how to bypass this impossibility result by developing data
stream algorithms for a large class of streaming problems, with no overhead in
the approximation factor. Our class of streaming problems includes the most
well-studied problems such as the -heavy hitters problem, -moment
estimation, as well as empirical entropy estimation. We substantially improve
upon all prior work on these problems, giving the first optimal dependence on
the approximation factor.
As in previous work, we obtain a general transformation that applies to any
non-robust streaming algorithm and depends on the so-called flip number.
However, the key technical innovation is that we apply the transformation to
what we call a difference estimator for the streaming problem, rather than an
estimator for the streaming problem itself. We then develop the first
difference estimators for a wide range of problems. Our difference estimator
methodology is not only applicable to the adversarially robust model, but to
other streaming models where temporal properties of the data play a central
role. (Abstract shortened to meet arXiv limit.) Comment: FOCS 202
Doctor of Philosophy
The contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, coresets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider graph MAX CUT and correlation clustering problems and develop sampling-based approaches to construct truly sublinear () sized coresets for graphs that have polynomial (i.e., for any ) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples.
Firstly, we present a sublinear-sized coreset to approximate the value of the MAX CUT in a graph to a factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values, which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problem in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set, -centers and -slab problem.
We also present protocols for fundamental matrix analysis problems: we provide an improved protocol for rectangular matrix problems, which in turn can be used to verify (approximate) eigenvectors of an integer matrix . In general, our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space.
Combinatorics
Combinatorics is a fundamental mathematical discipline which focuses on the study of discrete objects and their properties. The current workshop brought together researchers from diverse fields such as Extremal and Probabilistic Combinatorics, Discrete Geometry, Graph Theory, Combinatorial Optimization and Algebraic Combinatorics for a fruitful interaction. New results, methods, developments, and future challenges were discussed. This is a report on the meeting containing abstracts of the presentations and a summary of the problem session.