78 research outputs found

    Deterministic Sampling and Range Counting in Geometric Data Streams

    Get PDF
    We present memory-efficient deterministic algorithms for constructing epsilon-nets and epsilon-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used to answer approximate online iceberg geometric queries on data streams. We use these techniques to approximate several robust statistics of geometric data streams, including Tukey depth, simplicial depth, regression depth, the Thiel-Sen estimator, and the least median of squares. Our algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries.Comment: 12 pages, 1 figur

    Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

    Get PDF
    In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|. This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join

    Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators

    Full text link
    In the adversarially robust streaming model, a stream of elements is presented to an algorithm and is allowed to depend on the output of the algorithm at earlier times during the stream. In the classic insertion-only model of data streams, Ben-Eliezer et. al. (PODS 2020, best paper award) show how to convert a non-robust algorithm into a robust one with a roughly 1/ε1/\varepsilon factor overhead. This was subsequently improved to a 1/ε1/\sqrt{\varepsilon} factor overhead by Hassidim et. al. (NeurIPS 2020, oral presentation), suppressing logarithmic factors. For general functions the latter is known to be best-possible, by a result of Kaplan et. al. (CRYPTO 2021). We show how to bypass this impossibility result by developing data stream algorithms for a large class of streaming problems, with no overhead in the approximation factor. Our class of streaming problems includes the most well-studied problems such as the L2L_2-heavy hitters problem, FpF_p-moment estimation, as well as empirical entropy estimation. We substantially improve upon all prior work on these problems, giving the first optimal dependence on the approximation factor. As in previous work, we obtain a general transformation that applies to any non-robust streaming algorithm and depends on the so-called flip number. However, the key technical innovation is that we apply the transformation to what we call a difference estimator for the streaming problem, rather than an estimator for the streaming problem itself. We then develop the first difference estimators for a wide range of problems. Our difference estimator methodology is not only applicable to the adversarially robust model, but to other streaming models where temporal properties of the data play a central role. (Abstract shortened to meet arXiv limit.)Comment: FOCS 202

    Doctor of Philosophy

    Get PDF
    dissertationThe contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, core sets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as kk distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider graph MAXCUT and correlation clustering problems and develop sampling based approaches to construct truly sublinear (o(n)o(n)) sized coresets for graphs that have polynomial (i.e., nδn^{\delta} for any δ>0\delta >0) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples. Firstly, we present a sublinear sized core set to approximate the value of the MAX CUT in a graph to a (1+ϵ)(1+\epsilon) factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problems in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/ nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set, kk-centers and kk-slab problem. We also present protocols for fundamental matrix analysis problems: We provide an improved protocol for rectangular matrix problems, which in turn can be used to verify kk (approximate) eigenvectors of an n×nn \times n integer matrix AA. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space
    • …
    corecore