54 research outputs found

    Testing +/- 1-Weight Halfspaces

    We consider the problem of testing whether a Boolean function $f:\{-1,1\}^n \to \{-1,1\}$ is a ±1-weight halfspace, i.e. a function of the form $f(x) = \mathrm{sgn}(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)$ where the weights $w_i$ take values in $\{-1,1\}$. We show that the complexity of this problem is markedly different from the problem of testing whether $f$ is a general halfspace with arbitrary weights. While the latter can be done with a number of queries that is independent of $n$ [7], to distinguish whether $f$ is a ±1-weight halfspace versus $\epsilon$-far from all such halfspaces we prove that nonadaptive algorithms must make $\Omega(\log n)$ queries. We complement this lower bound with a sublinear upper bound showing that $O(\sqrt{n}\cdot \mathrm{poly}(1/\epsilon))$ queries suffice.
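As a rough illustration of the definitions in this abstract (not the paper's testing algorithm), the sketch below evaluates a halfspace and computes, by brute force over a tiny cube, how far a function is from the nearest ±1-weight halfspace. The helper names and the tie rule sgn(0) = 1 are assumptions made for the example; the exhaustive search takes time exponential in n and is only meant for toy sizes.

```python
from itertools import product

def sgn(t):
    # Assumed convention: sgn(0) = 1 (the abstract does not fix a tie rule).
    return 1 if t >= 0 else -1

def halfspace(w):
    """Return f(x) = sgn(w_1 x_1 + ... + w_n x_n) for a fixed weight vector w."""
    return lambda x: sgn(sum(wi * xi for wi, xi in zip(w, x)))

def distance_to_pm1_halfspaces(f, n):
    """Fraction of inputs on which f disagrees with the closest ±1-weight halfspace.

    Exhaustive over all 2^n weight vectors and all 2^n inputs: toy sizes only.
    """
    cube = list(product((-1, 1), repeat=n))
    best = 1.0
    for w in product((-1, 1), repeat=n):
        g = halfspace(w)
        disagreements = sum(f(x) != g(x) for x in cube)
        best = min(best, disagreements / len(cube))
    return best

majority = halfspace((1, 1, 1))   # a ±1-weight halfspace on 3 bits
dictator = lambda x: x[0]         # a general halfspace with weights (1, 0, 0)

print(distance_to_pm1_halfspaces(majority, 3))  # 0.0
print(distance_to_pm1_halfspaces(dictator, 3))  # 0.25
```

The dictator function is a halfspace with arbitrary weights, yet it disagrees with every ±1-weight halfspace on at least a quarter of the 3-bit inputs, which is the kind of gap the testing problem asks an algorithm to detect.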

    Testing k-wise independent distributions

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 119-123). A probability distribution over $\{0,1\}^n$ is k-wise independent if its restriction to any k coordinates is uniform. More generally, a discrete distribution $D$ over $\Sigma_1 \times \cdots \times \Sigma_n$ is called (non-uniform) k-wise independent if for any subset of k indices $\{i_1, \ldots, i_k\}$ and for any $z_1 \in \Sigma_{i_1}, \ldots, z_k \in \Sigma_{i_k}$, $\Pr_{X \sim D}[X_{i_1} \cdots X_{i_k} = z_1 \cdots z_k] = \Pr_{X \sim D}[X_{i_1} = z_1] \cdots \Pr_{X \sim D}[X_{i_k} = z_k]$. k-wise independent distributions look random "locally" to an observer of only k coordinates, even though they may be far from random "globally". Because of this key feature, k-wise independent distributions are important concepts in probability, complexity, and algorithm design. In this thesis, we study the problem of testing (non-uniform) k-wise independent distributions over product spaces. For the problem of distinguishing k-wise independent distributions supported on the Boolean cube from those that are $\delta$-far in statistical distance from any k-wise independent distribution, we upper bound the number of required samples by $O(n^k/\delta^2)$ and lower bound it by $\Omega(n^{(k-1)/2}/\delta)$ (these bounds hold for constant k, and essentially the same bounds hold for general k). To achieve these bounds, we use novel Fourier analysis techniques to relate a distribution's statistical distance from k-wise independence to its biases, a measure of the parity imbalance it induces on a set of variables. The relationships we derive are tighter than previously known, and may be of independent interest. We then generalize our results to distributions over larger domains. For the uniform case we show an upper bound on the distance between a distribution $D$ and k-wise independent distributions in terms of the sum of Fourier coefficients of $D$ at vectors of weight at most k. For the non-uniform case, we give a new characterization of distributions being k-wise independent and further show that such a characterization is robust based on our results for the uniform case. Our results yield natural testing algorithms for k-wise independence with time and sample complexity sublinear in terms of the support size of the distribution when k is a constant. The main technical tools employed include discrete Fourier transform and the theory of linear systems of congruences. by Ning Xie. Ph.D.
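To make the notion of bias concrete, here is a minimal sketch for the uniform Boolean case only. It relies on the standard fact that a distribution over $\{0,1\}^n$ is k-wise independent exactly when every nonempty set of at most k coordinates has zero bias. The distribution is given explicitly as a dictionary, so this is an exact check on a known distribution rather than the sample-based tester studied in the thesis; the function names and the tolerance parameter are assumptions made for the example.

```python
from itertools import combinations

def bias(dist, subset):
    """E[(-1)^(sum of x_i for i in subset)] under dist.

    dist maps bit tuples to probabilities (assumed to sum to 1).
    """
    return sum(p * (-1) ** sum(x[i] for i in subset) for x, p in dist.items())

def is_k_wise_independent(dist, n, k, tol=1e-12):
    """True if every nonempty set of at most k coordinates has zero bias."""
    return all(
        abs(bias(dist, s)) <= tol
        for size in range(1, k + 1)
        for s in combinations(range(n), size)
    )

# Uniform distribution over the four 3-bit strings of even parity.
even_parity = {
    (0, 0, 0): 0.25,
    (0, 1, 1): 0.25,
    (1, 0, 1): 0.25,
    (1, 1, 0): 0.25,
}

print(is_k_wise_independent(even_parity, 3, 2))  # True
print(is_k_wise_independent(even_parity, 3, 3))  # False
```

The even-parity distribution looks uniform on any two coordinates but is fully determined on all three, which matches the "locally random, globally far from random" behaviour the abstract describes.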

    Some Communication Complexity Results and their Applications

    Communication Complexity represents one of the premier techniques for proving lower bounds in theoretical computer science. Lower bounds on communication problems can be leveraged to prove lower bounds in several different areas. In this work, we study three different communication complexity problems. The lower bounds for these problems have applications in circuit complexity, wireless sensor networks, and streaming algorithms. First, we study the multiparty pointer jumping problem. We present the first nontrivial upper bound for this problem. We also provide a suite of strong lower bounds under several restricted classes of protocols. Next, we initiate the study of several non-monotone functions in the distributed functional monitoring setting and provide several lower bounds. In particular, we give a generic adversarial technique and show that when deletions are allowed, no nontrivial protocol is possible. Finally, we study the Gap-Hamming-Distance problem and give tight lower bounds for protocols that use a constant number of messages. As a result, we take a well-known lower bound for one-pass streaming algorithms for a host of problems and extend it so it applies to streaming algorithms that use a constant number of passes

    Analyzing massive datasets with missing entries: models and algorithms

    We initiate a systematic study of computational models to analyze algorithms for massive datasets with missing or erased entries and study the relationship of our models with existing algorithmic models for large datasets. We focus on algorithms whose inputs are naturally represented as functions, codewords, or graphs. First, we generalize the property testing model, one of the most widely studied models of sublinear-time algorithms, to account for the presence of adversarially erased function values. We design efficient erasure-resilient property testing algorithms for several fundamental properties of real-valued functions such as monotonicity, Lipschitz property, convexity, and linearity. We then investigate the problems of local decoding and local list decoding of codewords containing erasures. We show that, in some cases, these problems are strictly easier than the corresponding problems of decoding codewords containing errors. Moreover, we use this understanding to show a separation between our erasure-resilient property testing model and the (error) tolerant property testing model. The philosophical message of this separation is that errors occurring in large datasets are, in general, harder to deal with, than erasures. Finally, we develop models and notions to reason about algorithms that are intended to run on large graphs with missing edges. While running algorithms on large graphs containing several missing edges, it is desirable to output solutions that are close to the solutions output when there are no missing edges. With this motivation, we define average sensitivity, a robustness metric for graph algorithms. We discuss various useful features of our definition and design approximation algorithms with good average sensitivity bounds for several optimization problems on graphs. We also define a model of erasure-resilient sublinear-time graph algorithms and design an efficient algorithm for testing connectivity of graphs

    Online Learning in Dynamically Changing Environments

    We study the problem of online learning and online regret minimization when samples are drawn from a general unknown non-stationary process. We introduce the concept of a dynamic changing process with cost $K$, where the conditional marginals of the process can vary arbitrarily, but the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (up to a $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\mathsf{VC}(\mathcal{H})\log T})$ for the expected worst case regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected misclassification loss). We then improve this bound for general mixable losses, by establishing a tight (up to a $\log^3 T$ factor) regret bound $O(K\cdot\mathsf{VC}(\mathcal{H})\log^3 T)$. We extend these results to general smooth adversary processes with unknown reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the distribution blind (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Doctor of Philosophy

    Dissertation. The contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, core sets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as $k$ distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider graph MAXCUT and correlation clustering problems and develop sampling based approaches to construct truly sublinear ($o(n)$) sized coresets for graphs that have polynomial (i.e., $n^{\delta}$ for any $\delta > 0$) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples. 
Firstly, we present a sublinear sized core set to approximate the value of the MAX CUT in a graph to a $(1+\epsilon)$ factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problems in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set, $k$-centers and $k$-slab problem. 
We also present protocols for fundamental matrix analysis problems: We provide an improved protocol for rectangular matrix problems, which in turn can be used to verify $k$ (approximate) eigenvectors of an $n \times n$ integer matrix $A$. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space.