Optimal Algorithms and Lower Bounds for Testing Closeness of Structured Distributions

Author: Diakonikolas Ilias
Kane Daniel M.
Nikishkin Vladimir
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/08/2015

We give a general unified method that can be used for $L_1$ closeness testing of a wide range of univariate structured distribution families. More specifically, we design a sample optimal and computationally efficient algorithm for testing the equivalence of two unknown (potentially arbitrary) univariate distributions under the $\mathcal{A}_k$-distance metric: Given sample access to distributions with density functions $p, q: I \to \mathbb{R}$, we want to distinguish between the cases that $p = q$ and $\|p-q\|_{\mathcal{A}_k} \ge \epsilon$ with probability at least $2/3$. We show that for any $k \ge 2$, $\epsilon > 0$, the optimal sample complexity of the $\mathcal{A}_k$-closeness testing problem is $\Theta(\max\{ k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2 \})$. This is the first $o(k)$ sample algorithm for this problem, and yields new, simple $L_1$ closeness testers, in most cases with optimal sample complexity, for broad classes of structured distributions.

Comment: 27 pages, to appear in FOCS'1
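
To see how the two terms of the stated bound trade off, here is a minimal Python sketch that evaluates them with all constant factors dropped. The function names are hypothetical and are not taken from the paper; the code only does arithmetic on the formula in the abstract and is not the testing algorithm. Setting the two terms equal gives a crossover at $k = \epsilon^{-8/3}$, which the sketch uses to report the dominant term.

```python
def closeness_sample_bound(k, eps):
    """Evaluate max{k^(4/5)/eps^(6/5), k^(1/2)/eps^2} with constants dropped.

    This is only the order of growth from the abstract, not an exact
    sample count: the Theta(.) hides constant factors.
    """
    if k < 2 or eps <= 0:
        raise ValueError("the bound is stated for k >= 2 and eps > 0")
    term_a = k ** 0.8 / eps ** 1.2   # k^(4/5) / eps^(6/5)
    term_b = k ** 0.5 / eps ** 2     # k^(1/2) / eps^2
    return max(term_a, term_b)


def dominant_term(k, eps):
    """Name the larger term; the two are equal when k = eps^(-8/3)."""
    crossover = eps ** (-8.0 / 3.0)
    return "k^(4/5)/eps^(6/5)" if k > crossover else "k^(1/2)/eps^2"


if __name__ == "__main__":
    for k in (100, 10_000):
        print(k, dominant_term(k, 0.1), round(closeness_sample_bound(k, 0.1)))
```

For a fixed $\epsilon$, the first term takes over once $k$ exceeds $\epsilon^{-8/3}$, and it grows as $k^{4/5}$, which is where the sublinear dependence on $k$ comes from.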

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Recovering Structured Probability Matrices

Author: Huang Qingqing
Kakade Sham M.
Kong Weihao
Valiant Gregory
Publication date: 01/01/2018

We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient algorithm that accurately recovers the underlying M by M matrix using Theta(M) samples. This result easily translates to Theta(M) sample algorithms for learning topic models and learning hidden Markov models. These linear sample complexities are optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Omega(M) samples. We provide an even stronger lower bound where distinguishing whether a sequence of observations was drawn from the uniform distribution over M observations versus being generated by an HMM with two hidden states requires Omega(M) observations. This precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear-sample estimators for quantities such as the entropy rate of HMMs.
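
The observation model in this abstract can be illustrated with a short Python sketch. It only generates the "counts" matrix from a given B and does not implement the recovery algorithm. The helper names, the rank-1 example matrix, and the choice of 2M samples are assumptions made for illustration.

```python
import random


def rank_one_matrix(p):
    """Build the M x M probability matrix B[i][j] = p[i] * p[j] (rank 1)."""
    return [[pi * pj for pj in p] for pi in p]


def sample_counts(B, n_samples, seed=0):
    """Draw n_samples independent cells (i, j) with probability B[i][j]
    and return the M x M matrix of how often each cell was drawn."""
    M = len(B)
    rng = random.Random(seed)
    cells = [(i, j) for i in range(M) for j in range(M)]
    weights = [B[i][j] for i, j in cells]
    counts = [[0] * M for _ in range(M)]
    for i, j in rng.choices(cells, weights=weights, k=n_samples):
        counts[i][j] += 1
    return counts


M = 20
p = [1.0 / M] * M                            # hypothetical uniform marginal
B = rank_one_matrix(p)
counts = sample_counts(B, n_samples=2 * M)   # a number of samples linear in M
empty_cells = sum(row.count(0) for row in counts)
print(f"{empty_cells} of {M * M} cells are empty after {2 * M} samples")
```

With a number of samples linear in M, nearly all of the M^2 cells stay at zero, which is the sparse data regime the abstract refers to: the raw counts alone say little about individual entries, so any recovery has to exploit structure such as low rank.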

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Testing Shape Restrictions of Discrete Distributions

Author: Canonne Clément L.
Diakonikolas Ilias
Gouleakis Themistoklis
Rubinfeld Ronitt
Publication venue: Dagstuhl Publishing
Publication date: 01/01/2016

We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the goal is to distinguish between $D \in \mathcal{P}$ and $\ell_1(D, \mathcal{P}) > \epsilon$. We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly-tight upper and lower bounds for the corresponding questions.

arXiv.org e-Print Archive

DSpace@MIT

Edinburgh Research Explorer

Dagstuhl Research Online Publication Server