Phase Transitions in the Pooled Data Problem
In this paper, we study the pooled data problem of identifying the labels
associated with a large collection of items, based on a sequence of pooled
tests revealing the counts of each label within the pool. In the noiseless
setting, we identify an exact asymptotic threshold on the required number of
tests with optimal decoding, and prove a phase transition between complete
success and complete failure. In addition, we present a novel noisy variation
of the problem, and provide an information-theoretic framework for
characterizing the required number of tests for general random noise models.
Our results reveal that noise can make the problem considerably more difficult,
with strict increases in the scaling laws even at low noise levels. Finally, we
demonstrate similar behavior in an approximate recovery setting, where a given
number of errors is allowed in the decoded labels.
Comment: Accepted to NIPS 201
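The noiseless observation model described above can be illustrated with a small simulation. The sketch below is our own toy setup (the item count, label count, and pool choice are illustrative, not from the paper): each item carries one of d labels, and a pooled test on a subset of items reveals only the count of each label within that pool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: p items, each carrying one of d labels (illustrative sizes).
p, d = 100, 3
labels = rng.integers(0, d, size=p)

def pooled_test(pool_indices):
    """Noiseless pooled test: the count of each label within the pool."""
    return np.bincount(labels[pool_indices], minlength=d)

# One random pool of half the items; the test reveals d counts that
# sum to the pool size, but not which item has which label.
pool = rng.choice(p, size=p // 2, replace=False)
counts = pooled_test(pool)
print(counts)
```

The decoding problem the paper studies is the inverse task: recover the full label vector from a sequence of such count observations.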
Fundamental limits of symmetric low-rank matrix estimation
We consider the high-dimensional inference problem where the signal is a
low-rank symmetric matrix which is corrupted by an additive Gaussian noise.
Given a probabilistic model for the low-rank matrix, we compute the limit in
the large dimension setting for the mutual information between the signal and
the observations, as well as the matrix minimum mean square error, while the
rank of the signal remains constant. We also show that our model extends beyond
the particular case of additive Gaussian noise and we prove a universality
result connecting the community detection problem to our Gaussian framework. We
unify and generalize a number of recent works on PCA, sparse PCA, submatrix
localization or community detection by computing the information-theoretic
limits for these problems in the high noise regime. In addition, we show that
the posterior distribution of the signal given the observations is
characterized by a parameter of the same dimension as the square of the rank of
the signal (i.e. scalar in the case of rank one). Finally, we connect our work
with the hard but detectable conjecture in statistical physics.
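For intuition, the rank-one instance of this observation model can be simulated directly. The sketch below is our own illustration (the Rademacher prior, the variable names, and the `snr` parameter are assumptions for the example, not the paper's notation): a symmetric low-rank signal is observed through additive symmetric Gaussian noise, and PCA estimates the spike via the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-one spiked model (illustrative): signal x x^T, Rademacher x,
# observed at signal-to-noise ratio `snr` through symmetric Gaussian noise.
n, snr = 500, 2.0
x = rng.choice([-1.0, 1.0], size=n)
noise = rng.normal(size=(n, n))
noise = (noise + noise.T) / np.sqrt(2)          # symmetric Gaussian noise
Y = np.sqrt(snr / n) * np.outer(x, x) + noise   # low-rank signal + noise

# PCA estimate: squared overlap of the top eigenvector with the spike,
# normalized to [0, 1]; it is bounded away from 0 when snr > 1.
eigvals, eigvecs = np.linalg.eigh(Y)
v = eigvecs[:, -1]
overlap = (v @ x) ** 2 / n
print(round(overlap, 3))
```

The paper's information-theoretic results characterize, among other things, the best achievable mean square error in this kind of model, against which spectral estimates like the one above can be compared.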
Testing Conditional Independence of Discrete Distributions
We study the problem of testing \emph{conditional independence} for discrete
distributions. Specifically, given samples from a discrete random variable
$(X, Y, Z)$ on domain $[\ell_1] \times [\ell_2] \times [n]$, we want to distinguish,
with probability at least $2/3$, between the case that $X$ and $Y$ are
conditionally independent given $Z$ from the case that $(X, Y, Z)$ is
$\epsilon$-far, in $\ell_1$-distance, from every distribution that has this
property. Conditional independence is a concept of central importance in
probability and statistics with a range of applications in various scientific
domains. As such, the statistical task of testing conditional independence has
been extensively studied in various forms within the statistics and
econometrics communities for nearly a century. Perhaps surprisingly, this
problem has not been previously considered in the framework of distribution
property testing and in particular no tester with sublinear sample complexity
is known, even for the important special case that the domains of $X$ and $Y$
are binary.
The main algorithmic result of this work is the first conditional
independence tester with {\em sublinear} sample complexity for discrete
distributions over $[\ell_1] \times [\ell_2] \times [n]$. To complement our upper
bounds, we prove information-theoretic lower bounds establishing that the
sample complexity of our algorithm is optimal, up to constant factors, for a
number of settings. Specifically, for the prototypical setting when
$\ell_1 = \ell_2 = 2$, we show that the sample complexity of testing conditional
independence (upper bound and matching lower bound) is
\[
\Theta\left({\max\left(n^{1/2}/\epsilon^2,\min\left(n^{7/8}/\epsilon,n^{6/7}/\epsilon^{8/7}\right)\right)}\right)\,.
\]
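The displayed bound can be evaluated numerically to see which term dominates in different regimes. The helper below is our own illustration (the function name is hypothetical); it returns the argument of the $\Theta(\cdot)$ above.

```python
# Evaluate max(n^{1/2}/eps^2, min(n^{7/8}/eps, n^{6/7}/eps^{8/7})),
# the argument of the Theta(.) in the displayed sample-complexity bound.
def ci_sample_complexity(n, eps):
    term1 = n ** 0.5 / eps ** 2
    term2 = min(n ** (7 / 8) / eps, n ** (6 / 7) / eps ** (8 / 7))
    return max(term1, term2)

# For very small eps the eps^{-2} term dominates; for moderate eps
# the min(...) term does.
print(ci_sample_complexity(10**6, 0.1))
```

Note that all three exponents of $n$ are strictly below 1, so the tester is sublinear in the domain size $n$ for any fixed $\epsilon$.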