Information-Theoretic and Algorithmic Thresholds for Group Testing
In the group testing problem we aim to identify a small number of infected individuals within a large population. We avail ourselves of a procedure that can test a group of multiple individuals, with the test result coming out positive if and only if at least one individual in the group is infected. With all tests conducted in parallel, what is the least number of tests required to identify the status of all individuals? In a recent test design [Aldridge et al. 2016] the individuals are assigned to test groups randomly, with every individual joining an equal number of groups. We pinpoint the sharp threshold for the number of tests required in this randomised design so that it is information-theoretically possible to infer the infection status of every individual. Moreover, we analyse two efficient inference algorithms. These results settle conjectures from [Aldridge et al. 2014, Johnson et al. 2019].
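As a rough illustration of the setup described in this abstract, the following Python sketch builds a random design in which every individual joins the same number of tests, then computes the test outcomes. The function names, the choice of sampling distinct tests for each individual, and the elimination rule at the end are assumptions made for illustration; the elimination rule is a standard simple heuristic and is not claimed to be one of the two algorithms analysed in the paper.

```python
import random

def constant_weight_design(n, m, delta, rng):
    """Assign each of n individuals to delta distinct tests out of m (illustrative assumption)."""
    return [rng.sample(range(m), delta) for _ in range(n)]

def run_tests(design, infected, m):
    """A test is positive iff it contains at least one infected individual."""
    results = [False] * m
    for individual in infected:
        for t in design[individual]:
            results[t] = True
    return results

def definitely_healthy(design, results):
    """Anyone who appears in at least one negative test cannot be infected."""
    return {i for i, tests in enumerate(design)
            if any(not results[t] for t in tests)}

rng = random.Random(0)
design = constant_weight_design(n=20, m=10, delta=3, rng=rng)
infected = {2, 7}
results = run_tests(design, infected, m=10)
healthy = definitely_healthy(design, results)
```

The individuals not in `healthy` form a superset of the infected set; how many tests are needed before that superset shrinks to the infected set itself is the kind of threshold question the abstract addresses.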
Techniques for clustering gene expression data
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of a suitable method for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.
Recovering Structured Probability Matrices
We consider the problem of accurately recovering a matrix B of size M by M,
which represents a probability distribution over M^2 outcomes, given access to
an observed matrix of "counts" generated by taking independent samples from the
distribution B. How can structural properties of the underlying matrix B be
leveraged to yield computationally efficient and information theoretically
optimal reconstruction algorithms? When can accurate reconstruction be
accomplished in the sparse data regime? This basic problem lies at the core of
a number of questions that are currently being considered by different
communities, including building recommendation systems and collaborative
filtering in the sparse data regime, community detection in sparse random
graphs, learning structured models such as topic models or hidden Markov
models, and the efforts from the natural language processing community to
compute "word embeddings".
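To make the observation model concrete, the sketch below draws independent samples from a small probability matrix B and tallies them into a matrix of counts. It is only an illustration under assumed names (`rank_one_matrix`, `sample_counts`) and a hypothetical rank-1 example; it shows how the data are generated, not the recovery algorithm proposed in the paper.

```python
import random

def rank_one_matrix(p, q):
    """Outer product of two probability vectors: a rank-1 distribution over pairs (i, j)."""
    return [[pi * qj for qj in q] for pi in p]

def sample_counts(B, num_samples, rng):
    """Draw independent (i, j) pairs from B and tally them into an M by M count matrix."""
    M = len(B)
    outcomes = [(i, j) for i in range(M) for j in range(M)]
    weights = [B[i][j] for i, j in outcomes]
    counts = [[0] * M for _ in range(M)]
    for i, j in rng.choices(outcomes, weights=weights, k=num_samples):
        counts[i][j] += 1
    return counts

B = rank_one_matrix([0.5, 0.5], [0.25, 0.75])
counts = sample_counts(B, num_samples=100, rng=random.Random(1))
# Naive empirical estimate of B; it is accurate only when samples are plentiful.
estimate = [[c / 100 for c in row] for row in counts]
```

In the sparse regime discussed here, the number of samples is far smaller than M^2, so most entries of `counts` are zero and the naive estimate above is uninformative on its own.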
Our results apply to the setting where B has a low rank structure. For this
setting, we propose an efficient algorithm that accurately recovers the
underlying M by M matrix using Theta(M) samples. This result easily translates
to Theta(M) sample algorithms for learning topic models and learning hidden
Markov Models. These linear sample complexities are optimal, up to constant
factors, in an extremely strong sense: even testing basic properties of the
underlying matrix (such as whether it has rank 1 or 2) requires Omega(M)
samples. We provide an even stronger lower bound where distinguishing whether a
sequence of observations was drawn from the uniform distribution over M
observations versus being generated by an HMM with two hidden states requires
Omega(M) observations. This precludes sublinear-sample hypothesis tests for
basic properties, such as identity or uniformity, as well as sublinear sample
estimators for quantities such as the entropy rate of HMMs.
Optimal Testing for Planted Satisfiability Problems
We study the problem of detecting planted solutions in a random
satisfiability formula. Adopting the formalism of hypothesis testing in
statistical analysis, we describe the minimax optimal rates of detection. Our
analysis relies on the study of the number of satisfying assignments, for which
we prove new results. We also address algorithmic issues, and give a
computationally efficient test with optimal statistical performance. This
result is compared to an average-case hypothesis on the hardness of refuting
satisfiability of random formulas.