Learning Arbitrary Statistical Mixtures of Discrete Distributions
We study the problem of learning from unlabeled samples very general
statistical mixture models on large finite sets. Specifically, the model to be
learned, ϑ, is a probability distribution over probability distributions p,
where each such p is a probability distribution over [n]. When we sample from
ϑ, we do not observe p directly, but only indirectly and in very noisy fashion,
by sampling from [n] repeatedly, independently K times from the distribution p.
The problem is to infer ϑ to high accuracy in transportation (earthmover)
distance.
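As an illustration of the generative process described above, here is a minimal sketch in Python. The function and variable names are hypothetical, and for simplicity the mixture ϑ is taken to be finitely supported, which the model itself does not require.

```python
import numpy as np

def sample_from_mixture(mixing_weights, component_dists, K, num_snapshots, seed=None):
    """Draw `num_snapshots` noisy observations from a mixture of discrete distributions.

    Each observation is produced by first drawing a component distribution p
    (a distribution over {0, ..., n-1}) according to `mixing_weights`, then
    drawing K i.i.d. samples from that p. Only the K samples are observed,
    never p itself.
    """
    rng = np.random.default_rng(seed)
    component_dists = np.asarray(component_dists)   # shape (num_components, n)
    n = component_dists.shape[1]
    observations = []
    for _ in range(num_snapshots):
        # Latent step: pick a component distribution p (hidden from the learner).
        idx = rng.choice(len(mixing_weights), p=mixing_weights)
        p = component_dists[idx]
        # Observed step: K independent draws from the finite set [n] according to p.
        observations.append(rng.choice(n, size=K, p=p))
    return observations

# Example: a 2-component mixture over a 5-element set, 3 snapshots of K = 4 samples each.
obs = sample_from_mixture([0.6, 0.4],
                          [[0.7, 0.1, 0.1, 0.05, 0.05],
                           [0.05, 0.05, 0.1, 0.1, 0.7]],
                          K=4, num_snapshots=3, seed=0)
print(obs)
```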
We give the first efficient algorithms for learning this mixture model
without making any restricting assumptions on the structure of the distribution
ϑ. We bound the quality of the solution as a function of the size K of the
samples and the number of samples used. Our model and results have
applications to a variety of unsupervised learning scenarios, including
learning topic models and collaborative filtering.
Comment: 23 pages. Preliminary version in the Proceedings of the 47th ACM
Symposium on the Theory of Computing (STOC 2015).
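As a side note on the error measure used above: for distributions over a finite set equipped with a ground metric, the transportation (earthmover) distance can be computed with standard tools. Below is a minimal sketch, assuming for illustration that the points of [n] lie on the real line with unit spacing; this choice of ground metric is ours, not the paper's.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two probability distributions over the same finite set [n] = {0, ..., n-1}.
n = 5
p = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
q = np.array([0.6, 0.15, 0.1, 0.1, 0.05])

# Transportation (earthmover) distance between p and q, treating the points
# 0, ..., n-1 as lying on the real line with unit spacing.
d = wasserstein_distance(np.arange(n), np.arange(n), u_weights=p, v_weights=q)
print(d)
```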
A rigorous analysis of population stratification with limited data
Abstract: Finding the genetic factors of complex diseases such as cancer, currently a major effort of the international community, will potentially lead to better treatment of these diseases. One of the major difficulties in these studies is the fact that the genetic components of an individual depend not only on the disease, but also on the individual's ethnicity. Therefore, it is crucial to find methods that can reduce the effects of population structure on these studies. This can be formalized as a clustering problem, where the individuals are clustered according to their genetic information.

Mathematically, we consider the problem of clustering bit "feature" vectors, where each vector represents the genetic information of an individual. Our model assumes that this bit vector is generated according to a prior probability distribution specified by the individual's membership in a population. We present methods that can cluster the vectors while attempting to optimize the number of features required. The focus of the paper is not on the algorithms, but on showing that optimizing certain objective functions on the data yields the right clustering under the random generative model. In particular, we prove that some of the previous formulations for clustering are effective.

We consider two different clustering approaches. The first approach forms a graph, and then clusters the data using a connected-components algorithm or a max-cut algorithm. The second approach tries to estimate simultaneously the feature frequencies in each of the populations and the classification of vectors into populations. We show that using the first approach, Θ(log N / γ²) data (i.e., total number of features times number of vectors) is sufficient to find the correct classification, where N is the number of vectors of each population and γ is the average ℓ₂² distance between the feature probability vectors of the two populations. Using the second approach, we show that O(log N / α⁴) data is enough, where α is the average ℓ₁ distance between the populations.
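To make the first, graph-based approach concrete, here is a minimal illustrative sketch. The function name is hypothetical, and Hamming distance with a fixed threshold stands in for the similarity criterion (and for the max-cut variant) actually analyzed in the paper.

```python
import numpy as np

def cluster_by_connected_components(vectors, threshold):
    """Sketch of the graph-based approach: connect two individuals whose bit
    vectors are close in Hamming distance, then read off clusters as the
    connected components of the resulting graph.
    """
    vectors = np.asarray(vectors)
    m = len(vectors)
    # Build adjacency lists: edge (i, j) iff Hamming distance is below the threshold.
    adj = [[] for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if np.count_nonzero(vectors[i] != vectors[j]) < threshold:
                adj[i].append(j)
                adj[j].append(i)
    # Label connected components by an iterative depth-first search.
    labels = [-1] * m
    current = 0
    for start in range(m):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = current
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

# Example: four individuals, two clearly separated populations.
print(cluster_by_connected_components(
    [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]], threshold=2))
```

The second approach (jointly estimating per-population feature frequencies and the assignment of vectors to populations) is not sketched here, since its guarantees depend on the specific objective function studied in the paper.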