16,256 research outputs found
Learning Mixtures of Gaussians in High Dimensions
Efficiently learning mixture of Gaussians is a fundamental problem in
statistics and learning theory. Given samples coming from a random one out of k
Gaussian distributions in Rn, the learning problem asks to estimate the means
and the covariance matrices of these Gaussians. This learning problem arises in
many areas ranging from the natural sciences to the social sciences, and has
also found many machine learning applications. Unfortunately, learning mixture
of Gaussians is an information theoretically hard problem: in order to learn
the parameters up to a reasonable accuracy, the number of samples required is
exponential in the number of Gaussian components in the worst case. In this
work, we show that provided we are in high enough dimensions, the class of
Gaussian mixtures is learnable in its most general form under a smoothed
analysis framework, where the parameters are randomly perturbed from an
adversarial starting point. In particular, given samples from a mixture of
Gaussians with randomly perturbed parameters, when n > {\Omega}(k^2), we give
an algorithm that learns the parameters with polynomial running time and using
polynomial number of samples. The central algorithmic ideas consist of new ways
to decompose the moment tensor of the Gaussian mixture by exploiting its
structural properties. The symmetries of this tensor are derived from the
combinatorial structure of higher order moments of Gaussian distributions
(sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop
new tools for bounding smallest singular values of structured random matrices,
which could be useful in other smoothed analysis settings
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much of recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity
however has received relatively less attention, especially in the setting when
the sample size is fixed, and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche but only the latter regime applies to exa-scale data
dimension. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that are of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks
Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions
We prove theoretical guarantees for an averaging-ensemble of randomly projected Fisher linear discriminant classifiers, focusing on the casewhen there are fewer training observations than data dimensions. The specific form and simplicity of this ensemble permits a direct and much more detailed analysis than existing generic tools in previous works. In particular, we are able to derive the exact form of the generalization error of our ensemble, conditional on the training set, and based on this we give theoretical guarantees which directly link the performance of the ensemble to that of the corresponding linear discriminant learned in the full data space. To the best of our knowledge these are the first theoretical results to prove such an explicit link for any classifier and classifier ensemble pair. Furthermore we show that the randomly projected ensemble is equivalent to implementing a sophisticated regularization scheme to the linear discriminant learned in the original data space and this prevents overfitting in conditions of small sample size where pseudo-inverse FLD learned in the data space is provably poor. Our ensemble is learned from a set of randomly projected representations of the original high dimensional data and therefore for this approach data can be collected, stored and processed in such a compressed form. We confirm our theoretical findings with experiments, and demonstrate the utility of our approach on several datasets from the bioinformatics domain and one very high dimensional dataset from the drug discovery domain, both settings in which fewer observations than dimensions are the norm
Neural Connectivity with Hidden Gaussian Graphical State-Model
The noninvasive procedures for neural connectivity are under questioning.
Theoretical models sustain that the electromagnetic field registered at
external sensors is elicited by currents at neural space. Nevertheless, what we
observe at the sensor space is a superposition of projected fields, from the
whole gray-matter. This is the reason for a major pitfall of noninvasive
Electrophysiology methods: distorted reconstruction of neural activity and its
connectivity or leakage. It has been proven that current methods produce
incorrect connectomes. Somewhat related to the incorrect connectivity
modelling, they disregard either Systems Theory and Bayesian Information
Theory. We introduce a new formalism that attains for it, Hidden Gaussian
Graphical State-Model (HIGGS). A neural Gaussian Graphical Model (GGM) hidden
by the observation equation of Magneto-encephalographic (MEEG) signals. HIGGS
is equivalent to a frequency domain Linear State Space Model (LSSM) but with
sparse connectivity prior. The mathematical contribution here is the theory for
high-dimensional and frequency-domain HIGGS solvers. We demonstrate that HIGGS
can attenuate the leakage effect in the most critical case: the distortion EEG
signal due to head volume conduction heterogeneities. Its application in EEG is
illustrated with retrieved connectivity patterns from human Steady State Visual
Evoked Potentials (SSVEP). We provide for the first time confirmatory evidence
for noninvasive procedures of neural connectivity: concurrent EEG and
Electrocorticography (ECoG) recordings on monkey. Open source packages are
freely available online, to reproduce the results presented in this paper and
to analyze external MEEG databases
- …