Statistical Mechanics of High-Dimensional Inference
To model modern large-scale datasets, we need efficient algorithms to infer a
set of unknown model parameters from noisy measurements. What are
fundamental limits on the accuracy of parameter inference, given finite
signal-to-noise ratios, limited measurements, prior information, and
computational tractability requirements? How can we combine prior information
with measurements to achieve these limits? Classical statistics gives incisive
answers to these questions as the measurement density α → ∞. However, these
classical results are not relevant to modern high-dimensional inference
problems, which instead occur at finite α. We formulate and analyze
high-dimensional inference as a
problem in the statistical physics of quenched disorder. Our analysis uncovers
fundamental limits on the accuracy of inference in high dimensions, and reveals
that widely cherished inference algorithms like maximum likelihood (ML) and
maximum a posteriori (MAP) inference cannot achieve these limits. We further
find optimal, computationally tractable algorithms that can achieve these
limits. Intriguingly, in high dimensions, these optimal algorithms become
computationally simpler than MAP and ML, while still outperforming them. For
example, such optimal algorithms can lead to as much as a 20% reduction in the
amount of data to achieve the same performance relative to MAP. Moreover, our
analysis reveals simple relations between optimal high dimensional inference
and low dimensional scalar Bayesian inference, insights into the nature of
generalization and predictive power in high dimensions, information theoretic
limits on compressed sensing, phase transitions in quadratic inference, and
connections to central mathematical objects in convex optimization theory and
random matrix theory.
Comment: See http://ganguli-gang.stanford.edu/pdf/HighDimInf.Supp.pdf for
supplementary material.
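The abstract's claim that MAP inference is suboptimal can already be seen in the scalar Bayesian setting it relates high-dimensional inference to. The sketch below is illustrative only (not the paper's algorithm): it compares the MAP estimator and the posterior mean for denoising a Laplace-distributed signal in Gaussian noise, where the two estimators genuinely differ. The prior scale, noise level, and integration grid are assumptions chosen for the example.

```python
import numpy as np

# Scalar denoising: observe y = s + noise, with s drawn from a Laplace prior.
rng = np.random.default_rng(0)
sigma = 0.5   # noise standard deviation (assumed)
b = 1.0       # Laplace prior scale (assumed)

def map_est(y):
    # MAP under a Laplace prior with Gaussian noise is soft thresholding
    # at lambda = sigma^2 / b (the mode of the posterior).
    lam = sigma**2 / b
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def mmse_est(y):
    # Posterior mean, computed by numerical integration on a grid:
    # p(s | y) is proportional to exp(-(y - s)^2 / (2 sigma^2) - |s| / b).
    grid = np.linspace(-15.0, 15.0, 3001)
    out = np.empty_like(y)
    for i, yi in enumerate(y):
        logp = -(yi - grid) ** 2 / (2 * sigma**2) - np.abs(grid) / b
        w = np.exp(logp - logp.max())          # stabilized posterior weights
        out[i] = np.sum(grid * w) / np.sum(w)  # posterior mean
    return out

s = rng.laplace(scale=b, size=5000)
y = s + sigma * rng.standard_normal(s.shape)
mse_map = float(np.mean((map_est(y) - s) ** 2))
mse_mmse = float(np.mean((mmse_est(y) - s) ** 2))
```

On this toy problem the posterior mean attains lower mean-squared error than MAP, mirroring the abstract's point that the posterior mode is not the accuracy-optimal estimator.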
Statistical Mechanics of Learning: A Variational Approach for Real Data
Using a variational technique, we generalize the statistical physics approach
of learning from random examples to make it applicable to real data. We
demonstrate the validity and relevance of our method by computing approximate
estimators for generalization errors that are based on training data alone.
Comment: 4 pages, 2 figures.
Learning curves for Gaussian process regression: Approximations and bounds
We consider the problem of calculating learning curves (i.e., average
generalization performance) of Gaussian processes used for regression. On the
basis of a simple expression for the generalization error, in terms of the
eigenvalue decomposition of the covariance function, we derive a number of
approximation schemes. We identify where these become exact, and compare with
existing bounds on learning curves; the new approximations, which can be used
for any input space dimension, generally get substantially closer to the truth.
We also study possible improvements to our approximations. Finally, we use a
simple exactly solvable learning scenario to show that there are limits of
principle on the quality of approximations and bounds expressible solely in
terms of the eigenvalue spectrum of the covariance function.
Comment: 25 pages, 10 figures.
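To illustrate the kind of eigenvalue-based approximation this abstract refers to, here is a minimal sketch of a simple decoupled scheme (not necessarily one of the paper's approximations): each eigenfunction of the covariance is treated as learned independently, so its contribution λ_i to the generalization error is suppressed by a factor σ²/(σ² + n λ_i) after n training points. The power-law spectrum and noise level are assumptions for the example.

```python
import numpy as np

def learning_curve(eigvals, n_values, noise_var):
    # Decoupled approximation: eigenfunction i contributes eigvals[i] to the
    # generalization error, shrunk by noise_var / (noise_var + n * eigvals[i])
    # after n training examples.
    return [float(np.sum(eigvals * noise_var / (noise_var + n * eigvals)))
            for n in n_values]

eigvals = 1.0 / np.arange(1, 501) ** 2   # assumed power-law kernel spectrum
noise_var = 0.1                          # assumed observation noise variance
n_values = [0, 1, 10, 100, 1000]
curve = learning_curve(eigvals, n_values, noise_var)
```

At n = 0 the approximation returns the full prior variance (the sum of the eigenvalues), and it decreases monotonically with n, which is the qualitative shape any valid learning-curve approximation must reproduce.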
The Information Complexity of Learning Tasks, their Structure and their Distance
We introduce an asymmetric distance in the space of learning tasks, and a
framework to compute their complexity. These concepts are foundational for the
practice of transfer learning, whereby a parametric model is pre-trained for a
task, and then fine-tuned for another. The framework we develop is
non-asymptotic, captures the finite nature of the training dataset, and allows
distinguishing learning from memorization. It encompasses, as special cases,
classical notions from Kolmogorov complexity, Shannon, and Fisher Information.
However, unlike some of those frameworks, it can be applied to large-scale
models and real-world datasets. Our framework is the first to measure
complexity in a way that accounts for the effect of the optimization scheme,
which is critical in Deep Learning.
Combining predictions from linear models when training and test inputs differ
Methods for combining predictions from different models in a supervised
learning setting must somehow estimate/predict the quality of a model's
predictions at unknown future inputs. Many of these methods (often implicitly)
make the assumption that the test inputs are identical to the training inputs,
which is seldom reasonable. Because they fail to take into account that
prediction will generally be harder for test inputs that did not occur in the
training set, these methods tend to select overly complex models. Based on a novel,
unbiased expression for KL divergence, we propose XAIC and its special case
FAIC as versions of AIC intended for prediction that use different degrees of
knowledge of the test inputs. Both methods substantially differ from and may
outperform all the known versions of AIC even when the training and test inputs
are iid, and are especially useful for deterministic inputs and under covariate
shift. Our experiments on linear models suggest that if the test and training
inputs differ substantially, then XAIC and FAIC predictively outperform AIC,
BIC and several other methods including Bayesian model averaging.
Comment: 12 pages, 2 figures. To appear in Proceedings of the 30th Conference
on Uncertainty in Artificial Intelligence (UAI2014). This version includes
the supplementary material (regularity assumptions, proofs).
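For context on the baseline these variants extend: for a Gaussian linear model with k fitted coefficients, classical AIC can be written as n·log(RSS/n) + 2(k + 1), counting the noise variance as one extra parameter. The sketch below computes standard AIC only (XAIC and FAIC are defined in the paper) to select a polynomial degree; the data-generating model and noise level are assumptions for the example.

```python
import numpy as np

def aic_linear(y, yhat, k):
    # Gaussian AIC for a linear model with k fitted coefficients;
    # the noise variance counts as one additional parameter.
    n = len(y)
    rss = float(np.sum((y - yhat) ** 2))
    return n * np.log(rss / n) + 2 * (k + 1)

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 100)
# Assumed ground truth: a quadratic signal plus Gaussian noise.
y = 1.0 - 0.5 * x + 2.0 * x ** 2 + 0.3 * rng.standard_normal(x.size)

scores = {}
for deg in range(1, 6):
    coef = np.polyfit(x, y, deg)               # degree-deg least-squares fit
    scores[deg] = aic_linear(y, np.polyval(coef, x), deg + 1)
best = min(scores, key=scores.get)
```

Here AIC heavily penalizes the underfit linear model, so the selected degree is at least 2; the abstract's point is that when test inputs differ from training inputs, this classical criterion can still mis-rank models, which XAIC and FAIC address.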