Marvels and Pitfalls of the Langevin Algorithm in Noisy High-Dimensional Inference
Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work, we carry out an analytic study of the performance of the algorithm most commonly considered in physics, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked mixed matrix-tensor model. The typical behavior of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution we compare with that of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is suboptimal with respect to the one given by AMP. This phenomenon is due to the residual glassiness present in that region of parameters. We also present a simple heuristic expression for the transition line, which appears to be in agreement with the numerical results.
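As a purely illustrative aid, a single discretized Langevin step on a spherically constrained energy could be sketched as below; the function name, the Euler discretization, the step size, and the spherical projection are assumptions made here for concreteness, whereas the paper analyzes the continuous-time dynamics analytically through the Langevin state evolution.

```python
import numpy as np

def langevin_step(x, grad_energy, dt=0.01, temperature=1.0, rng=None):
    """One Euler-discretized Langevin update: a gradient step plus Gaussian noise,
    followed by projection back onto the sphere |x|^2 = N (a common convention for
    spiked matrix-tensor models; an assumption here, not taken from the paper)."""
    rng = rng or np.random.default_rng()
    n = x.size
    noise = rng.standard_normal(n)
    x_new = x - dt * grad_energy(x) + np.sqrt(2.0 * temperature * dt) * noise
    return x_new * np.sqrt(n) / np.linalg.norm(x_new)
```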
Thresholds of descending algorithms in inference problems
We review recent works on analyzing the dynamics of gradient-based algorithms in a prototypical statistical inference problem. Using methods and insights from the physics of glassy systems, these works showed how to understand quantitatively and qualitatively the performance of gradient-based algorithms. Here we review the key results and their interpretation, in non-technical terms accessible to a wide audience of physicists, in the context of related works.
Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape
Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However, very little is known about the extent to which SGD is crucial to the success of this technology and, in particular, how effective it is in optimizing high-dimensional non-convex cost functions as compared to other optimization algorithms such as Gradient Descent (GD). In this work we leverage dynamical mean-field theory to analyze its performance exactly in the high-dimensional limit. We consider the problem of recovering a hidden high-dimensional non-linearly encrypted signal, a prototype of a hard high-dimensional non-convex optimization problem. We compare the performance of SGD with that of GD and show that SGD largely outperforms GD. In particular, a power-law fit of the relaxation time of these algorithms shows that the recovery threshold for SGD with small batch size is smaller than the corresponding one for GD.
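To make the comparison concrete, here is a minimal sketch contrasting a full-batch GD step with a mini-batch SGD step on a generic per-sample loss; the function names, signatures, and batch size are illustrative assumptions, and the paper derives the dynamics analytically via dynamical mean-field theory rather than from such simulations.

```python
import numpy as np

def gd_step(x, grad_loss, samples, lr=0.01):
    """Full-batch gradient descent: one step along the gradient of the full empirical loss."""
    return x - lr * grad_loss(x, samples)

def sgd_step(x, grad_loss, samples, batch_size=32, lr=0.01, rng=None):
    """Mini-batch SGD: one step along the gradient estimated on a random subsample;
    the extra sampling noise is the ingredient the abstract credits for the better
    recovery threshold at small batch size."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(samples), size=batch_size, replace=False)
    return x - lr * grad_loss(x, samples[idx])
```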
Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models
Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well-defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have a strong negative direction towards the global minima, a phenomenon connected to a BBP-type threshold in the Hessian describing the critical points of the landscape.
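For readers unfamiliar with the terminology, gradient flow is the continuous-time limit of gradient descent; a naive forward-Euler integration of it might look as follows (the function name, step size, and stopping rule are assumptions for illustration and are not part of the paper's analytic framework).

```python
import numpy as np

def integrate_gradient_flow(x0, grad_loss, dt=1e-3, t_max=100.0):
    """Forward-Euler integration of dx/dt = -grad_loss(x) up to time t_max.
    Illustrative only: the paper studies this flow with closed-form equations
    from statistical physics, not by numerical integration."""
    x = np.array(x0, dtype=float)
    for _ in range(int(t_max / dt)):
        x = x - dt * grad_loss(x)
    return x
```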
Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability to find good minima instead of being trapped in spurious ones remains, to a large extent, an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements to the input dimension is small, the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable, developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.
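The BBP-type instability mentioned above can be stated concretely: a spurious critical point stops trapping the dynamics once its Hessian acquires a negative eigenvalue. A minimal check of that condition is sketched below; obtaining the Hessian itself is model-dependent and not addressed here.

```python
import numpy as np

def has_escape_direction(hessian, tol=1e-10):
    """Return True if the Hessian at a critical point has a negative eigenvalue,
    i.e. an unstable direction along which gradient flow can escape.
    np.linalg.eigvalsh returns eigenvalues of a symmetric matrix in ascending order."""
    return np.linalg.eigvalsh(hessian)[0] < -tol
```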
Online stochastic gradient descent on non-convex losses from high-dimensional inference
Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional.

We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss, which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension, and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining non-trivial correlation, the descent is rapid and exhibits law-of-large-numbers-type behaviour.

We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, parameter estimation for generalized linear models, spiked matrix models, and spiked tensor models, as well as supervised learning for single-layer networks with general activation functions.
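As a point of reference for the terminology, "online SGD" means each fresh sample is used for exactly one gradient step and then discarded, so the iteration count and the sample size coincide; a minimal sketch under that assumption is given below (the function name and signature are illustrative, not taken from the paper).

```python
import numpy as np

def online_sgd(theta0, grad_loss, sample_stream, lr=0.01):
    """Online SGD: one gradient step per fresh sample, each sample used exactly once.
    `grad_loss(theta, y)` is an assumed per-sample gradient of the chosen loss."""
    theta = np.array(theta0, dtype=float)
    for y in sample_stream:
        theta = theta - lr * grad_loss(theta, y)
    return theta
```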