
    Marvels and Pitfalls of the Langevin Algorithm in Noisy High-Dimensional Inference

    Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work, we carry out an analytic study of the performance of the algorithm most commonly considered in physics, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked mixed matrix-tensor model. The typical behavior of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with that of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is suboptimal with respect to the one given by AMP. This phenomenon is due to the residual glassiness present in that region of parameters. We also present a simple heuristic expression for the transition line, which appears to be in agreement with the numerical results.
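
    As a point of reference, the snippet below sketches a generic discretized (Euler-Maruyama) Langevin update for sampling a posterior from the gradient of its log-density. It is a minimal illustration, not the paper's implementation; the step size, the Gaussian target in the usage example, and the function names are assumptions chosen here for concreteness.

```python
import numpy as np

def langevin_step(x, grad_log_post, dt, rng):
    """One Euler-Maruyama step of overdamped Langevin dynamics,
    dx = grad log p(x) dt + sqrt(2 dt) dW, targeting the posterior p."""
    noise = rng.standard_normal(x.shape)
    return x + dt * grad_log_post(x) + np.sqrt(2.0 * dt) * noise

# Illustrative usage: sample from a Gaussian "posterior" N(mu, I),
# whose log-density gradient is -(x - mu).
rng = np.random.default_rng(0)
mu = np.ones(10)
x = rng.standard_normal(10)              # random initialization
for _ in range(10_000):
    x = langevin_step(x, lambda z: -(z - mu), dt=1e-2, rng=rng)
```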

    Thresholds of descending algorithms in inference problems

    We review recent works on analyzing the dynamics of gradient-based algorithms in a prototypical statistical inference problem. Using methods and insights from the physics of glassy systems, these works showed how to understand quantitatively and qualitatively the performance of gradient-based algorithms. Here we review the key results and their interpretation in non-technical terms accessible to a wide audience of physicists, in the context of related works.

    Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape

    Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However, very little is known about the extent to which SGD is crucial to the success of this technology and, in particular, how effective it is in optimizing high-dimensional non-convex cost functions compared to other optimization algorithms such as Gradient Descent (GD). In this work we leverage dynamical mean field theory to analyze its performance exactly in the high-dimensional limit. We consider the problem of recovering a hidden high-dimensional non-linearly encrypted signal, a prototypical high-dimensional non-convex hard optimization problem. We compare the performance of SGD to that of GD and show that SGD largely outperforms GD. In particular, a power-law fit of the relaxation time of these algorithms shows that the recovery threshold for SGD with small batch size is smaller than the corresponding one for GD.
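
    To make the comparison concrete, the sketch below contrasts the two update rules in question: a full-batch Gradient Descent step against a small-batch SGD step. The toy least-squares loss, learning rate, and batch size are placeholder assumptions chosen only to show the update structure; they are not the glassy, non-convex setting analyzed in the paper.

```python
import numpy as np

def gd_step(w, X, y, grad_loss, lr):
    """Gradient Descent: one update computed on the full dataset."""
    return w - lr * grad_loss(w, X, y)

def sgd_step(w, X, y, grad_loss, lr, batch_size, rng):
    """Stochastic Gradient Descent: one update on a random mini-batch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return w - lr * grad_loss(w, X[idx], y[idx])

# Toy quadratic loss 0.5 * ||X w - y||^2 / n with gradient X^T (X w - y) / n.
def grad_loss(w, X, y):
    return X.T @ (X @ w - y) / len(X)

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 50))
w_star = rng.standard_normal(50)
y = X @ w_star
w_gd = np.zeros(50)
w_sgd = np.zeros(50)
for _ in range(500):
    w_gd = gd_step(w_gd, X, y, grad_loss, lr=0.1)
    w_sgd = sgd_step(w_sgd, X, y, grad_loss, lr=0.1, batch_size=32, rng=rng)
```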

    Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models

    Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient flow originating from statistical physics. We show that there is a well-defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have a strong negative direction towards the global minima, a phenomenon connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.
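
    For orientation, the sketch below follows an Euler discretization of gradient flow on a simplified spiked matrix-tensor loss (a rank-one matrix term plus a rank-one order-3 tensor term, with a spherical constraint enforced by projection). The normalizations, noise levels, and step size are illustrative assumptions made here; they are not the exact model or the closed-form analysis used in the paper.

```python
import numpy as np

def grad_spiked_mt(x, Y, T, delta2, delta3):
    """Gradient of a simplified loss ||Y - x x^T||^2 / (4 delta2)
    + ||T - x⊗x⊗x||^2 / (6 delta3), assuming Y and T are symmetric."""
    g2 = -(Y - np.outer(x, x)) @ x / delta2
    R3 = T - np.einsum('i,j,k->ijk', x, x, x)
    g3 = -np.einsum('ijk,j,k->i', R3, x, x) / delta3
    return g2 + g3

def gradient_flow(x0, grad, dt=1e-3, steps=5000):
    """Euler steps of dx/dt = -grad(x), projecting back to the sphere
    ||x|| = sqrt(N) to mimic the spherical constraint."""
    x = x0.copy()
    radius = np.sqrt(len(x))
    for _ in range(steps):
        x = x - dt * grad(x)
        x *= radius / np.linalg.norm(x)
    return x

# Synthetic instance: planted spike x_star plus symmetric Gaussian noise.
rng = np.random.default_rng(2)
n = 20
x_star = rng.standard_normal(n)
x_star *= np.sqrt(n) / np.linalg.norm(x_star)
A = rng.standard_normal((n, n))
Y = np.outer(x_star, x_star) + 0.5 * (A + A.T) / 2
B = rng.standard_normal((n, n, n))
B_sym = sum(B.transpose(p) for p in
            [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]) / 6
T = np.einsum('i,j,k->ijk', x_star, x_star, x_star) + 0.5 * B_sym
x_hat = gradient_flow(rng.standard_normal(n),
                      lambda z: grad_spiked_mt(z, Y, T, delta2=1.0, delta3=1.0))
overlap = abs(x_hat @ x_star) / n          # 1 means perfect alignment with the spike
```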

    Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

    Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability to find good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient-flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements to the input dimension is small, the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable, developing a negative direction toward the signal. Through numerical experiments we show that in this regime the gradient-flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics, we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.
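
    As a concrete illustration of the setup, the sketch below runs a discretized gradient flow (plain gradient descent with a small step) on the standard intensity-based phase-retrieval loss with Gaussian measurement vectors. The loss normalization, the measurement ratio, and the step size are assumptions made here for illustration and are not claimed to match the critical ratio computed in the paper.

```python
import numpy as np

def pr_loss_grad(x, A, y):
    """Phase-retrieval loss (1/4m) * sum_mu (y_mu - (a_mu . x)^2)^2 and its
    gradient; the rows of A are the measurement vectors a_mu."""
    z = A @ x
    r = z ** 2 - y
    loss = 0.25 * np.mean(r ** 2)
    grad = A.T @ (r * z) / len(y)
    return loss, grad

rng = np.random.default_rng(3)
d = 100
alpha = 8.0                                # ratio m/d of measurements to dimension
m = int(alpha * d)
x_star = rng.standard_normal(d)
x_star /= np.linalg.norm(x_star)           # unit-norm signal
A = rng.standard_normal((m, d))
y = (A @ x_star) ** 2                      # phaseless (intensity) measurements

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                     # random initialization
dt = 0.05                                  # Euler step of the gradient flow
for _ in range(3000):
    _, g = pr_loss_grad(x, A, y)
    x -= dt * g
# The global sign of the signal is unrecoverable, so compare up to a flip.
overlap = abs(x @ x_star) / (np.linalg.norm(x) * np.linalg.norm(x_star))
```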

    Online stochastic gradient descent on non-convex losses from high-dimensional inference

    Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss, which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension, and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that, except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining non-trivial correlation, the descent is rapid and exhibits law-of-large-numbers-type behaviour. We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, parameter estimation for generalized linear models, spiked matrix models, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.
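
    To illustrate what "online" means here, the sketch below performs SGD in which every iteration draws one fresh sample and takes a single gradient step on that sample's loss, for a toy phase-retrieval-type single-index model (one of the tasks listed in the abstract). The link function, step size, and number of steps are illustrative assumptions, not the paper's general setting or its sharp thresholds.

```python
import numpy as np

def online_sgd(theta0, draw_sample, grad_single, lr, steps, rng):
    """Online SGD: each iteration uses one fresh, independent sample,
    so no data point is ever revisited."""
    theta = theta0.copy()
    for _ in range(steps):
        a, y = draw_sample(rng)                  # fresh sample each step
        theta -= lr * grad_single(theta, a, y)
    return theta

rng = np.random.default_rng(4)
d = 100
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

# Toy single-index data y = (a . theta_star)^2 (a phase-retrieval-type link).
def draw_sample(rng):
    a = rng.standard_normal(d)
    return a, (a @ theta_star) ** 2

# Gradient of the single-sample loss 0.5 * ((a . theta)^2 - y)^2.
def grad_single(theta, a, y):
    z = a @ theta
    return (z ** 2 - y) * 2.0 * z * a

theta0 = rng.standard_normal(d)
theta0 /= np.linalg.norm(theta0)                 # random start
theta_hat = online_sgd(theta0, draw_sample, grad_single,
                       lr=1e-3, steps=100_000, rng=rng)
overlap = abs(theta_hat @ theta_star) / np.linalg.norm(theta_hat)
```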