Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets
Mode connectivity is a surprising phenomenon in the loss landscape of deep
nets. Optima -- at least those discovered by gradient-based optimization --
turn out to be connected by simple paths on which the loss function is almost
constant. Often, these paths can be chosen to be piece-wise linear, with as few
as two segments. We give mathematical explanations for this phenomenon,
assuming generic properties (such as dropout stability and noise stability) of
well-trained deep nets, which have previously been identified as part of
understanding the generalization properties of deep nets. Our explanation holds
for realistic multilayer nets, and experiments are presented to verify the
theory.
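To make the "piece-wise linear, with as few as two segments" claim concrete, here is a minimal sketch that evaluates a loss along a two-segment path between two minimizers. The toy loss, the midpoint choice, and all function names are illustrative assumptions, not the construction used in the paper.

```python
import numpy as np

def two_segment_path(theta_a, theta_mid, theta_b, t):
    """Point at t in [0, 1] on the path theta_a -> theta_mid -> theta_b."""
    if t <= 0.5:
        s = 2.0 * t
        return (1.0 - s) * theta_a + s * theta_mid
    s = 2.0 * t - 1.0
    return (1.0 - s) * theta_mid + s * theta_b

def max_loss_on_path(loss_fn, theta_a, theta_mid, theta_b, num_points=101):
    """Largest loss value seen on an evenly spaced grid along the path."""
    ts = np.linspace(0.0, 1.0, num_points)
    return max(loss_fn(two_segment_path(theta_a, theta_mid, theta_b, t)) for t in ts)

def toy_loss(theta):
    # Hypothetical loss: zero on the unit circle, so the minimizers are connected.
    return float((np.dot(theta, theta) - 1.0) ** 2)

theta_a = np.array([1.0, 0.0])
theta_b = np.array([-1.0, 0.0])

# A straight line is the special case where the midpoint lies on the segment.
straight = max_loss_on_path(toy_loss, theta_a, 0.5 * (theta_a + theta_b), theta_b)
# Bending the path through another low-loss point lowers the barrier.
bent = max_loss_on_path(toy_loss, theta_a, np.array([0.0, 1.0]), theta_b)
print(f"straight-line barrier: {straight:.3f}, two-segment barrier: {bent:.3f}")
```

In this toy case the bent path has a lower barrier than the straight line, but the barrier is not zero; the abstract's claim concerns paths on which the loss is almost constant.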
Landscape connectivity and dropout stability of SGD solutions for over-parameterized neural networks
The optimization of multilayer neural networks typically leads to a solution
with zero training error, yet the landscape can exhibit spurious local minima
and the minima can be disconnected. In this paper, we shed light on this
phenomenon: we show that the combination of stochastic gradient descent (SGD)
and over-parameterization makes the landscape of multilayer neural networks
approximately connected and thus more favorable to optimization. More
specifically, we prove that SGD solutions are connected via a piecewise linear
path, and the increase in loss along this path vanishes as the number of
neurons grows large. This result is a consequence of the fact that the
parameters found by SGD are increasingly dropout stable as the network becomes
wider. We show that, if we remove part of the neurons (and suitably rescale the
remaining ones), the change in loss is independent of the total number of
neurons, and it depends only on how many neurons are left. Our results exhibit
a mild dependence on the input dimension: they are dimension-free for two-layer
networks and depend linearly on the dimension for multilayer networks. We
validate our theoretical findings with numerical experiments for different
architectures and classification tasks.
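The "remove part of the neurons and suitably rescale the remaining ones" operation can be sketched for a two-layer ReLU network as follows. The network form, the 1/n output scaling, and the random (untrained) weights are assumptions made only for illustration; the paper's statement concerns parameters found by SGD.

```python
import numpy as np

def two_layer_net(x, W, a):
    """f(x) = a . relu(W x), with hidden weights W (n x d) and output weights a (n,)."""
    return float(a @ np.maximum(W @ x, 0.0))

def drop_and_rescale(W, a, keep):
    """Keep the first `keep` hidden neurons; rescale output weights by n / keep."""
    n = W.shape[0]
    return W[:keep], a[:keep] * (n / keep)

rng = np.random.default_rng(0)
d, n = 5, 2000
W = rng.standard_normal((n, d))
a = rng.standard_normal(n) / n   # assumed 1/n output scaling
x = rng.standard_normal(d)

full = two_layer_net(x, W, a)
W_half, a_half = drop_and_rescale(W, a, n // 2)
half = two_layer_net(x, W_half, a_half)
print(f"output change after dropping half the neurons: {abs(full - half):.4f}")
```

A network is dropout stable in this sense when the change in loss caused by `drop_and_rescale` is small; the sketch only measures the change in output at a single input.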
When Are Solutions Connected in Deep Networks?
The question of how and why the phenomenon of mode connectivity occurs in
training deep neural networks has gained remarkable attention in the research
community. From a theoretical perspective, two possible explanations have been
proposed: (i) the loss function has connected sublevel sets, and (ii) the
solutions found by stochastic gradient descent are dropout stable. While these
explanations provide insights into the phenomenon, their assumptions are not
always satisfied in practice. In particular, the first approach requires the
network to have one layer with order of $N$ neurons ($N$ being the number of
training samples), while the second one requires the loss to be almost
invariant after removing half of the neurons at each layer (up to some
rescaling of the remaining ones). In this work, we improve both conditions by
exploiting the quality of the features at every intermediate layer together
with a milder over-parameterization condition. More specifically, we show that:
(i) under generic assumptions on the features of intermediate layers, it
suffices that the last two hidden layers have order of $\sqrt{N}$ neurons, and
(ii) if subsets of features at each layer are linearly separable, then no
over-parameterization is needed to show the connectivity. Our experiments
confirm that the proposed condition ensures the connectivity of solutions found
by stochastic gradient descent, even in settings where the previous
requirements do not hold. Comment: Accepted at NeurIPS 202
The Global Landscape of Neural Networks: An Overview
One of the major concerns for neural network training is that the
non-convexity of the associated loss functions may cause a bad landscape. The
recent success of neural networks suggests that their loss landscape is not too
bad, but what specific results do we know about the landscape? In this article,
we review recent findings and results on the global landscape of neural
networks. First, we point out that wide neural nets may have sub-optimal local
minima under certain assumptions. Second, we discuss a few rigorous results on
the geometric properties of wide networks such as "no bad basin", and some
modifications that eliminate sub-optimal local minima and/or decreasing paths
to infinity. Third, we discuss visualization and empirical explorations of the
landscape for practical neural nets. Finally, we briefly discuss some
convergence results and their relation to landscape results. Comment: 16 pages, 8 figures
Flatter, faster: scaling momentum for optimal speedup of SGD
Commonly used optimization algorithms often show a trade-off between good
generalization and fast training times. For instance, stochastic gradient
descent (SGD) tends to have good generalization; however, adaptive gradient
methods have superior training times. Momentum can help accelerate training
with SGD, but so far there has been no principled way to select the momentum
hyperparameter. Here we study training dynamics arising from the interplay
between SGD with label noise and momentum in the training of overparametrized
neural networks. We find that scaling the momentum hyperparameter $1-\beta$
with the learning rate to the power of $2/3$ maximally accelerates training,
without sacrificing generalization. To analytically derive this result we
develop an architecture-independent framework, where the main assumption is the
existence of a degenerate manifold of global minimizers, as is natural in
overparametrized models. Training dynamics display the emergence of two
characteristic timescales that are well-separated for generic values of the
hyperparameters. The maximum acceleration of training is reached when these two
timescales meet, which in turn determines the scaling limit we propose. We
confirm our scaling rule for synthetic regression problems (matrix sensing and
teacher-student paradigm) and classification for realistic datasets (ResNet-18
on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our
scaling rule to variations in architectures and datasets. Comment: v2: expanded introduction section, corrected minor typos. v1: 12+13
pages, 3 figures
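A scaling rule of this kind can be written as a small helper. This is a minimal sketch that takes the scaled quantity to be 1 - beta and the exponent to be 2/3; both are exposed as parameters and treated as assumptions here, and the prefactor `c` is hypothetical, since the abstract does not give one.

```python
def momentum_for_learning_rate(lr, c=1.0, exponent=2.0 / 3.0):
    """Return beta such that 1 - beta = c * lr**exponent (assumed form of the rule)."""
    if lr <= 0:
        raise ValueError("lr must be positive")
    one_minus_beta = c * lr ** exponent
    if not 0.0 < one_minus_beta <= 1.0:
        raise ValueError("c * lr**exponent must lie in (0, 1]")
    return 1.0 - one_minus_beta

# Smaller learning rates are paired with momentum closer to 1.
for lr in (1e-1, 1e-2, 1e-3):
    print(f"lr={lr:g}  beta={momentum_for_learning_rate(lr):.4f}")
```

In practice the prefactor would have to be tuned once for a given model and dataset; the sketch fixes only how beta moves when the learning rate changes.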
Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks
Monotonic linear interpolation (MLI) - on the line connecting a random
initialization with the minimizer it converges to, the loss and accuracy are
monotonic - is a phenomenon that is commonly observed in the training of neural
networks. Such a phenomenon may seem to suggest that optimization of neural
networks is easy. In this paper, we show that the MLI property is not
necessarily related to the hardness of optimization problems, and empirical
observations on MLI for deep neural networks depend heavily on biases. In
particular, we show that interpolating both weights and biases linearly leads
to very different influences on the final output, and when different classes
have different last-layer biases on a deep network, there will be a long
plateau in both the loss and accuracy interpolation (which existing theory of
MLI cannot explain). We also show how the last-layer biases for different
classes can be different even on a perfectly balanced dataset using a simple
model. Empirically we demonstrate that similar intuitions hold on practical
networks and realistic datasets. Comment: ICLR 202
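The MLI check described above amounts to evaluating the loss on the straight line from the initialization to the minimizer and testing for monotonicity. The following is a minimal sketch on a toy quadratic loss; the loss, the parameter vectors, and the function names are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

def interpolation_losses(loss_fn, theta_init, theta_final, num_points=51):
    """Loss at evenly spaced points on the line from theta_init to theta_final."""
    alphas = np.linspace(0.0, 1.0, num_points)
    return np.array([loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas])

def is_monotone_nonincreasing(values, tol=1e-12):
    """True if each value is no larger than the previous one (up to tol)."""
    return bool(np.all(np.diff(values) <= tol))

theta_init = np.zeros(3)
theta_final = np.array([1.0, -2.0, 0.5])   # hypothetical minimizer

def quadratic_loss(theta):
    # Convex toy loss whose unique minimizer is theta_final.
    return float(np.sum((theta - theta_final) ** 2))

losses = interpolation_losses(quadratic_loss, theta_init, theta_final)
print("MLI holds on the toy problem:", is_monotone_nonincreasing(losses))
```

For a real network, `loss_fn` would load the interpolated weights and biases into the model and evaluate it on a dataset; a long plateau would appear as a run of nearly equal entries in `losses`.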
Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks
The optimization of multilayer neural networks typically leads to a solution
with zero training error, yet the landscape can exhibit spurious local minima
and the minima can be disconnected. In this paper, we shed light on this
phenomenon: we show that the combination of stochastic gradient descent (SGD)
and over-parameterization makes the landscape of multilayer neural networks
approximately connected and thus more favorable to optimization. More
specifically, we prove that SGD solutions are connected via a piecewise linear
path, and the increase in loss along this path vanishes as the number of
neurons grows large. This result is a consequence of the fact that the
parameters found by SGD are increasingly dropout stable as the network becomes
wider. We show that, if we remove part of the neurons (and suitably rescale the
remaining ones), the change in loss is independent of the total number of
neurons, and it depends only on how many neurons are left. Our results exhibit
a mild dependence on the input dimension: they are dimension-free for two-layer
networks and depend linearly on the dimension for multilayer networks. We
validate our theoretical findings with numerical experiments for different
architectures and classification tasks. Comment: Proceedings of the 37th International Conference on Machine Learning
(ICML