On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
Conventional wisdom in deep learning states that increasing depth improves
expressiveness but complicates optimization. This paper suggests that,
sometimes, increasing depth can speed up optimization. The effect of depth on
optimization is decoupled from expressiveness by focusing on settings where
additional layers amount to overparameterization - linear neural networks, a
well-studied model. Theoretical analysis, as well as experiments, show that
here depth acts as a preconditioner which may accelerate convergence. Even on
simple convex problems such as linear regression with $\ell_p$ loss, $p>2$,
gradient descent can benefit from transitioning to a non-convex
overparameterized objective, more than it would from some common acceleration
schemes. We also prove that it is mathematically impossible to obtain the
acceleration effect of overparameterization via gradients of any regularizer.
Comment: Published at the International Conference on Machine Learning (ICML)
201
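The preconditioning claim in the abstract above can be illustrated on the smallest possible case. The sketch below is an illustration under stated assumptions, not the paper's general result: it assumes a depth-2 factorization w = w2 * w1 with a vector w1 and a scalar w2, an $\ell_p$ regression loss with p = 4, and hypothetical function names. One gradient step on (w1, w2) moves the product w by -lr * P @ g plus a term of order lr^2, where g is the ordinary gradient with respect to w and P = w2^2 * I + w1 w1^T acts as a preconditioner.

```python
import numpy as np

def lp_loss_grad(w, X, y, p=4):
    # Gradient of (1/p) * mean(|Xw - y|^p) with respect to w.
    r = X @ w - y
    return X.T @ (np.abs(r) ** (p - 1) * np.sign(r)) / len(y)

def factored_step(w1, w2, X, y, lr, p=4):
    # One gradient descent step on the overparameterized objective L(w2 * w1).
    g = lp_loss_grad(w2 * w1, X, y, p)   # gradient w.r.t. the product w
    new_w1 = w1 - lr * w2 * g            # chain rule: dL/dw1 = w2 * g
    new_w2 = w2 - lr * (g @ w1)          # chain rule: dL/dw2 = g . w1
    return new_w1, new_w2, g

def preconditioner(w1, w2):
    # First-order effect of the factored step on the product w = w2 * w1.
    return w2 ** 2 * np.eye(len(w1)) + np.outer(w1, w1)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w1, w2, lr = np.array([0.3, -0.1, 0.2]), 0.8, 0.01

new_w1, new_w2, g = factored_step(w1, w2, X, y, lr)
product_after = new_w2 * new_w1
first_order = w2 * w1 - lr * preconditioner(w1, w2) @ g
second_order = lr ** 2 * (g @ w1) * w2 * g
print(np.allclose(product_after, first_order + second_order))
```

Because P depends on the current factors, the effective step on w changes as training proceeds, which is the sense in which the extra parameters reshape the optimization without changing the set of representable linear models.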
The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks
Over the past few years, an extensively studied phenomenon in training deep
networks is the implicit bias of gradient descent towards parsimonious
solutions. In this work, we investigate this phenomenon by narrowing our focus
to deep linear networks. Through our analysis, we reveal a surprising "law of
parsimony" in the learning dynamics when the data possesses low-dimensional
structures. Specifically, we show that the evolution of gradient descent
starting from orthogonal initialization only affects a minimal portion of
singular vector spaces across all weight matrices. In other words, the learning
process happens only within a small invariant subspace of each weight matrix,
despite the fact that all weight parameters are updated throughout training.
This simplicity in learning dynamics could have significant implications for
both efficient training and a better understanding of deep networks. First, the
analysis enables us to considerably improve training efficiency by taking
advantage of the low-dimensional structure in learning dynamics. We can
construct smaller, equivalent deep linear networks without sacrificing the
benefits associated with the wider counterparts. Second, it allows us to better
understand deep representation learning by elucidating the linear progressive
separation and concentration of representations from shallow to deep layers. We
also conduct numerical experiments to support our theoretical results. The code
for our experiments can be found at https://github.com/cjyaras/lawofparsimony.
Comment: The first two authors contributed to this work equally; 32 pages, 12
figure
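The invariant-subspace claim in the abstract above can be checked numerically on a toy problem. The following is a minimal sketch under assumptions that are not taken from the abstract: square d x d layers, the loss 0.5 * ||W_L ... W_1 - Phi||_F^2 with a rank-r target Phi, no weight decay, and an initialization W_l = eps * Q_l with Q_l orthogonal. All function names are hypothetical and this is not the authors' code. In this setup, at least d - 2r singular values of every layer stay equal to one shared scalar rho, which follows the recursion rho <- rho * (1 - lr * rho^(2(L-1))).

```python
import numpy as np

def random_orthogonal(d, rng):
    # The Q factor of a Gaussian matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def chain(mats, d):
    # Product M_k ... M_1 for a list given in application order [M_1, ..., M_k].
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out

def train(phi, depth=3, eps=0.5, lr=0.05, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    d = phi.shape[0]
    weights = [eps * random_orthogonal(d, rng) for _ in range(depth)]
    rho = eps  # predicted value of the untouched singular values
    for _ in range(steps):
        err = chain(weights, d) - phi
        grads = []
        for l in range(depth):
            right = chain(weights[:l], d)      # layers applied before layer l
            left = chain(weights[l + 1:], d)   # layers applied after layer l
            grads.append(left.T @ err @ right.T)
        weights = [w - lr * g for w, g in zip(weights, grads)]
        rho = rho * (1.0 - lr * rho ** (2 * (depth - 1)))
    return weights, rho

d, r = 10, 2
rng = np.random.default_rng(1)
u = random_orthogonal(d, rng)[:, :r]
v = random_orthogonal(d, rng)[:, :r]
phi = u @ np.diag([1.0, 0.5]) @ v.T           # rank-r target

weights, rho = train(phi)
for w in weights:
    svals = np.linalg.svd(w, compute_uv=False)
    untouched = int(np.sum(np.isclose(svals, rho, rtol=0.0, atol=1e-8)))
    print(untouched, "singular values equal rho; expected at least", d - 2 * r)
```

Every entry of every weight matrix is updated at each step, yet the nontrivial dynamics are confined to a subspace of dimension at most 2r per layer, which is what makes a smaller equivalent network possible.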
Householder-Absolute Neural Layers For High Variability and Deep Trainability
We propose a new architecture for artificial neural networks called
Householder-absolute neural layers, or Han-layers for short, that use
Householder reflectors as weight matrices and the absolute-value function for
activation. Han-layers, functioning as fully connected layers, are motivated by
recent results on neural-network variability and are designed to increase
activation ratio and reduce the chance of Collapse to Constants. Neural
networks constructed chiefly from Han-layers are called HanNets. By
construction, HanNets enjoy a theoretical guarantee that vanishing or exploding
gradients never occur. We conduct several proof-of-concept experiments. Some
surprising results obtained on styled test problems suggest that, under certain
conditions, HanNets exhibit an unusual ability to produce nearly perfect
solutions unattainable by fully connected networks. Experiments on regression
datasets show that HanNets can significantly reduce the number of model
parameters while maintaining or improving the level of generalization accuracy.
In addition, by adding a few Han-layers into the pre-classification FC-layer of
a convolutional neural network, we are able to quickly improve a
state-of-the-art result on the CIFAR10 dataset. These proof-of-concept results are
sufficient to warrant further studies on HanNets to understand their
capacities and limits, and to exploit their potential in real-world
applications.
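A Han-layer as described in the abstract above can be sketched in a few lines. The abstract specifies only a Householder reflector as the weight matrix and the absolute value as the activation; the parameterization H = I - 2 u u^T / (u^T u), the optional bias b, and the function names below are assumptions made for illustration. The sketch also shows why the gradient guarantee is plausible: away from zero pre-activations, the layer's Jacobian is a diagonal sign matrix times an orthogonal matrix, hence orthogonal.

```python
import numpy as np

def householder(u):
    # Householder reflector H = I - 2 u u^T / (u^T u); H is orthogonal and symmetric.
    u = np.asarray(u, dtype=float)
    return np.eye(len(u)) - 2.0 * np.outer(u, u) / (u @ u)

def han_layer(x, u, b=None):
    # Apply the reflector without forming H, then take the absolute value.
    z = x - 2.0 * u * (u @ x) / (u @ u)
    if b is not None:            # the bias term is an assumption of this sketch
        z = z + b
    return np.abs(z)

def han_layer_jacobian(x, u, b=None):
    # d|z|/dx = diag(sign(z)) @ H, valid where no entry of z is exactly zero.
    z = householder(u) @ x
    if b is not None:
        z = z + b
    return np.diag(np.sign(z)) @ householder(u)

rng = np.random.default_rng(0)
x = rng.standard_normal(6)

# Stack 20 bias-free layers: the norm of the signal is preserved.
h = x.copy()
for _ in range(20):
    h = han_layer(h, rng.standard_normal(6))
print(np.linalg.norm(x), np.linalg.norm(h))

# The Jacobian of a single layer is orthogonal, so backpropagated
# gradients keep their norm through the layer.
J = han_layer_jacobian(x, rng.standard_normal(6))
print(np.allclose(J @ J.T, np.eye(6)))
```

Each layer uses a single vector u of length n instead of an n x n weight matrix, which is consistent with the parameter savings the abstract reports.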