Stable Weight Decay Regularization
Weight decay is a popular regularization technique for training deep
neural networks. Modern deep learning libraries mainly use L2
regularization as the default implementation of weight decay.
Loshchilov and Hutter demonstrated that L2 regularization is not
identical to weight decay for adaptive gradient methods, such as Adaptive
Momentum Estimation (Adam), and proposed Adam with Decoupled Weight Decay
(AdamW). However, we found that the popular implementations of weight decay,
including L2 regularization and decoupled weight decay, in modern deep
learning libraries usually damage performance. First, L2
regularization is unstable weight decay for all optimizers that use momentum,
such as stochastic gradient descent (SGD). Second, decoupled weight decay is
highly unstable for all adaptive gradient methods. We further propose the
Stable Weight Decay (SWD) method to fix the unstable weight decay problem from
a dynamical perspective. The proposed SWD method makes significant improvements
over L2 regularization and decoupled weight decay in our experiments.
Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can
usually outperform complex Adam variants, which have more hyperparameters.
Comment: 20 pages, 18 figures, Weight Deca
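To make the contrast concrete, the following is a minimal Python sketch (not the paper's code) of the two weight-decay implementations discussed above, written for an SGD-with-momentum step. The hyperparameter names are generic, and the SWD update itself is not reproduced because the abstract does not spell it out.

def sgd_step_l2(param, grad, buf, lr=0.1, momentum=0.9, wd=5e-4):
    # L2 regularization: the decay term is folded into the gradient,
    # so it also accumulates in the momentum buffer.
    grad = grad + wd * param
    buf = momentum * buf + grad
    return param - lr * buf, buf

def sgd_step_decoupled(param, grad, buf, lr=0.1, momentum=0.9, wd=5e-4):
    # Decoupled weight decay (AdamW-style): the decay acts on the weights
    # directly and never enters the momentum buffer.
    buf = momentum * buf + grad
    return param - lr * buf - lr * wd * param, buf

# One step from param=1.0, grad=0.2, empty buffer: the new parameters happen
# to coincide, but the momentum buffers differ, so later steps diverge.
print(sgd_step_l2(1.0, 0.2, 0.0))         # (0.97995, 0.2005)
print(sgd_step_decoupled(1.0, 0.2, 0.0))  # (0.97995, 0.2)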
Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods
We formulate the problem of neural network optimization as Bayesian
filtering, where the observations are the backpropagated gradients. While
neural network optimization has previously been studied using natural gradient
methods, which are closely related to Bayesian inference, those approaches were unable to
recover standard optimizers such as Adam and RMSprop with a root-mean-square
gradient normalizer, instead getting a mean-square normalizer. To recover the
root-mean-square normalizer, we find it necessary to account for the temporal
dynamics of all the other parameters as they are being optimized. The resulting
optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like
behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam
with decoupled weight decay, and has generalisation performance competitive
with SGD.
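For reference, here is a minimal NumPy sketch of the root-mean-square gradient normalizer mentioned above, i.e. the denominator by which Adam and RMSprop divide the step. This is the standard Adam update rather than the AdaBayes derivation, and the variable names are generic.

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (mean-square)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    # Root-mean-square normalizer: divide by sqrt(v_hat), not by v_hat
    # itself (the latter would be the mean-square normalizer).
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

param, m, v = np.array([1.0, -2.0]), 0.0, 0.0
for t in range(1, 4):
    grad = 0.1 * param                     # toy gradient of a quadratic loss
    param, m, v = adam_step(param, grad, m, v, t)
print(param)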
Novel Structured Low-rank algorithm to recover spatially smooth exponential image time series
We propose a structured low rank matrix completion algorithm to recover a
time series of images consisting of a linear combination of exponentials at
every pixel, from under-sampled Fourier measurements. The spatial
smoothness of the exponential parameters is exploited along with the exponential
structure of the time series at every pixel, to derive an annihilation relation
in the k-t domain. This annihilation relation translates into a structured
low rank matrix formed from the samples. We demonstrate the algorithm in
the parameter mapping setting and show significant improvement over
state-of-the-art methods.
Comment: 4 pages, 3 figures, accepted at ISBI 2017, Melbourne, Australia
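As a toy illustration of the annihilation and low-rank structure described above (this is not the paper's reconstruction algorithm, and the exponential parameters below are made up): uniform samples of a sum of R exponentials satisfy a linear-prediction relation, so a Hankel matrix built from them has rank at most R.

import numpy as np

R = 2
n = np.arange(64)
rho = np.array([0.95 * np.exp(1j * 0.3), 0.85 * np.exp(-1j * 0.7)])  # poles
c = np.array([1.0, 0.5])                                             # weights
x = (c[None, :] * rho[None, :] ** n[:, None]).sum(axis=1)            # samples

# Hankel matrix with R + 1 columns built from the samples.
cols = R + 1
H = np.array([x[i:i + cols] for i in range(len(x) - cols)])

print(np.linalg.matrix_rank(H, tol=1e-8))  # prints 2: rank <= R, i.e. low rank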
Xception: Deep Learning with Depthwise Separable Convolutions
We present an interpretation of Inception modules in convolutional neural
networks as being an intermediate step in-between regular convolution and the
depthwise separable convolution operation (a depthwise convolution followed by
a pointwise convolution). In this light, a depthwise separable convolution can
be understood as an Inception module with a maximally large number of towers.
This observation leads us to propose a novel deep convolutional neural network
architecture inspired by Inception, where Inception modules have been replaced
with depthwise separable convolutions. We show that this architecture, dubbed
Xception, slightly outperforms Inception V3 on the ImageNet dataset (which
Inception V3 was designed for), and significantly outperforms Inception V3 on a
larger image classification dataset comprising 350 million images and 17,000
classes. Since the Xception architecture has the same number of parameters as
Inception V3, the performance gains are not due to increased capacity but
rather to a more efficient use of model parameters.
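A minimal PyTorch sketch of the depthwise separable convolution described above (a depthwise convolution followed by a 1x1 pointwise convolution); this is an illustrative module, not the reference Xception implementation, and the channel sizes are arbitrary.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels,
                                   bias=False)
        # Pointwise: 1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
y = DepthwiseSeparableConv(32, 64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])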
You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle
Deep learning achieves state-of-the-art results in many tasks in computer
vision and natural language processing. However, recent works have shown that
deep networks can be vulnerable to adversarial perturbations, which raises
serious concerns about their robustness. Adversarial training, typically
formulated as a robust optimization problem, is an effective way of improving
the robustness of deep networks. A major drawback of existing adversarial
training algorithms is the computational overhead of the generation of
adversarial examples, typically far greater than that of the network training.
This leads to a prohibitive overall computational cost of adversarial
training. In this paper, we show that adversarial training can be cast as a
discrete-time differential game. Through analyzing Pontryagin's Maximal
Principle (PMP) of the problem, we observe that the adversary update is only
coupled with the parameters of the first layer of the network. This inspires us
to restrict most of the forward and back propagation within the first layer of
the network during adversary updates. This effectively reduces the total number
of full forward and backward propagations to only one for each group of
adversary updates. Therefore, we refer to this algorithm as YOPO (You Only
Propagate Once). Numerical experiments demonstrate that YOPO can achieve
comparable defense accuracy with approximately 1/5 to 1/4 of the GPU time of the
projected gradient descent (PGD) algorithm. Our codes are available at
https://github.com/a1600012888/YOPO-You-Only-Propagate-Once.
Comment: Accepted as a conference paper at NeurIPS 2019
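A heavily simplified PyTorch sketch of the restriction described above (not the authors' implementation; see the linked repository for that): the gradient of the loss with respect to the first layer's output is computed once per full pass and then reused, so each adversary update only propagates through the first layer. The function name, step sizes, and the first-layer/rest split are illustrative.

import torch
import torch.nn.functional as F

def yopo_like_perturbation(first_layer, rest, x, y, eps=8/255, step=2/255, n_inner=5):
    delta = torch.zeros_like(x, requires_grad=True)

    # One full forward/backward pass: p is the gradient of the loss with
    # respect to the first layer's output, held fixed afterwards.
    z = first_layer(x + delta)
    z.retain_grad()
    F.cross_entropy(rest(z), y).backward()
    p = z.grad.detach()

    # Adversary updates reuse p and only propagate through the first layer.
    for _ in range(n_inner):
        z = first_layer(x + delta)
        grad_delta, = torch.autograd.grad((p * z).sum(), delta)
        delta = (delta + step * grad_delta.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()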