Stable Weight Decay Regularization
Weight decay is a popular regularization technique for training deep
neural networks. Modern deep learning libraries mainly use L2
regularization as the default implementation of weight decay.
Loshchilov and Hutter demonstrated that L2 regularization is not
identical to weight decay for adaptive gradient methods, such as Adaptive
Momentum Estimation (Adam), and proposed Adam with Decoupled Weight Decay
(AdamW). However, we found that the popular implementations of weight decay,
including L2 regularization and decoupled weight decay, in modern deep
learning libraries usually damage performance. First, L2
regularization is unstable weight decay for all optimizers that use momentum,
such as stochastic gradient descent (SGD). Second, decoupled weight decay is
highly unstable for all adaptive gradient methods. We further propose the
Stable Weight Decay (SWD) method to fix the unstable weight decay problem from
a dynamical perspective. The proposed SWD method makes significant improvements
over L2 regularization and decoupled weight decay in our experiments.
Simply fixing weight decay in Adam by SWD, with no extra hyperparameter, can
usually outperform complex Adam variants, which have more hyperparameters.
Comment: 20 pages, 18 figures, Weight Deca
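To make the contrast concrete, the following is a minimal Python sketch (not the paper's code) of the two weight-decay implementations discussed above, written for an SGD-with-momentum step. The hyperparameter names are generic, and the SWD update itself is not reproduced because the abstract does not spell it out.

def sgd_step_l2(param, grad, buf, lr=0.1, momentum=0.9, wd=5e-4):
    # L2 regularization: the decay term is folded into the gradient,
    # so it also accumulates in the momentum buffer.
    grad = grad + wd * param
    buf = momentum * buf + grad
    return param - lr * buf, buf

def sgd_step_decoupled(param, grad, buf, lr=0.1, momentum=0.9, wd=5e-4):
    # Decoupled weight decay (AdamW-style): the decay acts on the weights
    # directly and never enters the momentum buffer.
    buf = momentum * buf + grad
    return param - lr * buf - lr * wd * param, buf

# One step from param=1.0, grad=0.2, empty buffer: the new parameters happen
# to coincide, but the momentum buffers differ, so later steps diverge.
print(sgd_step_l2(1.0, 0.2, 0.0))         # (0.97995, 0.2005)
print(sgd_step_decoupled(1.0, 0.2, 0.0))  # (0.97995, 0.2)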
Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods
We formulate the problem of neural network optimization as Bayesian
filtering, where the observations are the backpropagated gradients. While
neural network optimization has previously been studied using natural gradient
methods, which are closely related to Bayesian inference, those approaches were unable to
recover standard optimizers such as Adam and RMSprop with a root-mean-square
gradient normalizer, instead getting a mean-square normalizer. To recover the
root-mean-square normalizer, we find it necessary to account for the temporal
dynamics of all the other parameters as they are being optimized. The resulting
optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like
behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam
with decoupled weight decay, and has generalisation performance competitive
with SGD.
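For reference, here is a minimal NumPy sketch of the root-mean-square gradient normalizer mentioned above, i.e. the denominator by which Adam and RMSprop divide the step. This is the standard Adam update rather than the AdaBayes derivation, and the variable names are generic.

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (mean-square)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    # Root-mean-square normalizer: divide by sqrt(v_hat), not by v_hat
    # itself (the latter would be the mean-square normalizer).
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

param, m, v = np.array([1.0, -2.0]), 0.0, 0.0
for t in range(1, 4):
    grad = 0.1 * param                     # toy gradient of a quadratic loss
    param, m, v = adam_step(param, grad, m, v, t)
print(param)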
Novel Structured Low-rank algorithm to recover spatially smooth exponential image time series
We propose a structured low rank matrix completion algorithm to recover a
time series of images consisting of a linear combination of exponentials at
every pixel, from under-sampled Fourier measurements. The spatial
smoothness of the exponential parameters is exploited along with the exponential
structure of the time series at every pixel, to derive an annihilation relation
in the k-t domain. This annihilation relation translates into a structured
low rank matrix formed from the samples. We demonstrate the algorithm in
the parameter mapping setting and show significant improvement over
state-of-the-art methods.
Comment: 4 pages, 3 figures, accepted at ISBI 2017, Melbourne, Australia
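As a toy illustration of the annihilation and low-rank structure described above (this is not the paper's reconstruction algorithm, and the exponential parameters below are made up): uniform samples of a sum of R exponentials satisfy a linear-prediction relation, so a Hankel matrix built from them has rank at most R.

import numpy as np

R = 2
n = np.arange(64)
rho = np.array([0.95 * np.exp(1j * 0.3), 0.85 * np.exp(-1j * 0.7)])  # poles
c = np.array([1.0, 0.5])                                             # weights
x = (c[None, :] * rho[None, :] ** n[:, None]).sum(axis=1)            # samples

# Hankel matrix with R + 1 columns built from the samples.
cols = R + 1
H = np.array([x[i:i + cols] for i in range(len(x) - cols)])

print(np.linalg.matrix_rank(H, tol=1e-8))  # prints 2: rank <= R, i.e. low rank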
Xception: Deep Learning with Depthwise Separable Convolutions
We present an interpretation of Inception modules in convolutional neural
networks as being an intermediate step in-between regular convolution and the
depthwise separable convolution operation (a depthwise convolution followed by
a pointwise convolution). In this light, a depthwise separable convolution can
be understood as an Inception module with a maximally large number of towers.
This observation leads us to propose a novel deep convolutional neural network
architecture inspired by Inception, where Inception modules have been replaced
with depthwise separable convolutions. We show that this architecture, dubbed
Xception, slightly outperforms Inception V3 on the ImageNet dataset (which
Inception V3 was designed for), and significantly outperforms Inception V3 on a
larger image classification dataset comprising 350 million images and 17,000
classes. Since the Xception architecture has the same number of parameters as
Inception V3, the performance gains are not due to increased capacity but
rather to a more efficient use of model parameters.
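A minimal PyTorch sketch of the depthwise separable convolution described above (a depthwise convolution followed by a 1x1 pointwise convolution); this is an illustrative module, not the reference Xception implementation, and the channel sizes are arbitrary.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels,
                                   bias=False)
        # Pointwise: 1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
y = DepthwiseSeparableConv(32, 64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])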
You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle
Deep learning achieves state-of-the-art results in many tasks in computer
vision and natural language processing. However, recent works have shown that
deep networks can be vulnerable to adversarial perturbations, which raises
serious concerns about their robustness. Adversarial training, typically
formulated as a robust optimization problem, is an effective way of improving
the robustness of deep networks. A major drawback of existing adversarial
training algorithms is the computational overhead of the generation of
adversarial examples, typically far greater than that of the network training.
This leads to a prohibitive overall computational cost of adversarial
training. In this paper, we show that adversarial training can be cast as a
discrete-time differential game. Through analyzing Pontryagin's Maximal
Principle (PMP) of the problem, we observe that the adversary update is only
coupled with the parameters of the first layer of the network. This inspires us
to restrict most of the forward and back propagation within the first layer of
the network during adversary updates. This effectively reduces the total number
of full forward and backward propagations to only one for each group of
adversary updates. Therefore, we refer to this algorithm as YOPO (You Only
Propagate Once). Numerical experiments demonstrate that YOPO can achieve
comparable defense accuracy with approximately 1/5 to 1/4 of the GPU time of the
projected gradient descent (PGD) algorithm. Our codes are available at
https://github.com/a1600012888/YOPO-You-Only-Propagate-Once.
Comment: Accepted as a conference paper at NeurIPS 2019
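A heavily simplified PyTorch sketch of the restriction described above (not the authors' implementation; see the linked repository for that): the gradient of the loss with respect to the first layer's output is computed once per full pass and then reused, so each adversary update only propagates through the first layer. The function name, step sizes, and the first-layer/rest split are illustrative.

import torch
import torch.nn.functional as F

def yopo_like_perturbation(first_layer, rest, x, y, eps=8/255, step=2/255, n_inner=5):
    delta = torch.zeros_like(x, requires_grad=True)

    # One full forward/backward pass: p is the gradient of the loss with
    # respect to the first layer's output, held fixed afterwards.
    z = first_layer(x + delta)
    z.retain_grad()
    F.cross_entropy(rest(z), y).backward()
    p = z.grad.detach()

    # Adversary updates reuse p and only propagate through the first layer.
    for _ in range(n_inner):
        z = first_layer(x + delta)
        grad_delta, = torch.autograd.grad((p * z).sum(), delta)
        delta = (delta + step * grad_delta.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()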