Understanding and Improving Layer Normalization
Layer normalization (LayerNorm) is a technique to normalize the distributions
of intermediate layers. It enables smoother gradients, faster training, and
better generalization accuracy. However, it is still unclear where the
effectiveness stems from. In this paper, our main contribution is to take a
step further in understanding LayerNorm. Many previous studies believe that
the success of LayerNorm comes from forward normalization. Unlike them, we find
that the derivatives of the mean and variance matter more than forward
normalization: they re-center and re-scale the backward gradients. Furthermore,
we find that the parameters of LayerNorm, namely the bias and gain, increase
the risk of over-fitting and provide no benefit in most cases. Experiments show that a
simple version of LayerNorm (LayerNorm-simple) without the bias and gain
outperforms LayerNorm on four datasets. It obtains state-of-the-art
performance on En-Vi machine translation. To address the over-fitting problem,
we propose a new normalization method, Adaptive Normalization (AdaNorm), by
replacing the bias and gain with a new transformation function. Experiments
show that AdaNorm demonstrates better results than LayerNorm on seven out of
eight datasets.
Comment: Accepted by NeurIPS 2019.
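The two variants are compact enough to sketch. Below is a minimal PyTorch rendering: LayerNorm-simple is plain standardization without the learnable bias and gain, while the AdaNorm function follows our reading of the paper, rescaling the normalized input y by C(1 - ky) with the scale detached from the gradient; C and k are assumed hyperparameter values, not numbers given in this abstract.

```python
import torch

def layernorm_simple(x, eps=1e-5):
    # LayerNorm without the learnable bias and gain: re-center and
    # re-scale each example over its last dimension only.
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / (sigma + eps)

def adanorm(x, C=1.0, k=0.1, eps=1e-5):
    # AdaNorm, as we read the paper: multiply the normalized input y
    # by phi(y) = C * (1 - k * y), detached so the transformation does
    # not feed extra terms into the backward pass. C and k are assumed
    # hyperparameters.
    y = layernorm_simple(x, eps)
    return (C * (1.0 - k * y)).detach() * y

x = torch.randn(2, 8, 16)
print(layernorm_simple(x).shape, adanorm(x).shape)
```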
Learning Neural Network Classifiers with Low Model Complexity
Modern neural network architectures for large-scale learning tasks have
substantially higher model complexity, which makes these architectures
difficult to understand, visualize, and train. Recent contributions to deep
learning techniques have focused on architectural modifications to improve
parameter efficiency and performance. In this paper, we derive a continuous and
differentiable error functional for a neural network that minimizes its
empirical error as well as a measure of the model complexity. The latter
measure is obtained by deriving a differentiable upper bound on the
Vapnik-Chervonenkis (VC) dimension of the classifier layer of a class of deep
networks. Using standard backpropagation, we realize a training rule that tries
to minimize the error on training samples, while improving generalization by
keeping the model complexity low. We demonstrate the effectiveness of our
formulation (the Low Complexity Neural Network - LCNN) across several deep
learning algorithms, and a variety of large benchmark datasets. We show that
hidden layer neurons in the resultant networks learn features that are crisp,
and in the case of image datasets, quantitatively sharper. Our proposed
approach yields benefits across a wide range of architectures, in comparison to
and in conjunction with methods such as Dropout and Batch Normalization, and
our results strongly suggest that deep learning techniques can benefit from
model complexity control methods such as the LCNN learning rule.
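As a rough illustration of the training rule, the sketch below adds a differentiable complexity penalty on the classifier layer to the empirical error; the squared-norm term is a stand-in surrogate of our own, since the paper derives its own differentiable upper bound on the VC dimension, and the names LowComplexityNet and lam are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowComplexityNet(nn.Module):
    def __init__(self, in_dim=784, hidden=256, classes=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, classes)  # the classifier layer

    def forward(self, x):
        return self.head(self.body(x))

def lcnn_loss(model, x, y, lam=1e-3):
    # Empirical error plus a differentiable penalty on the classifier
    # layer's complexity. The squared weight norm is a surrogate; the
    # paper uses its own differentiable VC-dimension bound here.
    ce = F.cross_entropy(model(x), y)
    complexity = model.head.weight.pow(2).sum()
    return ce + lam * complexity

model = LowComplexityNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
lcnn_loss(model, x, y).backward()  # standard backpropagation covers both terms
```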
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative
adversarial networks (GANs) with a view towards multi-functional applications.
Our improved video GAN model does not separate foreground from background nor
dynamic from static patterns, but learns to generate the entire video clip
conjointly. Our model can thus be trained to generate - and learn from - a
broad set of videos with no restriction. This is achieved by designing a robust
one-stream video generation architecture with an extension of the
state-of-the-art Wasserstein GAN framework that allows for better convergence.
The experimental results show that our improved video GAN model outperforms
state-of-the-art video generative models on multiple challenging datasets.
Furthermore, we demonstrate the superiority of our model by successfully
extending it to three challenging problems: video colorization, video
inpainting, and future prediction. To the best of our knowledge, this is the
first work using GANs to colorize and inpaint video clips.
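The Wasserstein-GAN machinery the abstract builds on can be sketched compactly. The critic loss below uses the gradient-penalty variant (WGAN-GP), a common convergence-improving extension; whether it matches the paper's exact extension is an assumption, and the toy critic is purely illustrative.

```python
import torch
import torch.nn as nn

def wgan_gp_critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein critic loss plus a gradient penalty on random
    # interpolates between real and fake clips (WGAN-GP).
    loss = critic(fake).mean() - critic(real).mean()
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return loss + gp_weight * penalty

# Toy one-stream critic over whole clips: (batch, channels, time, H, W).
critic = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 1))
real = torch.randn(4, 3, 8, 32, 32)
fake = torch.randn(4, 3, 8, 32, 32)
print(wgan_gp_critic_loss(critic, real, fake))
```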
Static Activation Function Normalization
Recent seminal work at the intersection of deep neural network practice and
random matrix theory has linked the convergence speed and robustness of these
networks with the combination of random weight initialization and nonlinear
activation function in use. Building on those principles, we introduce a
process to transform an existing activation function into another one with
better properties. We term this transform \emph{static activation
normalization}. More specifically, we focus on this normalization applied to the
ReLU unit, and show empirically that it significantly promotes convergence
robustness, maximum training depth, and anytime performance. We verify these
claims by examining empirical eigenvalue distributions of networks trained with
those activations. Our static activation normalization provides a first step
towards giving benefits similar in spirit to schemes like batch normalization,
but without the computational cost.
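Applied to ReLU, one natural instantiation, and it is only our assumption that this matches the paper's transform, is to affinely shift and scale the unit so that its output is zero-mean and unit-variance for standard-normal input; the constants follow from E[relu(x)] = 1/sqrt(2*pi) and Var[relu(x)] = 1/2 - 1/(2*pi) when x ~ N(0, 1).

```python
import math
import torch

# For x ~ N(0, 1): E[relu(x)] = 1/sqrt(2*pi) and E[relu(x)^2] = 1/2,
# hence Var[relu(x)] = 1/2 - 1/(2*pi).
_MEAN = 1.0 / math.sqrt(2.0 * math.pi)
_STD = math.sqrt(0.5 - 1.0 / (2.0 * math.pi))

def normalized_relu(x):
    # Static normalization: a fixed affine map, so there are no batch
    # statistics to track and essentially no extra computational cost.
    return (torch.relu(x) - _MEAN) / _STD

x = torch.randn(1_000_000)
y = normalized_relu(x)
print(round(y.mean().item(), 3), round(y.std().item(), 3))  # ~0.0, ~1.0
```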
Understanding Dropout as an Optimization Trick
As one of the standard approaches to training deep neural networks, dropout has been
applied to regularize large models to avoid overfitting, and the improvement in
performance by dropout has been explained as avoiding co-adaptation between
nodes. However, when correlations between nodes are compared after training
networks with or without dropout, the question arises whether co-adaptation
avoidance completely explains the dropout effect. In this paper, we propose an
additional explanation of why dropout works and propose a new technique to
design better activation functions. First, we show that dropout can be
explained as an optimization technique that pushes the input towards the
saturation area of the nonlinear activation function by accelerating gradient
flow even in the saturation area during backpropagation. Based on this
explanation, we propose a new technique for activation functions, {\em gradient
acceleration in activation function (GAAF)}, that accelerates gradients to flow
even in the saturation area. Then, input to the activation function can climb
onto the saturation area which makes the network more robust because the model
converges on a flat region. Experiment results support our explanation of
dropout and confirm that the proposed GAAF technique improves image
classification performance with the expected properties.
Comment: 16 pages.
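One way to realize gradient acceleration, sketched below under our reading of the abstract, is to add a tiny sawtooth to the activation: its value is bounded by 1/(2n), so the forward pass is nearly unchanged, while its derivative is 1 almost everywhere, so gradients keep flowing in the saturation area; the amplitude parameter n is an assumed hyperparameter.

```python
import torch

def gaaf(x, activation=torch.sigmoid, n=100.0):
    # Gradient-accelerated activation: g(x) = (n*x - floor(n*x) - 0.5)/n
    # stays within [-1/(2n), 1/(2n)] but has derivative 1 almost
    # everywhere (floor contributes zero gradient in autograd), so the
    # sum still passes gradient through saturated regions.
    g = (n * x - torch.floor(n * x) - 0.5) / n
    return activation(x) + g

x = torch.linspace(-8.0, 8.0, 9, requires_grad=True)
gaaf(x).sum().backward()
print(x.grad)  # ~sigmoid'(x) + 1: nonzero even deep in saturation
```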
Iterative Normalization: Beyond Standardization towards Efficient Whitening
Batch Normalization (BN) is ubiquitously employed for accelerating neural
network training and improving the generalization capability by performing
standardization within mini-batches. Decorrelated Batch Normalization (DBN)
further boosts the above effectiveness by whitening. However, DBN relies
heavily on either a large batch size, or eigen-decomposition that suffers from
poor efficiency on GPUs. We propose Iterative Normalization (IterNorm), which
employs Newton's iterations for much more efficient whitening, while
simultaneously avoiding eigen-decomposition. Furthermore, we conduct a
comprehensive study showing that IterNorm has a better trade-off between
optimization and generalization, with theoretical and experimental support. To
this end, we introduce Stochastic Normalization Disturbance (SND), which
measures the inherent stochastic uncertainty of samples under normalization
operations. With the support of SND, we provide natural
explanations to several phenomena from the perspective of optimization, e.g.,
why group-wise whitening of DBN generally outperforms full-whitening and why
the accuracy of BN degenerates with reduced batch sizes. We demonstrate the
consistently improved performance of IterNorm with extensive experiments on
CIFAR-10 and ImageNet over BN and DBN.
Comment: Accepted to CVPR 2019. The code is available at
https://github.com/huangleiBuaa/IterNorm
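The core computation is easy to sketch: normalize the covariance by its trace so Newton's iteration converges, then iterate towards the inverse square root without any eigen-decomposition. The per-batch, ungrouped version below is a simplification of the full IterNorm layer.

```python
import torch

def iternorm_whiten(x, n_iter=7, eps=1e-5):
    # Whiten (batch, dim) features with Newton's iteration for
    # Sigma^(-1/2), the heart of IterNorm, instead of eigen-decomposition.
    xc = x - x.mean(dim=0, keepdim=True)
    d = xc.size(1)
    sigma = xc.t() @ xc / xc.size(0) + eps * torch.eye(d)
    trace = sigma.diagonal().sum()
    sigma_n = sigma / trace  # trace-normalized so the iteration converges
    p = torch.eye(d)
    for _ in range(n_iter):
        # Newton's iteration: P_{k+1} = (3 P_k - P_k^3 Sigma_N) / 2
        p = 0.5 * (3.0 * p - p @ p @ p @ sigma_n)
    whitening = p / trace.sqrt()  # approximates Sigma^(-1/2)
    return xc @ whitening.t()

x = torch.randn(256, 16) @ torch.randn(16, 16)  # correlated features
xw = iternorm_whiten(x)
cov = xw.t() @ xw / xw.size(0)
# Near identity for well-conditioned data; with few iterations IterNorm
# deliberately trades exact whitening for efficiency.
print((cov - torch.eye(16)).abs().max())
```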
Improving Back-Propagation by Adding an Adversarial Gradient
The back-propagation algorithm is widely used for learning in artificial
neural networks. A challenge in machine learning is to create models that
generalize to new data samples not seen in the training data. Recently, a
common flaw in several machine learning algorithms was discovered: small
perturbations added to the input data lead to consistent misclassification of
data samples. Samples that easily mislead the model are called adversarial
examples. Training a "maxout" network on adversarial examples has been shown
to decrease this vulnerability and also to increase classification
performance. This
paper shows that adversarial training has a regularizing effect also in
networks with logistic, hyperbolic tangent and rectified linear units. A simple
extension to the back-propagation method is proposed, that adds an adversarial
gradient to the training. The extension requires an additional forward and
backward pass to calculate a modified input sample, or mini batch, used as
input for standard back-propagation learning. The first experimental results on
MNIST show that the "adversarial back-propagation" method increases the
resistance to adversarial examples and boosts the classification performance.
The extension reduces the classification error on the permutation invariant
MNIST from 1.60% to 0.95% in a logistic network, and from 1.40% to 0.78% in a
network with rectified linear units. Results on CIFAR-10 indicate that the
method has a regularizing effect similar to dropout in fully connected
networks. Based on these promising results, adversarial back-propagation is
proposed as a stand-alone regularizing method that should be further
investigated.
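The extension can be sketched in a few lines: one extra forward and backward pass produces a modified input, which then feeds standard back-propagation. The sign-of-gradient step below is the classic fast-gradient form and an assumption on our part; eps is an illustrative perturbation size, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_backprop_step(model, optimizer, x, y, eps=0.08):
    # Extra pass: compute the adversarial gradient w.r.t. the input
    # and build a modified sample that increases the loss.
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), y), x)
    x_adv = (x + eps * grad.sign()).detach()
    # Standard back-propagation on the modified sample.
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    optimizer.step()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
adversarial_backprop_step(model, opt,
                          torch.randn(32, 1, 28, 28),
                          torch.randint(0, 10, (32,)))
```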
Regularization Methods for Generative Adversarial Networks: An Overview of Recent Studies
Despite its short history, Generative Adversarial Network (GAN) has been
extensively studied and used for various tasks, including its original purpose,
i.e., synthetic sample generation. However, applying GAN to different data
types with diverse neural network architectures has been hindered by its
unstable training, during which the model easily diverges. This notoriously
difficult training of GANs has been addressed in numerous studies.
Consequently, in order to make the training of GAN stable, numerous
regularization methods have been proposed in recent years. This paper reviews
the regularization methods that have been recently introduced, most of which
have been published in the last three years. Specifically, we focus on general
methods that can be commonly used regardless of neural network architectures.
To explore the latest research trends in the regularization for GANs, the
methods are classified into several groups by their operation principles, and
the differences between the methods are analyzed. Furthermore, to provide
practical knowledge of using these methods, we investigate popular methods that
have been frequently employed in state-of-the-art GANs. In addition, we discuss
the limitations of existing methods and propose future research directions.
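To make the "popular methods" concrete: spectral normalization is one regularizer frequently employed in state-of-the-art GANs and is the kind of architecture-agnostic method such a survey covers. The toy discriminator below only illustrates where the wrapper goes.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization constrains each wrapped layer's spectral norm
# to roughly 1, enforcing a Lipschitz bound on the discriminator, which
# is a common way to stabilize GAN training.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),
)

x = torch.randn(4, 3, 32, 32)
print(discriminator(x).shape)  # torch.Size([4, 1])
```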
Deep Within-Class Covariance Analysis for Robust Audio Representation Learning
Convolutional Neural Networks (CNNs) can learn effective features, though they
have been shown to suffer from a performance drop when the distribution of the
data changes from training to test data. In this paper we analyze the internal
representations of CNNs and observe that the representations of unseen data in
each class spread more (with higher variance) in the embedding space of the
CNN compared to representations of the training data. More importantly, this
difference is more extreme if the unseen data comes from a shifted
distribution. Based on this observation, we objectively evaluate the degree of
representation variance in each class via eigenvalue decomposition of the
within-class covariance of the internal representations of CNNs and observe the
same behaviour. This can be problematic, as larger variances might lead to
misclassification if a sample crosses the decision boundary of its class. We
apply nearest-neighbor classification on the representations and empirically
show that the embeddings with high variance indeed have significantly worse
KNN classification performance, although this could not be foreseen from
their end-to-end classification results. To tackle this problem, we propose
Deep Within-Class Covariance Analysis (DWCCA), a deep neural network layer that
significantly reduces the within-class covariance of a DNN's representation,
improving performance on unseen test data from a shifted distribution. We
empirically evaluate DWCCA on two datasets for Acoustic Scene Classification
(DCASE2016 and DCASE2017). We demonstrate that not only does DWCCA
significantly improve the network's internal representation, it also increases
the end-to-end classification accuracy, especially when the test set exhibits a
distribution shift. By adding DWCCA to a VGG network, we achieve around 6
percentage points of improvement in the case of a distribution mismatch.
Comment: 11 pages, 3 tables, 4 figures.
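The quantity being reduced is easy to write down. The paper's DWCCA is a trainable layer; the loss-style sketch below is our own simplification that just measures (and could penalize) the within-class covariance of a batch of embeddings.

```python
import torch

def within_class_covariance(z, y, num_classes):
    # Average trace of the per-class covariance of embeddings z with
    # labels y; large values mean classes spread widely in the
    # embedding space, the symptom the paper connects to worse KNN
    # accuracy under distribution shift.
    total = z.new_zeros(())
    for c in range(num_classes):
        zc = z[y == c]
        if zc.size(0) < 2:
            continue
        centered = zc - zc.mean(dim=0, keepdim=True)
        cov = centered.t() @ centered / (zc.size(0) - 1)
        total = total + cov.diagonal().sum()
    return total / num_classes

z = torch.randn(64, 32)           # a batch of CNN embeddings
y = torch.randint(0, 10, (64,))   # their class labels
print(within_class_covariance(z, y, 10))
```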
Normalized Attention Without Probability Cage
Attention architectures are widely used; they recently gained renewed
popularity with Transformers yielding a streak of state of the art results.
Yet, the geometrical implications of softmax-attention remain largely
unexplored. In this work we highlight the limitations of constraining attention
weights to the probability simplex and the resulting convex hull of value
vectors. We show that, at initialization, Transformers have a
sequence-length-dependent bias towards token isolation, and we contrast
Transformers with simple max- and sum-pooling, two strong baselines that are
rarely reported. We propose to replace the softmax in self-attention with
normalization, yielding a generally applicable architecture that is robust to
hyperparameters and data bias. We support our insights
with empirical results from more than 25,000 trained models. All results and
implementations are made available.
Comment: Preprint, work in progress. Feedback welcome.
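A minimal sketch of the proposal, with one assumption: we normalize each row of the score matrix to zero mean and unit variance, though the paper's exact normalization may differ. Because the weights are no longer a probability distribution, the output can leave the convex hull of the value vectors.

```python
import torch

def normalized_attention(q, k, v, eps=1e-5):
    # Self-attention where the usual softmax over scores is replaced by
    # per-query normalization of the scores; weights may be negative.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    mu = scores.mean(dim=-1, keepdim=True)
    sigma = scores.std(dim=-1, keepdim=True)
    weights = (scores - mu) / (sigma + eps)
    return weights @ v

q = k = v = torch.randn(2, 16, 64)  # (batch, sequence, dim)
print(normalized_attention(q, k, v).shape)  # torch.Size([2, 16, 64])
```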