77,849 research outputs found

    Understanding and Improving Layer Normalization

    Full text link
    Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), by replacing the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.Comment: Accepted by NeurIPS 201

    Learning Neural Network Classifiers with Low Model Complexity

    Full text link
    Modern neural network architectures for large-scale learning tasks have substantially higher model complexities, which makes understanding, visualizing and training these architectures difficult. Recent contributions to deep learning techniques have focused on architectural modifications to improve parameter efficiency and performance. In this paper, we derive a continuous and differentiable error functional for a neural network that minimizes its empirical error as well as a measure of the model complexity. The latter measure is obtained by deriving a differentiable upper bound on the Vapnik-Chervonenkis (VC) dimension of the classifier layer of a class of deep networks. Using standard backpropagation, we realize a training rule that tries to minimize the error on training samples, while improving generalization by keeping the model complexity low. We demonstrate the effectiveness of our formulation (the Low Complexity Neural Network - LCNN) across several deep learning algorithms, and a variety of large benchmark datasets. We show that hidden layer neurons in the resultant networks learn features that are crisp, and in the case of image datasets, quantitatively sharper. Our proposed approach yields benefits across a wide range of architectures, in comparison to and in conjunction with methods such as Dropout and Batch Normalization, and our results strongly suggest that deep learning techniques can benefit from model complexity control methods such as the LCNN learning rule

    Improving Video Generation for Multi-functional Applications

    Full text link
    In this paper, we aim to improve the state-of-the-art video generative adversarial networks (GANs) with a view towards multi-functional applications. Our improved video GAN model does not separate foreground from background nor dynamic from static patterns, but learns to generate the entire video clip conjointly. Our model can thus be trained to generate - and learn from - a broad set of videos with no restriction. This is achieved by designing a robust one-stream video generation architecture with an extension of the state-of-the-art Wasserstein GAN framework that allows for better convergence. The experimental results show that our improved video GAN model outperforms state-of-theart video generative models on multiple challenging datasets. Furthermore, we demonstrate the superiority of our model by successfully extending it to three challenging problems: video colorization, video inpainting, and future prediction. To the best of our knowledge, this is the first work using GANs to colorize and inpaint video clips

    Static Activation Function Normalization

    Full text link
    Recent seminal work at the intersection of deep neural networks practice and random matrix theory has linked the convergence speed and robustness of these networks with the combination of random weight initialization and nonlinear activation function in use. Building on those principles, we introduce a process to transform an existing activation function into another one with better properties. We term such transform \emph{static activation normalization}. More specifically we focus on this normalization applied to the ReLU unit, and show empirically that it significantly promotes convergence robustness, maximum training depth, and anytime performance. We verify these claims by examining empirical eigenvalue distributions of networks trained with those activations. Our static activation normalization provides a first step towards giving benefits similar in spirit to schemes like batch normalization, but without computational cost

    Understanding Dropout as an Optimization Trick

    Full text link
    As one of standard approaches to train deep neural networks, dropout has been applied to regularize large models to avoid overfitting, and the improvement in performance by dropout has been explained as avoiding co-adaptation between nodes. However, when correlations between nodes are compared after training the networks with or without dropout, one question arises if co-adaptation avoidance explains the dropout effect completely. In this paper, we propose an additional explanation of why dropout works and propose a new technique to design better activation functions. First, we show that dropout can be explained as an optimization technique to push the input towards the saturation area of nonlinear activation function by accelerating gradient information flowing even in the saturation area in backpropagation. Based on this explanation, we propose a new technique for activation functions, {\em gradient acceleration in activation function (GAAF)}, that accelerates gradients to flow even in the saturation area. Then, input to the activation function can climb onto the saturation area which makes the network more robust because the model converges on a flat region. Experiment results support our explanation of dropout and confirm that the proposed GAAF technique improves image classification performance with expected properties.Comment: 16 page

    Iterative Normalization: Beyond Standardization towards Efficient Whitening

    Full text link
    Batch Normalization (BN) is ubiquitously employed for accelerating neural network training and improving the generalization capability by performing standardization within mini-batches. Decorrelated Batch Normalization (DBN) further boosts the above effectiveness by whitening. However, DBN relies heavily on either a large batch size, or eigen-decomposition that suffers from poor efficiency on GPUs. We propose Iterative Normalization (IterNorm), which employs Newton's iterations for much more efficient whitening, while simultaneously avoiding the eigen-decomposition. Furthermore, we develop a comprehensive study to show IterNorm has better trade-off between optimization and generalization, with theoretical and experimental support. To this end, we exclusively introduce Stochastic Normalization Disturbance (SND), which measures the inherent stochastic uncertainty of samples when applied to normalization operations. With the support of SND, we provide natural explanations to several phenomena from the perspective of optimization, e.g., why group-wise whitening of DBN generally outperforms full-whitening and why the accuracy of BN degenerates with reduced batch sizes. We demonstrate the consistently improved performance of IterNorm with extensive experiments on CIFAR-10 and ImageNet over BN and DBN.Comment: Accepted to CVPR 2019. The Code is available at https://github.com/huangleiBuaa/IterNor

    Improving Back-Propagation by Adding an Adversarial Gradient

    Full text link
    The back-propagation algorithm is widely used for learning in artificial neural networks. A challenge in machine learning is to create models that generalize to new data samples not seen in the training data. Recently, a common flaw in several machine learning algorithms was discovered: small perturbations added to the input data lead to consistent misclassification of data samples. Samples that easily mislead the model are called adversarial examples. Training a "maxout" network on adversarial examples has shown to decrease this vulnerability, but also increase classification performance. This paper shows that adversarial training has a regularizing effect also in networks with logistic, hyperbolic tangent and rectified linear units. A simple extension to the back-propagation method is proposed, that adds an adversarial gradient to the training. The extension requires an additional forward and backward pass to calculate a modified input sample, or mini batch, used as input for standard back-propagation learning. The first experimental results on MNIST show that the "adversarial back-propagation" method increases the resistance to adversarial examples and boosts the classification performance. The extension reduces the classification error on the permutation invariant MNIST from 1.60% to 0.95% in a logistic network, and from 1.40% to 0.78% in a network with rectified linear units. Results on CIFAR-10 indicate that the method has a regularizing effect similar to dropout in fully connected networks. Based on these promising results, adversarial back-propagation is proposed as a stand-alone regularizing method that should be further investigated

    Regularization Methods for Generative Adversarial Networks: An Overview of Recent Studies

    Full text link
    Despite its short history, Generative Adversarial Network (GAN) has been extensively studied and used for various tasks, including its original purpose, i.e., synthetic sample generation. However, applying GAN to different data types with diverse neural network architectures has been hindered by its limitation in training, where the model easily diverges. Such a notorious training of GANs is well known and has been addressed in numerous studies. Consequently, in order to make the training of GAN stable, numerous regularization methods have been proposed in recent years. This paper reviews the regularization methods that have been recently introduced, most of which have been published in the last three years. Specifically, we focus on general methods that can be commonly used regardless of neural network architectures. To explore the latest research trends in the regularization for GANs, the methods are classified into several groups by their operation principles, and the differences between the methods are analyzed. Furthermore, to provide practical knowledge of using these methods, we investigate popular methods that have been frequently employed in state-of-the-art GANs. In addition, we discuss the limitations in existing methods and propose future research directions

    Deep Within-Class Covariance Analysis for Robust Audio Representation Learning

    Full text link
    Convolutional Neural Networks (CNNs) can learn effective features, though have been shown to suffer from a performance drop when the distribution of the data changes from training to test data. In this paper we analyze the internal representations of CNNs and observe that the representations of unseen data in each class, spread more (with higher variance) in the embedding space of the CNN compared to representations of the training data. More importantly, this difference is more extreme if the unseen data comes from a shifted distribution. Based on this observation, we objectively evaluate the degree of representation's variance in each class via eigenvalue decomposition on the within-class covariance of the internal representations of CNNs and observe the same behaviour. This can be problematic as larger variances might lead to mis-classification if the sample crosses the decision boundary of its class. We apply nearest neighbor classification on the representations and empirically show that the embeddings with the high variance actually have significantly worse KNN classification performances, although this could not be foreseen from their end-to-end classification results. To tackle this problem, we propose Deep Within-Class Covariance Analysis (DWCCA), a deep neural network layer that significantly reduces the within-class covariance of a DNN's representation, improving performance on unseen test data from a shifted distribution. We empirically evaluate DWCCA on two datasets for Acoustic Scene Classification (DCASE2016 and DCASE2017). We demonstrate that not only does DWCCA significantly improve the network's internal representation, it also increases the end-to-end classification accuracy, especially when the test set exhibits a distribution shift. By adding DWCCA to a VGG network, we achieve around 6 percentage points improvement in the case of a distribution mismatch.Comment: 11 pages, 3 tables, 4 figure

    Normalized Attention Without Probability Cage

    Full text link
    Attention architectures are widely used; they recently gained renewed popularity with Transformers yielding a streak of state of the art results. Yet, the geometrical implications of softmax-attention remain largely unexplored. In this work we highlight the limitations of constraining attention weights to the probability simplex and the resulting convex hull of value vectors. We show that Transformers are sequence length dependent biased towards token isolation at initialization and contrast Transformers to simple max- and sum-pooling - two strong baselines rarely reported. We propose to replace the softmax in self-attention with normalization, yielding a hyperparameter and data-bias robust, generally applicable architecture. We support our insights with empirical results from more than 25,000 trained models. All results and implementations are made available.Comment: Preprint, work in progress. Feedback welcom
    corecore