Adding Gradient Noise Improves Learning for Very Deep Networks
Deep feedforward and recurrent networks have achieved impressive results in
many perception and language processing applications. This success is partially
attributed to architectural innovations such as convolutional and long
short-term memory networks. The main motivation for these architectural
innovations is that they capture better domain knowledge, and importantly are
easier to optimize than more basic architectures. Recently, more complex
architectures such as Neural Turing Machines and Memory Networks have been
proposed for tasks including question answering and general computation,
creating a new set of optimization challenges. In this paper, we discuss a
low-overhead and easy-to-implement technique of adding gradient noise which we
find to be surprisingly effective when training these very deep architectures.
The technique not only helps to avoid overfitting, but also can result in lower
training loss. This method alone allows a fully-connected 20-layer deep network
to be trained with standard gradient descent, even starting from a poor
initialization. We see consistent improvements for many complex models,
including a 72% relative reduction in error rate over a carefully-tuned
baseline on a challenging question-answering task, and a doubling of the number
of accurate binary multiplication models learned across 7,000 random restarts.
We encourage further application of this technique to additional complex modern
architectures.
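As an illustration of the technique described above, the following is a minimal PyTorch sketch of annealed Gaussian gradient noise added before each optimizer step; the decaying variance schedule and the constants eta and gamma are illustrative choices rather than the paper's tuned values.
```python
import torch

def add_gradient_noise(model, step, eta=0.01, gamma=0.55):
    """Add zero-mean Gaussian noise to every parameter gradient before the optimizer step."""
    # Illustrative annealing schedule: variance eta / (1 + step)^gamma.
    sigma = (eta / (1 + step) ** gamma) ** 0.5
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * sigma)

# Usage inside a standard training loop (sketch):
#   loss.backward()
#   add_gradient_noise(model, step)
#   optimizer.step()
```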
Improving Noise Tolerance of Mixed-Signal Neural Networks
Mixed-signal hardware accelerators for deep learning achieve orders of
magnitude better power efficiency than their digital counterparts. In the
ultra-low power consumption regime, limited signal precision inherent to analog
computation becomes a challenge. We perform a case study of a 6-layer
convolutional neural network running on a mixed-signal accelerator and evaluate
its sensitivity to hardware specific noise. We apply various methods to improve
noise robustness of the network and demonstrate an effective way to optimize
useful signal ranges through adaptive signal clipping. The resulting model is
robust enough to achieve 80.2% classification accuracy on the CIFAR-10 dataset with
just a 1.4 mW power budget, while a 6 mW budget allows us to achieve 87.1%
accuracy, which is within 1% of the software baseline. For comparison, the
unoptimized version of the same model achieves only 67.7% accuracy at 1.4 mW
and 78.6% at 6 mW.
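The following is a hedged sketch of the general idea of optimizing useful signal ranges: activations are clipped to a learnable range and hardware-like noise is simulated during training so the network adapts to it. The module name, clip initialization, and noise level are illustrative assumptions, not the paper's implementation.
```python
import torch
import torch.nn as nn

class NoisyClippedReLU(nn.Module):
    def __init__(self, clip=6.0, noise_std=0.05):
        super().__init__()
        self.clip = nn.Parameter(torch.tensor(clip))  # learnable clipping level
        self.noise_std = noise_std                    # stand-in for analog signal noise

    def forward(self, x):
        # Restrict activations to the usable signal range.
        y = torch.minimum(torch.relu(x), self.clip)
        if self.training:
            # Simulate hardware noise during training so the model becomes robust to it.
            y = y + self.noise_std * torch.randn_like(y)
        return y
```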
Differentially Private Dropout
Large data collections required for the training of neural networks often
contain sensitive information such as the medical histories of patients, and
the privacy of the training data must be preserved. In this paper, we introduce
a dropout technique that admits an elegant Bayesian interpretation, and show that
its intrinsic noise, added primarily for regularization, can be exploited to
obtain a degree of differential privacy.
The iterative nature of training neural networks presents a challenge for
privacy-preserving estimation since multiple iterations increase the amount of
noise added. We overcome this by using a relaxed notion of differential
privacy, called concentrated differential privacy, which provides tighter
estimates on the overall privacy loss. We demonstrate the accuracy of our
privacy-preserving dropout algorithm on benchmark datasets.
Introducing Noise in Decentralized Training of Neural Networks
It has been shown that injecting noise into the neural network weights during
the training process leads to a better generalization of the resulting model.
Noise injection in the distributed setup is a straightforward technique and it
represents a promising approach to improve the locally trained models. We
investigate the effects of noise injection into neural networks during a
decentralized training process. We show both theoretically and empirically that
noise injection has no positive effect in expectation on linear models. For
non-linear neural networks, however, we empirically show that noise injection
substantially improves model quality, helping a locally trained model reach a
generalization ability close to the serial baseline.
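A minimal sketch of the kind of weight-noise injection discussed above, assuming simple additive Gaussian perturbations at each local training step; the noise scale and the surrounding decentralized averaging protocol are illustrative assumptions rather than the paper's exact setup.
```python
import torch

def local_step_with_weight_noise(model, optimizer, loss_fn, batch, noise_std=1e-3):
    """One local training step with additive Gaussian noise injected into the weights."""
    inputs, targets = batch
    with torch.no_grad():
        for p in model.parameters():
            # Perturb the weights before the step so training sees the noisy model.
            p.add_(noise_std * torch.randn_like(p))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```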
Differentially Private Variational Dropout
Deep neural networks with their large number of parameters are highly
flexible learning systems. This high flexibility, however, brings with it serious
problems such as overfitting, and regularization is used to address them. A
currently popular and effective regularization technique for controlling
overfitting is dropout. Often, large data
collections required for neural networks contain sensitive information such as
the medical histories of patients, and the privacy of the training data should
be protected. In this paper, we modify the recently proposed variational
dropout technique which provided an elegant Bayesian interpretation to dropout,
and show that the intrinsic noise in the variational dropout can be exploited
to obtain a degree of differential privacy. The iterative nature of training
neural networks presents a challenge for privacy-preserving estimation since
multiple iterations increase the amount of noise added. We overcome this by
using a relaxed notion of differential privacy, called concentrated
differential privacy, which provides tighter estimates on the overall privacy
loss. We demonstrate the accuracy of our privacy-preserving variational dropout
algorithm on benchmark datasets.
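Both this abstract and the earlier dropout one rely on the multiplicative noise that (variational) dropout already injects. Below is a minimal sketch of such Gaussian dropout noise; the noise variance alpha is illustrative, and the privacy accounting itself is not shown.
```python
import torch
import torch.nn as nn

class GaussianDropout(nn.Module):
    def __init__(self, alpha=0.25):
        super().__init__()
        self.alpha = alpha  # variance of the multiplicative noise (illustrative)

    def forward(self, x):
        if not self.training:
            return x
        # Multiply activations by N(1, alpha) noise instead of zeroing them out.
        noise = 1.0 + self.alpha ** 0.5 * torch.randn_like(x)
        return x * noise
```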
CEM-RL: Combining evolutionary and gradient-based methods for policy search
Deep neuroevolution and deep reinforcement learning (deep RL) algorithms are
two popular approaches to policy search. The former is widely applicable and
rather stable, but suffers from low sample efficiency. By contrast, the latter
is more sample efficient, but the most sample efficient variants are also
rather unstable and highly sensitive to hyper-parameter setting. So far, these
families of methods have mostly been compared as competing tools. However, an
emerging approach consists in combining them so as to get the best of both
worlds. Two previously existing combinations use either an ad hoc evolutionary
algorithm or a goal exploration process together with the Deep Deterministic
Policy Gradient (DDPG) algorithm, a sample efficient off-policy deep RL
algorithm. In this paper, we propose a different combination scheme using the
simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy
gradient (TD3), another off-policy deep RL algorithm which improves over DDPG.
We evaluate the resulting method, CEM-RL, on a set of benchmarks classically
used in deep RL. We show that CEM-RL benefits from several advantages over its
competitors and offers a satisfactory trade-off between performance and sample
efficiency.
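A high-level sketch of the combination scheme described above: a CEM search distribution over actor parameters, with part of each population additionally refined by off-policy gradient updates (TD3 in the paper). The evaluate and rl_gradient_steps functions are placeholders the reader must supply, and the population size, elite fraction, and variance floor are illustrative.
```python
import numpy as np

def cem_rl(dim, evaluate, rl_gradient_steps, iters=100, pop=10, elite_frac=0.5):
    mean, var = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # Sample a population of actor parameter vectors from the CEM distribution.
        population = mean + np.sqrt(var) * np.random.randn(pop, dim)
        # Refine half of the individuals with deep RL gradient updates (TD3-style).
        for i in range(pop // 2):
            population[i] = rl_gradient_steps(population[i])
        # Rank by episodic return and fit the distribution to the elites.
        returns = np.array([evaluate(theta) for theta in population])
        elites = population[np.argsort(returns)[-int(elite_frac * pop):]]
        mean, var = elites.mean(axis=0), elites.var(axis=0) + 1e-3
    return mean
```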
Differentially Private Generative Adversarial Network
Generative Adversarial Network (GAN) and its variants have recently attracted
intensive research interests due to their elegant theoretical foundation and
excellent empirical performance as generative models. These tools provide a
promising direction in studies where data availability is limited. One
common issue in GANs is that the density of the learned generative distribution
could concentrate on the training data points, meaning that they can easily
remember training samples due to the high model complexity of deep networks.
This becomes a major concern when GANs are applied to private or sensitive data
such as patient medical records, and the concentration of distribution may
divulge critical patient information. To address this issue, in this paper we
propose a differentially private GAN (DPGAN) model, in which we achieve
differential privacy in GANs by adding carefully designed noise to gradients
during the learning procedure. We provide rigorous proof for the privacy
guarantee, as well as comprehensive empirical evidence to support our analysis,
where we demonstrate that our method can generate high quality data points at a
reasonable privacy level.
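A simplified sketch of the core mechanism named above: gradients are clipped to bound their sensitivity and Gaussian noise is added before each discriminator update. The clip norm and noise multiplier are illustrative, and the per-example accounting required for a formal differential-privacy guarantee is omitted.
```python
import torch

def noisy_clipped_step(discriminator, optimizer, loss, clip_norm=1.0, noise_mult=1.1):
    """One discriminator update with clipped gradients and additive Gaussian noise."""
    optimizer.zero_grad()
    loss.backward()
    # Bound the gradient norm so the noise scale controls the privacy cost.
    torch.nn.utils.clip_grad_norm_(discriminator.parameters(), clip_norm)
    with torch.no_grad():
        for p in discriminator.parameters():
            if p.grad is not None:
                p.grad.add_(noise_mult * clip_norm * torch.randn_like(p.grad))
    optimizer.step()
```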
Understanding Dropout as an Optimization Trick
As one of the standard approaches to training deep neural networks, dropout has been
applied to regularize large models to avoid overfitting, and the improvement in
performance by dropout has been explained as avoiding co-adaptation between
nodes. However, when correlations between nodes are compared after training the
networks with or without dropout, the question arises whether co-adaptation
avoidance completely explains the dropout effect. In this paper, we propose an
additional explanation of why dropout works and propose a new technique to
design better activation functions. First, we show that dropout can be
explained as an optimization technique that pushes the input towards the saturation
area of the nonlinear activation function by accelerating the flow of gradient
information even in the saturation area during backpropagation. Based on this
explanation, we propose a new technique for activation functions, {\em gradient
acceleration in activation function (GAAF)}, that accelerates gradients to flow
even in the saturation area. The input to the activation function can then climb
into the saturation area, which makes the network more robust because the model
converges in a flat region. Experimental results support our explanation of
dropout and confirm that the proposed GAAF technique improves image
classification performance with expected properties.
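As a hedged illustration of the idea, not the paper's actual GAAF function, the sketch below adds a term whose forward value is zero but whose gradient is non-zero, so gradient information keeps flowing where the sigmoid saturates; the acceleration constant is an illustrative assumption.
```python
import torch
import torch.nn as nn

class GradientAcceleratedSigmoid(nn.Module):
    def __init__(self, accel=0.1):
        super().__init__()
        self.accel = accel  # illustrative gradient-acceleration strength

    def forward(self, x):
        # (x - x.detach()) is exactly zero in the forward pass but contributes a
        # constant gradient of `accel` in the backward pass, even where sigmoid saturates.
        return torch.sigmoid(x) + self.accel * (x - x.detach())
```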
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
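A generic, hypothetical sketch of request batching in the spirit of the Batch Dispatch technique named above: requests are queued and dispatched to the GPU together rather than one at a time. All names, the timeout, and the run_model call are illustrative assumptions, not the paper's system.
```python
import queue

def batch_dispatcher(request_queue, run_model, max_batch=8, timeout_s=0.01):
    """Collect incoming requests into small batches and run them through the model together."""
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        try:
            while len(batch) < max_batch:
                batch.append(request_queue.get(timeout=timeout_s))
        except queue.Empty:
            pass  # dispatch whatever arrived within the window
        results = run_model([req["audio"] for req in batch])
        for req, res in zip(batch, results):
            req["reply"](res)  # hypothetical callback returning the transcript to its caller
```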
Confidence penalty, annealing Gaussian noise and zoneout for biLSTM-CRF networks for named entity recognition
Named entity recognition (NER) is used to identify relevant entities in text.
A bidirectional LSTM (long short-term memory) encoder with a neural conditional
random field (CRF) decoder (biLSTM-CRF) is the state-of-the-art methodology.
In this work, we analyze several methods intended to optimize the performance of
networks based on this architecture, some of which also help to avoid
overfitting. These methods target exploration of
parameter space, regularization of LSTMs and penalization of confident output
distributions. Results show that the optimization methods improve the
performance of the biLSTM-CRF NER baseline system, setting a new state-of-the-art
result for the CoNLL-2003 Spanish set with an F1 of 87.18.
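One of the methods analyzed above, the confidence penalty, can be sketched as subtracting a scaled entropy term from the cross-entropy loss so that overly confident output distributions are penalized; the weight beta is an illustrative value.
```python
import torch
import torch.nn.functional as F

def loss_with_confidence_penalty(logits, targets, beta=0.1):
    """Cross-entropy loss minus a scaled entropy bonus, penalizing over-confident outputs."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return ce - beta * entropy  # low-entropy (high-confidence) predictions are penalized
```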