Robust and Scalable Differentiable Neural Computer for Question Answering
Deep learning models are often not easily adaptable to new tasks and require
task-specific adjustments. The differentiable neural computer (DNC), a
memory-augmented neural network, is designed as a general problem solver which
can be used in a wide range of tasks. But in reality, it is hard to apply this
model to new tasks. We analyze the DNC and identify possible improvements
within the application of question answering. This motivates a more robust and
scalable DNC (rsDNC). Our objective is to keep the general character of the
model intact while making its application more reliable and its training
faster. The rsDNC is distinguished by more robust training, a slim memory unit,
and a bidirectional architecture. We not
only achieve new state-of-the-art performance on the bAbI task, but also
minimize the performance variance between different initializations.
Furthermore, we demonstrate the simplified applicability of the rsDNC to new
tasks with passable results on the CNN RC task without adaptation.
Comment: Accepted at the Workshop on Machine Reading for Question Answering (MRQA), ACL 2018. 14 pages, 5 figures.
Reducing state updates via Gaussian-gated LSTMs
Recurrent neural networks can be difficult to train on long sequence data due
to the well-known vanishing gradient problem. Some architectures incorporate
methods to reduce RNN state updates, thereby allowing the network to preserve
memory over long temporal intervals. To address these convergence problems,
this paper proposes a timing-gated LSTM RNN model, called the Gaussian-gated
LSTM (g-LSTM). The time gate controls when a neuron can be updated during
training, enabling longer memory persistence and better error-gradient flow.
This model captures long-temporal dependencies better than an LSTM and the time
gate parameters can be learned even from non-optimal initialization values.
Because the time gate limits the updates of the neuron state, the number of
computations needed for a network update is also reduced. By adding a
computational budget term to the training loss, we obtain a network that
further reduces the number of computations by at least 10x. Finally, by
employing a temporal curriculum learning schedule for the g-LSTM, we can reduce
the convergence time of the equivalent LSTM network on long sequences.
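The gating mechanism lends itself to a compact sketch. Below is a minimal PyTorch illustration, assuming the gate form k = exp(-(t - mu)^2 / (2 * sigma^2)) with one learnable center and width per hidden unit; the class name and initialization are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class GaussianGatedLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, seq_len):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.mu = nn.Parameter(torch.rand(hidden_size) * seq_len)  # gate centers
        self.log_sigma = nn.Parameter(torch.zeros(hidden_size))    # gate widths

    def forward(self, x_t, state, t):
        h_prev, c_prev = state
        h_new, c_new = self.cell(x_t, (h_prev, c_prev))
        # The gate is near 1 around t == mu and near 0 elsewhere, so most
        # steps skip the update, preserving state and the error-gradient path.
        k = torch.exp(-0.5 * ((t - self.mu) / self.log_sigma.exp()) ** 2)
        return k * h_new + (1 - k) * h_prev, k * c_new + (1 - k) * c_prev
```

Unrolling is the usual loop, `h, c = cell(x[t], (h, c), float(t))`; because k is tiny for most (time step, unit) pairs, an implementation that skips those updates is where the compute savings would come from.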
Regularizing RNNs by Stabilizing Activations
We stabilize the activations of Recurrent Neural Networks (RNNs) by
penalizing the squared distance between successive hidden states' norms.
This penalty term is an effective regularizer for RNNs including LSTMs and
IRNNs, improving performance on character-level language modeling and phoneme
recognition, and outperforming weight noise and dropout.
We achieve competitive performance (18.6% PER) on the TIMIT phoneme
recognition task for RNNs evaluated without beam search or an RNN transducer.
With this penalty term, IRNN can achieve similar performance to LSTM on
language modeling, although adding the penalty term to the LSTM results in
superior performance.
Our penalty term also prevents the exponential growth of IRNN's activations
outside of their training horizon, allowing them to generalize to much longer
sequences.
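The regularizer itself is a one-liner. A minimal sketch, assuming the hidden states of the unrolled RNN are stacked into a (T, batch, hidden) tensor; the helper name and the beta coefficient are illustrative:

```python
import torch

def norm_stabilizer_penalty(hidden_states, beta=1.0):
    """Penalize changes in the L2 norm of successive hidden states."""
    norms = hidden_states.norm(dim=-1)   # (T, batch): per-step activation norms
    diffs = norms[1:] - norms[:-1]       # norm change between successive steps
    return beta * diffs.pow(2).mean()

# Usage sketch: total = task_loss + norm_stabilizer_penalty(all_hidden, beta)
```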
Feature Selection using Stochastic Gates
Feature selection problems have been extensively studied for linear
estimation, for instance, Lasso, but less emphasis has been placed on feature
selection for non-linear functions. In this study, we propose a method for
feature selection in high-dimensional non-linear function estimation problems.
The new procedure is based on minimizing the ℓ0 norm of the vector of
indicator variables that represent whether a feature is selected or not. Our
approach relies on the continuous relaxation of Bernoulli distributions, which
allows our model to learn the parameters of the approximate Bernoulli
distributions via gradient descent. This general framework simultaneously
minimizes a loss function while selecting relevant features. Furthermore, we
provide an information-theoretic justification of incorporating Bernoulli
distribution into our approach and demonstrate the potential of the approach on
synthetic and real-life applications.
Comment: Published in ICML 2020.
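One way to realize the relaxation the abstract describes is a clipped-Gaussian gate per feature, with a penalty on the expected number of open gates, P(z_d > 0) = Phi(mu_d / sigma). The sketch below is an interpretation under that assumption; the class name, parameter names, and defaults are illustrative, not the paper's reference code.

```python
import torch
import torch.nn as nn

class StochasticGates(nn.Module):
    def __init__(self, n_features, sigma=0.5, lam=0.1):
        super().__init__()
        self.mu = nn.Parameter(0.5 * torch.ones(n_features))
        self.sigma, self.lam = sigma, lam

    def forward(self, x):
        noise = torch.randn_like(self.mu) * self.sigma if self.training else 0.0
        z = torch.clamp(self.mu + noise, 0.0, 1.0)  # relaxed Bernoulli gate
        return x * z                                # mask features end to end

    def regularizer(self):
        # Expected relaxed ell_0 norm via the standard Gaussian CDF.
        std_normal = torch.distributions.Normal(0.0, 1.0)
        return self.lam * std_normal.cdf(self.mu / self.sigma).sum()
```

Training then minimizes `task_loss + gates.regularizer()`, so feature selection and function fitting happen simultaneously, as the abstract describes.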
Why are deep nets reversible: A simple theory, with implications for training
Generative models for deep learning are promising both for improving our
understanding of the model and for yielding training methods that require
fewer labeled samples.
Recent works use generative model approaches to produce the deep net's input
given the value of a hidden layer several levels above. However, there is no
accompanying "proof of correctness" for the generative model, showing that the
feedforward deep net is the correct inference method for recovering the hidden
layer given the input. Furthermore, these models are complicated.
The current paper takes a more theoretical tack. It presents a very simple
generative model for ReLU deep nets, with the following characteristics: (i)
The generative model is just the reverse of the feedforward net: if the forward
transformation at a layer is A then the reverse transformation is A^T.
(This can be seen as an explanation of the old weight tying idea for denoising
autoencoders.) (ii) Its correctness can be proven under a clean theoretical
assumption: the edge weights in real-life deep nets behave like random numbers.
Under this assumption, which is experimentally tested on real-life nets like
AlexNet, it is formally proved that the feedforward net is a correct inference
method for recovering the hidden layer.
The generative model suggests a simple modification for training: use the
generative model to produce synthetic data with labels and include it in the
training set. Experiments support this theory of random-like deep nets and
show that this modification helps training.
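The reversibility claim can be probed numerically. Below is a small NumPy check under the random-weights assumption: for Gaussian w, E[relu(w·x) w] = (sigma^2 / 2) x, so the tied reverse map relu(c * W.T @ h) with c = 2d/n approximately recovers a nonnegative x from h = relu(W @ x). The dimensions and scaling here are choices of this sketch, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 20000                         # input dim, hidden dim (n >> d)
x = np.maximum(rng.normal(size=d), 0.0)   # a nonnegative "layer above"
W = rng.normal(size=(n, d)) / np.sqrt(d)  # random-like weights, entries N(0, 1/d)

h = np.maximum(W @ x, 0.0)                # forward: relu(W x)
# Reverse pass with tied (transposed) weights; 2d/n corrects the relu shrinkage.
x_hat = np.maximum((2.0 * d / n) * (W.T @ h), 0.0)
print("correlation(x, x_hat) =", np.corrcoef(x, x_hat)[0, 1])  # typically > 0.9
```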
LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence
Optimizing deep neural networks is largely thought to be an empirical
process, requiring manual tuning of several hyper-parameters, such as learning
rate, weight decay, and dropout rate. Arguably, the learning rate is the most
important of these to tune, and this has gained more attention in recent works.
In this paper, we propose a novel method to compute the learning rate for
training deep neural networks with stochastic gradient descent. We first derive
a theoretical framework to compute learning rates dynamically based on the
Lipschitz constant of the loss function. We then extend this framework to other
commonly used optimization algorithms, such as gradient descent with momentum
and Adam. We run an extensive set of experiments that demonstrate the efficacy
of our approach on popular architectures and datasets, and show that commonly
used learning rates are an order of magnitude smaller than the ideal value.
Comment: v4; comparison studies added.
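To see the idea in a setting where the Lipschitz constant is exact rather than estimated, take least-squares regression: the gradient of the loss (1/2m)||Xw - y||^2 is Lipschitz with constant L = lambda_max(X^T X)/m, so eta = 1/L is a theoretically grounded step size. This convex sketch illustrates the principle only; the paper's derivations for deep networks and for momentum/Adam are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 20
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

# Lipschitz constant of the gradient of (1/2m)||Xw - y||^2.
L = np.linalg.eigvalsh(X.T @ X / m).max()
eta = 1.0 / L                            # computed, not hand-tuned

w = np.zeros(d)
for _ in range(200):
    w -= eta * (X.T @ (X @ w - y) / m)   # plain gradient descent
print("final MSE:", np.mean((X @ w - y) ** 2))
```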
Artificial neural networks condensation: A strategy to facilitate adaption of machine learning in medical settings by reducing computational burden
Machine Learning (ML) applications in healthcare can have a great impact on
people's lives, helping deliver better and more timely treatment to those in
need. At the same time, medical data is usually big and sparse, requiring
substantial computational resources. Although this might not be a problem for
the wide adoption of ML tools in developed nations, the availability of
computational resources can be limited in developing nations. This can prevent
less favored populations from benefiting from advances in ML applications for
healthcare. In
this project we explored methods to increase computational efficiency of ML
algorithms, in particular Artificial Neural Nets (NN), while not compromising
the accuracy of the predicted results. We used in-hospital mortality prediction
as our case analysis based on the MIMIC III publicly available dataset. We
explored three methods on two different NN architectures. We reduced the size
of a recurrent neural net (RNN) and a dense neural net (DNN) by pruning
"unused" neurons. Additionally, we modified the RNN structure by adding a
hidden layer to the LSTM cell, allowing the model to use fewer recurrent
layers. Finally, we implemented quantization on the DNN, forcing the weights
to be 8-bit instead of 32-bit. We found that all our methods increased
computational efficiency without compromising accuracy, and some even achieved
higher accuracy than the pre-condensed baseline models.
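Two of the condensation steps named above, magnitude pruning and 8-bit quantization, can be sketched in a few lines of NumPy. The threshold and the symmetric linear quantization scheme are illustrative choices, not the project's exact procedure.

```python
import numpy as np

def prune_small_weights(W, threshold=1e-2):
    # Zero weights whose magnitude suggests the connection is "unused".
    return np.where(np.abs(W) < threshold, 0.0, W)

def quantize_int8(W):
    # Map floats onto int8 with a single per-tensor scale factor.
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

W = np.random.default_rng(0).normal(scale=0.1, size=(64, 64))
W_pruned = prune_small_weights(W)
q, s = quantize_int8(W_pruned)
print("max dequantization error:",
      np.abs(q.astype(np.float32) * s - W_pruned).max())
```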
A Selective Overview of Deep Learning
Deep learning has arguably achieved tremendous success in recent years. In
simple words, deep learning uses the composition of many nonlinear functions to
model the complex dependency between input features and labels. While neural
networks have a long history, recent advances have greatly improved their
performance in computer vision, natural language processing, etc. From the
statistical and scientific perspective, it is natural to ask: What is deep
learning? What are the new characteristics of deep learning, compared with
classical methods? What are the theoretical foundations of deep learning? To
answer these questions, we introduce common neural network models (e.g.,
convolutional neural nets, recurrent neural nets, generative adversarial nets)
and training techniques (e.g., stochastic gradient descent, dropout, batch
normalization) from a statistical point of view. Along the way, we highlight
new characteristics of deep learning (including depth and over-parametrization)
and explain their practical and theoretical benefits. We also sample recent
results on theories of deep learning, many of which are only suggestive. While
a complete understanding of deep learning remains elusive, we hope that our
perspectives and discussions serve as a stimulus for new statistical research.
Towards the AlexNet Moment for Homomorphic Encryption: HCNN, the First Homomorphic CNN on Encrypted Data with GPUs
Deep Learning as a Service (DLaaS) stands as a promising solution for
cloud-based inference applications. In this setting, the cloud has a
pre-learned model whereas the user has samples on which she wants to run the
model. The biggest concern with DLaaS is user privacy if the input samples are
sensitive data. We provide here an efficient privacy-preserving system by
employing high-end technologies such as Fully Homomorphic Encryption (FHE),
Convolutional Neural Networks (CNNs) and Graphics Processing Units (GPUs). FHE,
with its widely-known feature of computing on encrypted data, empowers a wide
range of privacy-concerned applications. This comes at high cost as it requires
enormous computing power. In this paper, we show how to accelerate the
performance of running CNNs on encrypted data with GPUs. We evaluated two CNNs
to classify homomorphically the MNIST and CIFAR-10 datasets. Our solution
achieved a sufficient security level (> 80 bits) and reasonable classification
accuracy (99% and 77.55% for MNIST and CIFAR-10, respectively). In terms of
latency, we could classify an image in 5.16 seconds and 304.43 seconds for
MNIST and CIFAR-10, respectively. Our system can also classify a batch of
images (> 8,000) without extra overhead.
SADA: Semantic Adversarial Diagnostic Attacks for Autonomous Applications
One major factor impeding more widespread adoption of deep neural networks
(DNNs) is their lack of robustness, which is essential for safety-critical
applications such as autonomous driving. This has motivated much recent work on
adversarial attacks on DNNs, which mostly focus on pixel-level perturbations
devoid of semantic meaning. In contrast, we present a general framework for
adversarial attacks on trained agents, which covers semantic perturbations to
the environment of the agent performing the task as well as pixel-level
attacks. To do this, we re-frame the adversarial attack problem as learning a
distribution of parameters that always fools the agent. In the semantic case,
our proposed adversary (denoted as BBGAN) is trained to sample parameters that
describe the environment with which the black-box agent interacts, such that
the agent performs its dedicated task poorly in this environment. We apply
BBGAN on three different tasks, primarily targeting aspects of autonomous
navigation: object detection, self-driving, and autonomous UAV racing. On these
tasks, BBGAN can generate failure cases that consistently fool a trained agent.
Comment: Accepted at AAAI'20.
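The black-box attack loop can be illustrated without a full GAN. In the sketch below, BBGAN's generator is replaced by a diagonal Gaussian refit, cross-entropy-method style, to the environment parameters on which a black-box agent scores worst; `agent_score` and the three-dimensional parameter space are hypothetical stand-ins, not the paper's tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

def agent_score(params):
    # Hypothetical black-box evaluator: the agent does fine except near one
    # parameter setting (say, a sun angle that blinds its camera).
    return np.sum((params - np.array([0.8, -0.3, 0.5])) ** 2, axis=1)

mu, sigma = np.zeros(3), np.ones(3)       # adversary's parameter distribution
for _ in range(50):
    cand = mu + sigma * rng.normal(size=(256, 3))
    worst = cand[np.argsort(agent_score(cand))[:32]]  # most agent-fooling params
    mu, sigma = worst.mean(axis=0), worst.std(axis=0) + 1e-3
print("learned failure-mode parameters:", mu.round(2))  # ~[0.8, -0.3, 0.5]
```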