Occam's Gates
We present a complementary objective for training recurrent neural networks
(RNNs) with gating units that helps with regularization and interpretability of
the trained model. Attention-based RNN models have shown success in many
difficult sequence-to-sequence classification problems with long- and
short-term dependencies; however, these models are prone to overfitting. In
this paper, we describe how to regularize these models through an L1 penalty on
the activation of the gating units, and show that this technique reduces
overfitting on a variety of tasks while also providing a human-interpretable
visualization of the inputs used by the network. These tasks include sentiment
analysis, paraphrase recognition, and question answering.
Comment: In review at NIPS
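The regularizer itself is a one-line addition to the loss. Below is a minimal
PyTorch sketch of an L1 penalty on gate activations, using a hypothetical
gated-pooling layer; the architecture and the `l1_weight` coefficient are
illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class GatedPooling(nn.Module):
    """Scalar gate per timestep; an L1 penalty on the gate activations
    pushes the model to attend to few inputs (sparse, interpretable)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, states):                 # states: (batch, time, hidden)
        g = torch.sigmoid(self.gate(states))   # (batch, time, 1) in [0, 1]
        pooled = (g * states).sum(dim=1)       # gate-weighted sum over time
        return pooled, g

# Training step: add the L1 penalty on the gate activations to the task loss.
rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
pool = GatedPooling(64)
head = nn.Linear(64, 2)
criterion = nn.CrossEntropyLoss()
l1_weight = 1e-3                               # assumed regularization strength

x = torch.randn(8, 20, 32)
y = torch.randint(0, 2, (8,))
states, _ = rnn(x)
pooled, gates = pool(states)
loss = criterion(head(pooled), y) + l1_weight * gates.abs().mean()
loss.backward()
```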
Bayesian LSTMs in medicine
The medical field stands to see significant benefits from the recent advances
in deep learning. Knowing the uncertainty in the decision made by any machine
learning algorithm is of utmost importance for medical practitioners. This
study demonstrates the utility of using Bayesian LSTMs for classification of
medical time series. Four medical time series datasets are used to show the
accuracy improvement Bayesian LSTMs provide over standard LSTMs. Moreover, we
show cherry-picked examples of confident and uncertain classifications of the
medical time series. With simple modifications of common deep learning
practice, significant improvements can be made for the medical practitioner and
patient.
Comment: 11 pages, 8 figures
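The abstract does not specify the Bayesian approximation; a common choice for
Bayesian LSTMs is Monte Carlo dropout, sketched below under that assumption:
dropout is kept active at test time, and the predictive mean and spread are
computed over stochastic forward passes.

```python
import torch
import torch.nn as nn

class DropoutLSTMClassifier(nn.Module):
    def __init__(self, n_features, n_classes, hidden=64, p=0.3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.drop = nn.Dropout(p)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(self.drop(out[:, -1]))   # classify from last step

@torch.no_grad()
def mc_predict(model, x, n_samples=50):
    """Keep dropout stochastic at test time and average the predictive
    distribution; the spread across samples signals uncertainty."""
    model.train()   # enables dropout (gradients already disabled above)
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

model = DropoutLSTMClassifier(n_features=12, n_classes=4)
mean, std = mc_predict(model, torch.randn(2, 100, 12))
```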
Learning Intrinsic Sparse Structures within Long Short-Term Memory
Model compression is significant for the wide adoption of Recurrent Neural
Networks (RNNs) in both user devices possessing limited resources and business
clusters requiring quick responses to large-scale service requests. This work
aims to learn structurally-sparse Long Short-Term Memory (LSTM) by reducing the
sizes of basic structures within LSTM units, including input updates, gates,
hidden states, cell states and outputs. Independently reducing the sizes of
basic structures can result in inconsistent dimensions among them, and
consequently, end up with invalid LSTM units. To overcome the problem, we
propose Intrinsic Sparse Structures (ISS) in LSTMs. Removing a component of ISS
will simultaneously decrease the sizes of all basic structures by one and
thereby always maintain the dimension consistency. By learning ISS within LSTM
units, the obtained LSTMs remain regular while having much smaller basic
structures. Based on group Lasso regularization, our method achieves a 10.59x
speedup without any perplexity loss in language modeling on the Penn Treebank
dataset. It is also successfully evaluated through a compact model with only
2.69M weights for machine question answering on the SQuAD dataset. Our approach
is successfully extended to non-LSTM RNNs, such as Recurrent Highway Networks
(RHNs). Our source code is publicly available at
https://github.com/wenwei202/iss-rnns
Comment: Published in ICLR 2018 (the Sixth International Conference on
Learning Representations)
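A simplified sketch of the group-Lasso penalty over per-hidden-unit groups,
assuming PyTorch's LSTM weight layout; the paper's exact ISS grouping may
include further weights (e.g., output projections), so treat the grouping
below as an approximation.

```python
import torch
import torch.nn as nn

def iss_group_lasso(lstm: nn.LSTM, hidden: int) -> torch.Tensor:
    """Group-Lasso penalty where each group collects every weight tied to
    one hidden unit k: its four gate rows in W_ih and W_hh (rows k, k+h,
    k+2h, k+3h) plus column k of W_hh that feeds it back. Driving a whole
    group to zero removes unit k from all basic structures at once."""
    w_ih, w_hh = lstm.weight_ih_l0, lstm.weight_hh_l0   # (4h, in), (4h, h)
    penalty = 0.0
    for k in range(hidden):
        rows = [k + g * hidden for g in range(4)]       # same unit, 4 gates
        group = torch.cat([w_ih[rows].reshape(-1),
                           w_hh[rows].reshape(-1),
                           w_hh[:, k]])
        penalty = penalty + group.norm(p=2)
    return penalty

hidden = 64
lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
out, _ = lstm(torch.randn(8, 20, 32))
loss = out.pow(2).mean() + 1e-4 * iss_group_lasso(lstm, hidden)  # toy loss
loss.backward()
```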
Strongly-Typed Recurrent Neural Networks
Recurrent neural networks are increasingly popular models for sequential
learning. Unfortunately, although the most effective RNN architectures are
perhaps excessively complicated, extensive searches have not found simpler
alternatives. This paper imports ideas from physics and functional programming
into RNN design to provide guiding principles. From physics, we introduce type
constraints, analogous to the constraints that forbid adding meters to
seconds. From functional programming, we require that strongly-typed
architectures factorize into stateless learnware and state-dependent firmware,
reducing the impact of side-effects. The features learned by strongly-typed
nets have a simple semantic interpretation via dynamic average-pooling on
one-dimensional convolutions. We also show that strongly-typed gradients are
better behaved than in classical architectures, and characterize the
representational power of strongly-typed nets. Finally, experiments show that,
despite being more constrained, strongly-typed architectures achieve lower
training error and comparable generalization error to classical architectures.
Comment: 10 pages, final version, ICML 2016
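As a sketch, a strongly-typed vanilla RNN cell can be written so that both the
candidate and the gate are stateless functions of the current input, with the
hidden state entering only through elementwise mixing (the "dynamic average
pooling" reading); the tanh nonlinearity on the candidate is an illustrative
assumption.

```python
import torch
import torch.nn as nn

class TypedRNNCell(nn.Module):
    """Strongly-typed vanilla RNN: candidate z_t and gate f_t are both
    stateless functions of x_t; the hidden state only passes through
    elementwise (type-preserving) mixing."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.z = nn.Linear(input_size, hidden_size)   # learnware: no state
        self.f = nn.Linear(input_size, hidden_size)

    def forward(self, x_t, h_prev):                   # firmware: state mixing
        f_t = torch.sigmoid(self.f(x_t))
        return f_t * h_prev + (1.0 - f_t) * torch.tanh(self.z(x_t))

cell = TypedRNNCell(16, 32)
h = torch.zeros(4, 32)
for t in range(10):                                   # unroll over a sequence
    h = cell(torch.randn(4, 16), h)
```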
Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables
Human activity recognition (HAR) in ubiquitous computing is beginning to
adopt deep learning as a substitute for well-established analysis pipelines
that rely on hand-crafted feature extraction and classification. From
these isolated applications of custom deep architectures it is, however,
difficult to gain an overview of their suitability for problems ranging from
the recognition of manipulative gestures to the segmentation and identification
of physical activities like running or ascending stairs. In this paper we
rigorously explore deep, convolutional, and recurrent approaches across three
representative datasets that contain movement data captured with wearable
sensors. We describe how to train recurrent approaches in this setting,
introduce a novel regularisation approach, and illustrate how they outperform
the state-of-the-art on a large benchmark dataset. Across thousands of
recognition experiments with randomly sampled model configurations we
investigate the suitability of each model for different tasks in HAR, explore
the impact of hyperparameters using the fANOVA framework, and provide
guidelines for the practitioner who wants to apply deep learning in their
problem setting.
Comment: Extended version has been accepted for publication at the
International Joint Conference on Artificial Intelligence (IJCAI)
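For concreteness, a small convolutional-recurrent HAR model in PyTorch might
look like the sketch below; the layer sizes and the 9-channel, 128-sample
sensor window are illustrative, not the paper's exact configurations.

```python
import torch
import torch.nn as nn

class ConvLSTMHAR(nn.Module):
    """1-D convolutions extract local motion patterns from a sensor window;
    an LSTM models their temporal evolution; a linear head classifies the
    activity."""
    def __init__(self, n_channels=9, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 128, num_layers=2, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        feats = self.conv(x).transpose(1, 2)   # -> (batch, time, 64)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])           # classify from last timestep

model = ConvLSTMHAR()
logits = model(torch.randn(8, 9, 128))   # 128-sample window, 9 sensor axes
```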
Learning Sparse Structured Ensembles with SG-MCMC and Network Pruning
An ensemble of neural networks is known to be more robust and accurate than
an individual network, but usually at a linearly increased cost in both
training and testing. In this work, we propose a two-stage method to learn
Sparse Structured Ensembles (SSEs) for neural networks. In the first stage, we
run SG-MCMC with group sparse priors to draw an ensemble of samples from the
posterior distribution of network parameters. In the second stage, we apply
weight pruning to each sampled network and then retrain over the remaining
connections. In this way of learning SSEs with SG-MCMC and pruning, we
not only achieve high prediction accuracy since SG-MCMC enhances exploration of
the model-parameter space, but also reduce memory and computation cost
significantly in both training and testing of NN ensembles. This is thoroughly
evaluated in experiments learning SSEs of both FNNs and LSTMs.
For example, in LSTM-based language modeling (LM), we obtain a 21% relative
reduction in LM perplexity by learning an SSE of four large LSTM models, which
has only 30% of the model parameters and 70% of the computation in total,
compared to the baseline large LSTM LM. To the best of our knowledge, this work
represents the first methodology and empirical study integrating SG-MCMC, group
sparse priors and network pruning for learning NN ensembles.
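A toy sketch of the two stages, using SGLD (a basic SG-MCMC sampler) and
magnitude pruning on a stand-in model; the group-sparse prior and the
retraining pass are only indicated in comments, and all hyperparameters are
illustrative.

```python
import torch
import torch.nn as nn

def sgld_step(model, loss, lr):
    """One SGLD step: gradient descent on the negative log-posterior plus
    Gaussian noise scaled so the iterates sample from the posterior."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.add_(-lr * p.grad + torch.randn_like(p) * (2 * lr) ** 0.5)

def prune_(model, keep=0.3):
    """Stage two: keep the largest-magnitude weights, zero out the rest
    (retraining over the surviving connections would follow)."""
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                thresh = p.abs().flatten().kthvalue(
                    int((1 - keep) * p.numel())).values
                p.mul_((p.abs() > thresh).float())

net = nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
ensemble = []
for step in range(300):                      # stage one: posterior sampling
    loss = nn.functional.cross_entropy(net(x), y)
    # a group-sparse prior would add a group-Lasso term to this loss
    sgld_step(net, loss, lr=1e-3)
    if step % 100 == 99:                     # thin the chain into an ensemble
        ensemble.append({k: v.clone() for k, v in net.state_dict().items()})
prune_(net)                                  # prune (each sample, in full SSE)
```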
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
We propose zoneout, a novel method for regularizing RNNs. At each timestep,
zoneout stochastically forces some hidden units to maintain their previous
values. Like dropout, zoneout uses random noise to train a pseudo-ensemble,
improving generalization. But by preserving instead of dropping hidden units,
gradient information and state information are more readily propagated through
time, as in feedforward stochastic depth networks. We perform an empirical
investigation of various RNN regularizers, and find that zoneout gives
significant performance improvements across tasks. We achieve competitive
results with relatively simple models in character- and word-level language
modelling on the Penn Treebank and Text8 datasets, and combining with recurrent
batch normalization yields state-of-the-art results on permuted sequential
MNIST.
Comment: David Krueger and Tegan Maharaj contributed equally to this work
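Zoneout itself is a few lines: mix the previous and the new state with a
Bernoulli mask during training, and with the expectation at test time. The
sketch below applies it to a vanilla RNN hidden state; the paper also applies
it to LSTM cell and hidden states.

```python
import torch
import torch.nn as nn

def zoneout(h_prev, h_new, p=0.15, training=True):
    """Zoneout: each unit keeps its previous value with probability p.
    At test time, use the expectation (a convex mix of old and new)."""
    if training:
        keep = torch.bernoulli(torch.full_like(h_prev, p))
        return keep * h_prev + (1.0 - keep) * h_new
    return p * h_prev + (1.0 - p) * h_new

# Wrapping a plain RNN cell step with zoneout on the hidden state:
cell = nn.RNNCell(16, 32)
h = torch.zeros(4, 32)
for t in range(10):
    h = zoneout(h, cell(torch.randn(4, 16), h), p=0.15, training=True)
```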
On the State of the Art of Evaluation in Neural Language Models
Ongoing innovations in recurrent neural network architectures have provided a
steady influx of apparently state-of-the-art results on language modelling
benchmarks. However, these have been evaluated using differing code bases and
limited computational resources, which represent uncontrolled sources of
experimental variation. We reevaluate several popular architectures and
regularisation methods with large-scale automatic black-box hyperparameter
tuning and arrive at the somewhat surprising conclusion that standard LSTM
architectures, when properly regularised, outperform more recent models. We
establish a new state of the art on the Penn Treebank and Wikitext-2 corpora,
as well as strong baselines on the Hutter Prize dataset.
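The paper's tuner is a large-scale black-box service, but a toy stand-in such
as random search conveys the idea; the search space and the dummy objective
below are purely illustrative.

```python
import random

def sample_config():
    """Draw one hyperparameter configuration for an LSTM language model."""
    return {
        "hidden_size": random.choice([256, 512, 1024]),
        "dropout": random.uniform(0.0, 0.7),
        "learning_rate": 10 ** random.uniform(-4, -2),  # log-uniform
    }

def tune(train_and_eval, n_trials=100):
    """Black-box tuning loop: each trial trains a model under one sampled
    configuration and reports validation perplexity; keep the best."""
    best = (float("inf"), None)
    for _ in range(n_trials):
        cfg = sample_config()
        ppl = train_and_eval(cfg)            # user-supplied training routine
        best = min(best, (ppl, cfg), key=lambda t: t[0])
    return best

# Dummy stand-in objective so the sketch runs end-to-end; a real
# train_and_eval would train the LM under cfg and return valid perplexity.
best_ppl, best_cfg = tune(lambda cfg: random.uniform(50, 120), n_trials=20)
```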
IndyLSTMs: Independently Recurrent LSTMs
We introduce Independently Recurrent Long Short-term Memory cells: IndyLSTMs.
These differ from regular LSTM cells in that the recurrent weights are not
modeled as a full matrix but as a diagonal matrix, i.e., the output and state
of each LSTM cell depend on the inputs and its own output/state, as opposed to
the input and the outputs/states of all the cells in the layer. The number of
parameters per IndyLSTM layer, and thus the number of FLOPS per evaluation, is
linear in the number of nodes in the layer, as opposed to quadratic for regular
LSTM layers, resulting in potentially both smaller and faster models. We
evaluate their performance experimentally by training several models on the
popular IAM-OnDB and CASIA online handwriting datasets, as well as on several
of our in-house datasets. We show that IndyLSTMs, despite their smaller size,
consistently outperform regular LSTMs both in terms of accuracy per parameter,
and in best accuracy overall. We attribute this improved performance to the
IndyLSTMs being less prone to overfitting.
Comment: 8 pages, submitted to ICDAR 2019
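The diagonal recurrence amounts to replacing the matrix-vector product U·h
with an elementwise product u⊙h in every gate, as in this sketch (gate
ordering and initialization are illustrative).

```python
import torch
import torch.nn as nn

class IndyLSTMCell(nn.Module):
    """LSTM cell whose recurrent weights are diagonal: each unit sees only
    its own previous output (u * h instead of U @ h), so parameters per
    layer grow linearly in the number of units."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.w = nn.Linear(input_size, 4 * hidden_size)       # input weights
        self.u = nn.Parameter(torch.randn(4, hidden_size) * 0.1)  # diagonals

    def forward(self, x_t, state):
        h, c = state
        gx = self.w(x_t).chunk(4, dim=-1)          # per-gate input terms
        i = torch.sigmoid(gx[0] + self.u[0] * h)   # elementwise recurrence
        f = torch.sigmoid(gx[1] + self.u[1] * h)
        g = torch.tanh(gx[2] + self.u[2] * h)
        o = torch.sigmoid(gx[3] + self.u[3] * h)
        c = f * c + i * g
        return o * torch.tanh(c), c

cell = IndyLSTMCell(16, 32)
h = c = torch.zeros(4, 32)
for t in range(10):
    h, c = cell(torch.randn(4, 16), (h, c))
```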
A Simple LSTM model for Transition-based Dependency Parsing
We present a simple LSTM-based transition-based dependency parser. Our model
is composed of a single LSTM hidden layer replacing the hidden layer in the
usual feed-forward network architecture. We also propose a new initialization
method that uses the pre-trained weights from a feed-forward neural network to
initialize our LSTM-based model. We further show that using dropout on the
input layer has a positive effect on performance. Our final parser achieves a
93.06% unlabeled and 91.01% labeled attachment score on the Penn Treebank. We
additionally replace LSTMs with GRUs and Elman units in our model and explore
the effectiveness of our initialization method on the individual gates
constituting all three types of RNN units.
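The abstract leaves the initialization details open; one plausible reading,
sketched below with hypothetical dimensions, copies the pre-trained
feed-forward hidden-layer weights into each of the LSTM's input-to-hidden gate
blocks.

```python
import torch
import torch.nn as nn

# Hypothetical: a feed-forward parser's hidden layer, already trained.
ff_hidden = nn.Linear(in_features=200, out_features=128)

# LSTM that replaces that hidden layer in the transition-based parser.
lstm = nn.LSTM(input_size=200, hidden_size=128, batch_first=True)

with torch.no_grad():
    # Copy the pre-trained weights into each of the LSTM's four input-to-
    # hidden gate blocks (rows are ordered i, f, g, o in PyTorch); how the
    # paper maps FF weights onto gates is an assumption here.
    for gate in range(4):
        rows = slice(gate * 128, (gate + 1) * 128)
        lstm.weight_ih_l0[rows].copy_(ff_hidden.weight)
        lstm.bias_ih_l0[rows].copy_(ff_hidden.bias)
```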