Approximating Real-Time Recurrent Learning with Random Kronecker Factors
Despite all the impressive advances of recurrent neural networks, sequential
data is still in need of better modelling. Truncated backpropagation through
time (TBPTT), the learning algorithm most widely used in practice, suffers from
the truncation bias, which drastically limits its ability to learn long-term
dependencies. The Real-Time Recurrent Learning algorithm (RTRL) addresses this
issue, but its high computational requirements make it infeasible in practice.
The Unbiased Online Recurrent Optimization algorithm (UORO) approximates RTRL
with a smaller runtime and memory cost, but with the disadvantage of obtaining
noisy gradients that also limit its practical applicability. In this paper we
propose the Kronecker Factored RTRL (KF-RTRL) algorithm that uses a Kronecker
product decomposition to approximate the gradients for a large class of RNNs.
We show that KF-RTRL is an unbiased and memory-efficient online learning
algorithm. Our theoretical analysis shows that, under reasonable assumptions,
the noise introduced by our algorithm is not only stable over time but also
asymptotically much smaller than that of the UORO algorithm. We also confirm
these theoretical results experimentally. Further, we show empirically that the
KF-RTRL algorithm captures long-term dependencies and almost matches the
performance of TBPTT on real-world tasks by training Recurrent Highway Networks
on a synthetic string memorization task and on the Penn TreeBank task,
respectively. These results indicate that RTRL-based approaches might be a
promising future alternative to TBPTT.
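A minimal numerical sketch (not the authors' implementation) of why a Kronecker-factored approximation is memory efficient: the full RTRL influence matrix dh_t/dW has n x (n*m) entries, whereas the factored form stores only a length-m vector and an n x n matrix. The shapes below are illustrative, not the paper's exact parameterization.

```python
import numpy as np

n, m = 128, 256            # hidden size and input-plus-hidden size (illustrative)
u = np.random.randn(m)     # Kronecker factor 1: a vector
A = np.random.randn(n, n)  # Kronecker factor 2: an n x n matrix

# Full influence matrix that the factors approximate, shape (n, n*m).
G_full = np.kron(u[None, :], A)
print(G_full.shape)                        # (128, 32768)

# Memory footprint: factored representation vs dense matrix.
print(u.size + A.size, "vs", G_full.size)  # 16640 vs 4194304 entries
```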
High Order Recurrent Neural Networks for Acoustic Modelling
Vanishing long-term gradients are a major issue in training standard
recurrent neural networks (RNNs), which can be alleviated by long short-term
memory (LSTM) models with memory cells. However, the extra parameters
associated with the memory cells mean an LSTM layer has four times as many
parameters as an RNN with the same hidden vector size. This paper addresses the
vanishing gradient problem using a high order RNN (HORNN) which has additional
connections from multiple previous time steps. Speech recognition experiments
using British English multi-genre broadcast (MGB3) data showed that the
proposed HORNN architectures for rectified linear unit and sigmoid activation
functions reduced word error rates (WER) by 4.2% and 6.3% over the
corresponding RNNs, and gave similar WERs to a (projected) LSTM while using
only 20%--50% of the recurrent layer parameters and computation. Comment: 5 pages, 2 figures, 2 tables, to appear in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP 2018).
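A minimal sketch of a high order recurrent update of the general form described above, with recurrent connections from several previous time steps. The order, ReLU activation, and weight shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def hornn_step(x_t, past_states, W_in, U_list, b):
    """One HORNN update: h_t = relu(W_in x_t + sum_k U_k h_{t-k} + b)."""
    pre = W_in @ x_t + b
    for U_k, h_prev in zip(U_list, past_states):   # k = 1 .. order
        pre += U_k @ h_prev
    return np.maximum(pre, 0.0)                    # ReLU activation

order, d_in, d_h = 3, 40, 64
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(d_h, d_in))
U_list = [rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(order)]
b = np.zeros(d_h)

history = [np.zeros(d_h) for _ in range(order)]    # h_{t-1}, ..., h_{t-order}
for x_t in rng.normal(size=(10, d_in)):            # toy input sequence
    h_t = hornn_step(x_t, history, W_in, U_list, b)
    history = [h_t] + history[:-1]                 # shift the state history
```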
Strongly-Typed Recurrent Neural Networks
Recurrent neural networks are increasing popular models for sequential
learning. Unfortunately, although the most effective RNN architectures are
perhaps excessively complicated, extensive searches have not found simpler
alternatives. This paper imports ideas from physics and functional programming
into RNN design to provide guiding principles. From physics, we introduce type
constraints, analogous to the constraints that forbid adding meters to
seconds. From functional programming, we require that strongly-typed
architectures factorize into stateless learnware and state-dependent firmware,
reducing the impact of side-effects. The features learned by strongly-typed
nets have a simple semantic interpretation via dynamic average-pooling on
one-dimensional convolutions. We also show that strongly-typed gradients are
better behaved than in classical architectures, and characterize the
representational power of strongly-typed nets. Finally, experiments show that,
despite being more constrained, strongly-typed architectures achieve lower
training error and comparable generalization error to classical architectures.
Comment: 10 pages, final version, ICML 2016.
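A minimal sketch of a strongly-typed recurrent update in the spirit of this abstract: the candidate features depend only on the current input (the stateless "learnware"), while the previous state enters only through a gate that performs a dynamic running average (the state-dependent "firmware"). The concrete equations below are an assumption, not necessarily the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def typed_rnn_step(x_t, h_prev, W, V, b):
    z_t = W @ x_t                             # stateless features of the current input only
    f_t = sigmoid(V @ x_t + b)                # gate, also a function of x_t only
    return f_t * h_prev + (1.0 - f_t) * z_t   # dynamic average-pooling of the features

d_in, d_h = 16, 32
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(d_h, d_in))
V = rng.normal(scale=0.1, size=(d_h, d_in))
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):       # toy input sequence
    h = typed_rnn_step(x_t, h, W, V, b)
```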
Kronecker Recurrent Units
Our work addresses two important issues with recurrent neural networks: (1)
they are over-parameterized, and (2) the recurrence matrix is ill-conditioned.
The former increases the sample complexity of learning and the training time.
The latter causes the vanishing and exploding gradient problem. We present a
flexible recurrent neural network model called Kronecker Recurrent Units (KRU).
KRU achieves parameter efficiency in RNNs through a Kronecker factored
recurrent matrix. It overcomes the ill-conditioning of the recurrent matrix by
enforcing soft unitary constraints on the factors. Thanks to the small
dimensionality of the factors, maintaining these constraints is computationally
efficient. Our experimental results on seven standard datasets reveal that KRU
can reduce the number of parameters in the recurrent weight matrix by three
orders of magnitude compared to existing recurrent models, without sacrificing
statistical performance. In particular, these results show that while there are
advantages to having a high-dimensional recurrent space, the capacity of the
recurrent part of the model can be dramatically reduced.
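A minimal sketch of the parameter saving behind a Kronecker-factored recurrent matrix, together with one possible soft unitary penalty on each small factor. The factor sizes and the penalty form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
factors = [rng.normal(scale=0.5, size=(4, 4)) for _ in range(4)]  # four small 4x4 factors

# The implicit recurrent matrix is the Kronecker product of the factors.
W = factors[0]
for F in factors[1:]:
    W = np.kron(W, F)
print(W.shape)                                     # (256, 256)
print(sum(F.size for F in factors), "vs", W.size)  # 64 parameters vs 65536

def soft_unitary_penalty(F):
    """Penalty pushing a small factor towards unitarity: ||F^T F - I||_F^2."""
    d = F.shape[0]
    R = F.T @ F - np.eye(d)
    return np.sum(R * R)

loss_reg = sum(soft_unitary_penalty(F) for F in factors)  # add to the training loss
```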
Complex Gated Recurrent Neural Networks
Complex numbers have long been favoured for digital signal processing, yet
complex representations rarely appear in deep learning architectures. RNNs,
widely used to process time series and sequence information, could greatly
benefit from complex representations. We present a novel complex gated
recurrent cell, which is a hybrid cell combining complex-valued and
norm-preserving state transitions with a gating mechanism. The resulting RNN
exhibits excellent stability and convergence properties and performs
competitively on the synthetic memory and adding tasks, as well as on the
real-world task of human motion prediction.
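A minimal, illustrative sketch of a complex-valued, norm-preserving state transition combined with a gate. The concrete cell below (a diagonal unitary recurrence plus a real sigmoid gate) is an assumption used for illustration, not the paper's exact architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 8, 16
rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=d_h)      # phases of a diagonal unitary matrix
U = np.exp(1j * theta)                           # |U_k| = 1, so the recurrence preserves norms
V = (rng.normal(size=(d_h, d_in)) + 1j * rng.normal(size=(d_h, d_in))) * 0.1
Wg = rng.normal(scale=0.1, size=(d_h, d_in))
bg = np.zeros(d_h)

h = np.zeros(d_h, dtype=complex)
for x_t in rng.normal(size=(12, d_in)):          # toy real-valued input sequence
    candidate = U * h + V @ x_t                  # norm-preserving recurrence + complex input map
    g = sigmoid(Wg @ x_t + bg)                   # real-valued gate
    h = g * candidate + (1.0 - g) * h            # gated update of the complex state
```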
NEUZZ: Efficient Fuzzing with Neural Program Smoothing
Fuzzing has become the de facto standard technique for finding software
vulnerabilities. However, even state-of-the-art fuzzers are not very efficient
at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary
guidance to generate inputs that can trigger different bugs. Such evolutionary
algorithms, while fast and simple to implement, often get stuck in fruitless
sequences of random mutations. Gradient-guided optimization presents a
promising alternative to evolutionary guidance. Gradient-guided techniques have
been shown to significantly outperform evolutionary algorithms at solving
high-dimensional structured optimization problems in domains like machine
learning by efficiently utilizing gradients or higher-order derivatives of the
underlying function. However, gradient-guided approaches are not directly
applicable to fuzzing as real-world program behaviors contain many
discontinuities, plateaus, and ridges where the gradient-based methods often
get stuck. We observe that this problem can be addressed by creating a smooth
surrogate function approximating the discrete branching behavior of the target
program. In this paper, we propose a novel program smoothing technique using
surrogate neural network models that can incrementally learn smooth
approximations of a complex, real-world program's branching behaviors. We
further demonstrate that such neural network models can be used together with
gradient-guided input generation schemes to significantly improve the fuzzing
efficiency. Our extensive evaluations demonstrate that NEUZZ significantly
outperforms 10 state-of-the-art graybox fuzzers on 10 real-world programs, both
at finding new bugs and at achieving higher edge coverage. NEUZZ found 31 unknown
bugs that other fuzzers failed to find in 10 real-world programs and achieved
3X more edge coverage than all of the tested graybox fuzzers over 24 hours of
running. Comment: To appear in the 40th IEEE Symposium on Security and Privacy,
May 20--22, 2019, San Francisco, CA, USA.
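A minimal sketch of the program-smoothing idea: a small network acts as a smooth surrogate that predicts an edge-coverage bitmap from input bytes, and its input gradients indicate which bytes to mutate. The model, shapes, and mutation rule below are illustrative assumptions, not NEUZZ's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_bytes, n_edges, hidden = 64, 128, 256
rng = np.random.default_rng(4)
W1 = rng.normal(scale=0.05, size=(hidden, n_bytes))
W2 = rng.normal(scale=0.05, size=(n_edges, hidden))

def surrogate(x):
    """Smooth surrogate: input bytes (scaled to [0, 1]) -> edge hit probabilities."""
    h = np.maximum(W1 @ x, 0.0)
    return sigmoid(W2 @ h), h

def byte_gradient(x, edge_id):
    """d p(edge) / d x, obtained by backpropagating through the surrogate."""
    p, h = surrogate(x)
    d_h = W2[edge_id] * p[edge_id] * (1.0 - p[edge_id])
    return (d_h * (h > 0)) @ W1

seed = rng.integers(0, 256, size=n_bytes) / 255.0    # an existing test input (toy)
grad = byte_gradient(seed, edge_id=7)                # target a rarely hit edge
hot = np.argsort(-np.abs(grad))[:4]                  # most influential byte positions
mutant = seed.copy()
mutant[hot] = np.clip(mutant[hot] + np.sign(grad[hot]) * (8 / 255.0), 0.0, 1.0)
```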
The unreasonable effectiveness of the forget gate
Given the success of the gated recurrent unit, a natural question is whether
all the gates of the long short-term memory (LSTM) network are necessary.
Previous research has shown that the forget gate is one of the most important
gates in the LSTM. Here we show that a forget-gate-only version of the LSTM
with chrono-initialized biases not only provides computational savings but also
outperforms the standard LSTM on multiple benchmark datasets and competes with
some of the best contemporary models. Our proposed network, the JANET, achieves
accuracies of 99% and 92.5% on the MNIST and pMNIST datasets, outperforming the
standard LSTM, which yields accuracies of 98.5% and 91%. Comment: Corrected LSTM
gradient derivations. Added link to code.
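A minimal sketch of a forget-gate-only cell with chrono-initialized biases, following the description above; the exact update equations are an assumption based on that description rather than a copy of the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h, T_max = 10, 32, 784                   # T_max ~ expected sequence length
rng = np.random.default_rng(5)
Uf = rng.normal(scale=0.1, size=(d_h, d_in))
Wf = rng.normal(scale=0.1, size=(d_h, d_h))
Uc = rng.normal(scale=0.1, size=(d_h, d_in))
Wc = rng.normal(scale=0.1, size=(d_h, d_h))
b_f = np.log(rng.uniform(1.0, T_max - 1.0, size=d_h))  # chrono initialization of the bias
b_c = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):          # toy input sequence
    f = sigmoid(Uf @ x_t + Wf @ h + b_f)         # the only gate
    cand = np.tanh(Uc @ x_t + Wc @ h + b_c)
    h = f * h + (1.0 - f) * cand                 # forget-gate-only state update
```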
Rotational Unit of Memory
The concepts of unitary evolution matrices and associative memory have
boosted the field of Recurrent Neural Networks (RNN) to state-of-the-art
performance in a variety of sequential tasks. However, RNNs still have a limited
capacity to manipulate long-term memory. To bypass this weakness, the most
successful applications of RNNs use external techniques such as attention
mechanisms. In this paper we propose a novel RNN model that unifies the
state-of-the-art approaches: Rotational Unit of Memory (RUM). The core of RUM
is its rotational operation, which is, naturally, a unitary matrix, providing
architectures with the power to learn long-term dependencies by overcoming the
vanishing and exploding gradients problem. Moreover, the rotational unit also
serves as associative memory. We evaluate our model on synthetic memorization,
question answering and language modeling tasks. RUM learns the Copying Memory
task completely and improves the state-of-the-art result in the Recall task.
RUM's performance on the bAbI Question Answering task is comparable to that of
models with attention mechanisms. We also improve the state-of-the-art result to
1.189 bits-per-character (BPC) loss on the Character Level Penn Treebank (PTB)
task, which demonstrates the applicability of RUM to real-world sequential
data. The universality of our construction, at the core of RNNs, establishes RUM
as a promising approach to language modeling, speech recognition and machine
translation.
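A minimal sketch of the rotational idea at the core of RUM: an orthogonal (hence norm-preserving) matrix that rotates one vector towards another within the plane they span, applied here to a hidden state. How RUM parameterizes and gates this operation is omitted; the construction below is a generic plane rotation.

```python
import numpy as np

def rotation(a, b):
    """Orthogonal matrix rotating a towards b in span{a, b}, identity elsewhere."""
    u = a / np.linalg.norm(a)
    v = b - (u @ b) * u                      # component of b orthogonal to a
    v = v / np.linalg.norm(v)
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sin = np.sqrt(max(0.0, 1.0 - cos * cos))
    P = np.outer(u, u) + np.outer(v, v)      # projector onto the plane of rotation
    G = cos * P + sin * (np.outer(v, u) - np.outer(u, v))
    return np.eye(len(a)) - P + G            # identity off-plane, rotation in-plane

rng = np.random.default_rng(6)
h = rng.normal(size=32)                      # hidden state
a, b = rng.normal(size=32), rng.normal(size=32)
R = rotation(a, b)
h_new = R @ h
print(np.allclose(np.linalg.norm(h_new), np.linalg.norm(h)))  # True: norm preserved
```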
Low-pass Recurrent Neural Networks - A memory architecture for longer-term correlation discovery
Reinforcement learning (RL) agents performing complex tasks must be able to
remember observations and actions across sizable time intervals. This is
especially true during the initial learning stages, when exploratory behaviour
can increase the delay between specific actions and their effects. Many new or
popular approaches for learning these distant correlations employ
backpropagation through time (BPTT), but this technique requires storing
observation traces long enough to span the interval between cause and effect.
Besides memory demands, learning dynamics like vanishing gradients and slow
convergence due to infrequent weight updates can reduce BPTT's practicality;
meanwhile, although online recurrent network learning is a developing topic,
most approaches are not efficient enough to use as replacements. We propose a
simple, effective memory strategy that can extend the window over which BPTT
can learn without requiring longer traces. We explore this approach empirically
on a few tasks and discuss its implications.
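A minimal, illustrative sketch of one low-pass memory strategy suggested by the title: a bank of exponential moving averages of past observations at several timescales, concatenated with the current observation so that information older than the BPTT window remains visible. The mechanism actually used in the paper may differ.

```python
import numpy as np

class LowPassMemory:
    def __init__(self, d_obs, decays=(0.5, 0.9, 0.99)):
        self.decays = np.asarray(decays)                  # one timescale per filter
        self.state = np.zeros((len(decays), d_obs))

    def update(self, obs):
        # Each row is a low-pass filtered (exponentially averaged) observation trace.
        self.state = (self.decays[:, None] * self.state
                      + (1 - self.decays[:, None]) * obs)
        return np.concatenate([obs, self.state.ravel()])  # augmented observation

mem = LowPassMemory(d_obs=4)
rng = np.random.default_rng(7)
for obs in rng.normal(size=(100, 4)):
    augmented = mem.update(obs)                           # feed this to the RNN/agent
```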
Stable Recurrent Models
Stability is a fundamental property of dynamical systems, yet to date it has
had little bearing on the practice of recurrent neural networks. In this
work, we conduct a thorough investigation of stable recurrent models.
Theoretically, we prove that stable recurrent neural networks are well approximated
by feed-forward networks for the purpose of both inference and training by
gradient descent. Empirically, we demonstrate stable recurrent models often
perform as well as their unstable counterparts on benchmark sequence tasks.
Taken together, these findings shed light on the effective power of recurrent
networks and suggest much of sequence learning happens, or can be made to
happen, in the stable regime. Moreover, our results help to explain why in many
cases practitioners succeed in replacing recurrent models by feed-forward
models. Comment: To appear in ICLR 2019. This paper was previously titled "When
Recurrent Models Don't Need to Be Recurrent." The current version subsumes
all previous versions.
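A minimal sketch of one common way to keep a recurrent model in the stable regime: after each gradient step, rescale the recurrent weight matrix so that its spectral norm stays below 1. Whether this matches the paper's exact training procedure is an assumption.

```python
import numpy as np

def project_to_stable(W, max_norm=0.99):
    """Rescale W so that its largest singular value is at most max_norm."""
    sigma_max = np.linalg.norm(W, ord=2)        # spectral norm of the recurrent weights
    if sigma_max > max_norm:
        W = W * (max_norm / sigma_max)
    return W

rng = np.random.default_rng(8)
W = rng.normal(scale=0.5, size=(64, 64))        # recurrent weights after a gradient step
W = project_to_stable(W)
print(np.linalg.norm(W, ord=2) <= 0.99 + 1e-9)  # True: the linearized recurrence is stable
```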