13 research outputs found
Fast Parametric Learning with Activation Memorization
Neural networks trained with backpropagation often struggle to identify
classes that have been observed a small number of times. In applications where
most class labels are rare, such as language modelling, this can become a
performance bottleneck. One potential remedy is to augment the network with a
fast-learning non-parametric model which stores recent activations and class
labels into an external memory. We explore a simplified architecture where we
treat a subset of the model parameters as fast memory stores. This can help
retain information over longer time intervals than a traditional memory, and
does not require additional space or compute. In the case of image
classification, we display faster binding of novel classes on an Omniglot image
curriculum task. We also show improved performance for word-based language
models on news reports (GigaWord), books (Project Gutenberg) and Wikipedia
articles (WikiText-103) --- the latter achieving a state-of-the-art perplexity
of 29.2.
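A minimal sketch of the core idea, assuming a PyTorch setting; the mixing rate `lam` and all names are illustrative, not the paper's:

```python
import torch

# Treat the output-embedding rows of a softmax classifier as fast memory:
# besides the usual gradient step, write the current hidden activation
# directly into the observed class's row. `lam` is an assumed mixing rate.
vocab_size, hidden_dim, lam = 10_000, 256, 0.5
output_embedding = 0.01 * torch.randn(vocab_size, hidden_dim)

def memorize_activation(h: torch.Tensor, target: int) -> None:
    """One-shot binding: interpolate the target row toward the activation."""
    with torch.no_grad():
        output_embedding[target] = lam * output_embedding[target] + (1 - lam) * h

h = torch.randn(hidden_dim)          # hidden state for the current token
memorize_activation(h, target=4242)  # next step's logits: output_embedding @ h
```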
Universal Language Model Fine-Tuning with Subword Tokenization for Polish
Universal Language Model Fine-tuning (ULMFiT) [arXiv:1801.06146] is one
of the first NLP methods for efficient inductive transfer learning.
Unsupervised pretraining results in improvements on many NLP tasks for English.
In this paper, we describe a new method that uses subword tokenization to adapt
ULMFiT to languages with high inflection. Our approach results in a new
state-of-the-art for the Polish language, taking first place in Task 3 of
PolEval'18. After further training, our final model outperformed the second
best model by 35%. We have open-sourced our pretrained models and code.
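The abstract does not spell out the tokenizer; as a hedged sketch, this is how a SentencePiece unigram model, one common subword choice for morphologically rich languages, could be trained (the corpus path and vocabulary size are assumptions):

```python
import sentencepiece as spm

# Train a subword vocabulary on a hypothetical Polish corpus
# (one sentence per line); all settings here are illustrative.
spm.SentencePieceTrainer.train(
    input="polish_corpus.txt",
    model_prefix="pl_subword",
    vocab_size=25_000,
    model_type="unigram",  # subword units cope with rich inflection
)

sp = spm.SentencePieceProcessor(model_file="pl_subword.model")
print(sp.encode("Przykładowe zdanie po polsku.", out_type=str))
```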
Improved Language Modeling by Decoding the Past
Highly regularized LSTMs achieve impressive results on several benchmark
datasets in language modeling. We propose a new regularization method based on
decoding the last token in the context using the predicted distribution of the
next token. This biases the model towards retaining more contextual
information, in turn improving its ability to predict the next token. With
negligible overhead in the number of parameters and training time, our Past
Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on
the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax.
We also show gains by using PDR in combination with a mixture-of-softmaxes,
achieving a word level perplexity of 53.8 and 60.5 on these datasets. In
addition, our method achieves 1.169 bits-per-character on the Penn Treebank
Character dataset for character level language modeling. These results
constitute a new state-of-the-art in their respective settings.
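A minimal sketch of a past-decoding auxiliary loss in the spirit of PDR, assuming a simple linear decoder over the predicted distribution; the decoder, the small vocabulary, and the weight `pdr_weight` are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, pdr_weight = 1_000, 0.1
past_decoder = nn.Linear(vocab_size, vocab_size, bias=False)

def pdr_loss(next_token_logits: torch.Tensor, past_tokens: torch.Tensor) -> torch.Tensor:
    """Decode the last context token from the predicted next-token
    distribution; failing to recover it is penalized."""
    next_dist = F.softmax(next_token_logits, dim=-1)  # (batch, vocab)
    past_logits = past_decoder(next_dist)             # (batch, vocab)
    return pdr_weight * F.cross_entropy(past_logits, past_tokens)

logits = torch.randn(8, vocab_size)         # model's next-token logits
targets = torch.randint(0, vocab_size, (8,))
past = torch.randint(0, vocab_size, (8,))   # last token of each context
loss = F.cross_entropy(logits, targets) + pdr_loss(logits, past)
```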
Metalearning with Hebbian Fast Weights
We unify recent neural approaches to one-shot learning with older ideas of
associative memory in a model for metalearning. Our model learns jointly to
represent data and to bind class labels to representations in a single shot. It
builds representations via slow weights, learned across tasks through SGD,
while fast weights constructed by a Hebbian learning rule implement one-shot
binding for each new task. On the Omniglot, Mini-ImageNet, and Penn Treebank
one-shot learning benchmarks, our model achieves state-of-the-art results.
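A minimal sketch of the fast-weight binding step, assuming slow weights have already produced an embedding `z`; shapes and the Hebbian rate `eta` are illustrative:

```python
import torch
import torch.nn.functional as F

n_classes, feat_dim, eta = 5, 64, 1.0
fast_W = torch.zeros(n_classes, feat_dim)  # reset at the start of each task

def bind(label: int, z: torch.Tensor) -> None:
    """Hebbian outer-product update: associate a class unit with the
    slow-weight embedding of its single support example."""
    fast_W[label] += eta * z

def classify(z: torch.Tensor) -> int:
    return int(torch.argmax(fast_W @ z))

# One-shot episode: bind each class from one example, then query.
support = F.normalize(torch.randn(n_classes, feat_dim), dim=-1)
for c in range(n_classes):
    bind(c, support[c])
print(classify(support[2]))  # -> 2 with high probability
```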
Adaptive Input Representations for Neural Language Modeling
We introduce adaptive input representations for neural language modeling
which extend the adaptive softmax of Grave et al. (2017) to input
representations of variable capacity. There are several choices for how to
factorize the input and output layers, and whether to model words, characters,
or sub-word units. We perform a systematic comparison of popular choices for a
self-attentional architecture. Our experiments show that models equipped with
adaptive embeddings are more than twice as fast to train as the popular
character-input CNN while having fewer parameters. On the
WikiText-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5
perplexity compared to the previously best published result and on the Billion
Word benchmark, we achieve 23.02 perplexity.
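A hedged sketch of the input side: frequency-sorted vocabulary bands get progressively narrower embeddings, all projected to a shared model dimension (band cutoffs and sizes below are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Variable-capacity embeddings: wide for frequent words, narrow for rare."""
    def __init__(self, cutoffs=(2_000, 20_000, 200_000),
                 dims=(512, 128, 32), d_model=512):
        super().__init__()
        self.d_model = d_model
        self.starts = (0,) + cutoffs[:-1]
        self.ends = cutoffs
        self.bands = nn.ModuleList(
            nn.Sequential(nn.Embedding(end - start, d),
                          nn.Linear(d, d_model, bias=False))
            for start, end, d in zip(self.starts, self.ends, dims)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out = torch.zeros(*tokens.shape, self.d_model)
        for start, end, band in zip(self.starts, self.ends, self.bands):
            mask = (tokens >= start) & (tokens < end)
            if mask.any():
                out[mask] = band(tokens[mask] - start)
        return out

emb = AdaptiveInput()
tokens = torch.randint(0, 200_000, (4, 16))  # ids sorted by frequency
print(emb(tokens).shape)                     # torch.Size([4, 16, 512])
```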
Understanding Recurrent Neural Architectures by Analyzing and Synthesizing Long Distance Dependencies in Benchmark Sequential Datasets
In order to build efficient deep recurrent neural architectures, it is
essential to analyze the complexity of long distance dependencies (LDDs) of
the dataset being modeled. In this context, in this paper, we present detailed
analysis of the complexity and the degree of LDDs (or LDD characteristics)
exhibited by various sequential benchmark datasets. We observe that the datasets
sampled from a similar process or task (e.g. natural language, or sequential
MNIST, etc.) display similar LDD characteristics. Upon analysing the LDD
characteristics, we were able to analyze the factors influencing them, such
as (i) number of unique symbols in a dataset, (ii) size of the dataset, (iii)
number of interacting symbols within a given LDD, and (iv) the distance between
the interacting symbols. We demonstrate that analysing LDD characteristics can
inform the selection of optimal hyper-parameters for SOTA deep recurrent neural
architectures. This analysis can directly contribute to the development of more
accurate and efficient sequential models. We also introduce the use of
Strictly k-Piecewise languages as a process to generate synthesized datasets
for language modelling. The advantage of these synthesized datasets is that they
enable targeted testing of deep recurrent neural architectures in terms of their
ability to model LDDs with different characteristics. Moreover, using a variety
of Strictly k-Piecewise languages we generate a number of new benchmarking
datasets, and analyse the performance of a number of SOTA recurrent
architectures on these new benchmarks.
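As a hedged illustration of the dataset-generation idea: a Strictly 2-Piecewise language is defined by forbidding certain length-2 subsequences, and because subsequences (unlike substrings) may span arbitrary distances, the forbidden pairs induce controllable long distance dependencies. The alphabet and forbidden pairs below are invented for the example:

```python
import random
from itertools import combinations

alphabet = "abcd"
forbidden = {("a", "d"), ("c", "b")}  # 'a' may never precede 'd', etc.

def grammatical(s: str) -> bool:
    # A string is in the language iff no forbidden pair occurs as a
    # subsequence, at any distance.
    return not any(pair in forbidden for pair in combinations(s, 2))

def sample(length: int) -> str:
    while True:  # rejection sampling; fine for short strings
        s = "".join(random.choices(alphabet, k=length))
        if grammatical(s):
            return s

print([sample(8) for _ in range(3)])
```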
Enabling Continual Learning with Differentiable Hebbian Plasticity
Continual learning is the problem of sequentially learning new tasks or
knowledge while protecting previously acquired knowledge. However, catastrophic
forgetting poses a grand challenge for neural networks performing such a
learning process. Thus, neural networks that are deployed in the real world often
struggle in scenarios where the data distribution is non-stationary (concept
drift), imbalanced, or not always fully available, i.e., rare edge cases. We
propose a Differentiable Hebbian Consolidation model which is composed of a
Differentiable Hebbian Plasticity (DHP) Softmax layer that adds a rapid
learning plastic component (compressed episodic memory) to the fixed (slow
changing) parameters of the softmax output layer, enabling learned
representations to be retained for a longer timescale. We demonstrate the
flexibility of our method by integrating well-known task-specific synaptic
consolidation methods to penalize changes in the slow weights that are
important for each target task. We evaluate our approach on the Permuted MNIST,
Split MNIST and Vision Datasets Mixture benchmarks, and introduce an imbalanced
variant of Permuted MNIST -- a dataset that combines the challenges of class
imbalance and concept drift. Our proposed model requires no additional
hyperparameters and outperforms comparable baselines by reducing forgetting.
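A minimal sketch of the layer's shape, assuming effective weights are the slow weights plus a gated Hebbian trace; the gate `alpha`, the decay, and the update rule are assumptions rather than the paper's exact equations:

```python
import torch
import torch.nn.functional as F

n_classes, feat_dim, decay = 10, 128, 0.99
slow_W = (0.01 * torch.randn(n_classes, feat_dim)).requires_grad_()
alpha = torch.full((n_classes, 1), 0.1, requires_grad=True)  # learned gate
hebb = torch.zeros(n_classes, feat_dim)  # fast, compressed episodic memory

def dhp_softmax(h: torch.Tensor, target: int) -> torch.Tensor:
    global hebb
    logits = (slow_W + alpha * hebb) @ h
    with torch.no_grad():  # Hebbian trace: bind activation to its class
        hebb = decay * hebb
        hebb[target] += (1 - decay) * h
    return F.log_softmax(logits, dim=-1)

print(dhp_softmax(torch.randn(feat_dim), target=3))
```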
Time-aware Large Kernel Convolutions
To date, most state-of-the-art sequence modeling architectures use attention
to build generative models for language-based tasks. Some of these models use
all the available sequence tokens to generate an attention distribution, which
results in a time complexity of $O(n^2)$. Alternatively, they utilize depthwise
convolutions with softmax-normalized kernels of size $k$ acting as a
limited-window self-attention, resulting in a time complexity of $O(k \cdot n)$.
In this paper, we introduce Time-aware Large Kernel (TaLK) Convolutions, a
novel adaptive convolution operation that learns to predict the size of a
summation kernel instead of using a fixed-sized kernel matrix. This method
yields a time complexity of $O(n)$, effectively making the sequence encoding
process linear in the number of tokens. We evaluate the proposed method on
large-scale standard machine translation, abstractive summarization and
language modeling datasets and show that TaLK Convolutions constitute an
efficient improvement over other attention/convolution based approaches.
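A hedged sketch of the summation-kernel idea using prefix sums, which is what makes the operation linear in sequence length; here the per-position window offsets are sampled, whereas in the paper they are predicted from the input, and the exact parameterization differs:

```python
import torch

def talk_conv(x: torch.Tensor, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
    """x: (n, d); left/right: (n,) integer window offsets per position."""
    n, d = x.shape
    csum = torch.cat([torch.zeros(1, d), x.cumsum(dim=0)])  # (n + 1, d)
    idx = torch.arange(n)
    lo = (idx - left).clamp(min=0)       # inclusive window start
    hi = (idx + right).clamp(max=n - 1)  # inclusive window end
    window_sum = csum[hi + 1] - csum[lo]  # O(1) per position via prefix sums
    return window_sum / (hi - lo + 1).unsqueeze(1).float()

x = torch.randn(12, 8)
left = torch.randint(0, 3, (12,))   # stand-ins for predicted offsets
right = torch.randint(0, 3, (12,))
print(talk_conv(x, left, right).shape)  # torch.Size([12, 8])
```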
Relational recurrent neural networks
Memory-based neural networks model temporal data by leveraging an ability to
remember information for long periods. It is unclear, however, whether they
also have an ability to perform complex relational reasoning with the
information they remember. Here, we first confirm our intuitions that standard
memory architectures may struggle at tasks that heavily involve an
understanding of the ways in which entities are connected -- i.e., tasks
involving relational reasoning. We then improve upon these deficits by using a
new memory module -- a Relational Memory Core (RMC) -- which employs
multi-head dot product attention to allow memories to interact. Finally, we
test the RMC on a suite of tasks that may profit from more capable relational
reasoning across sequential information, and show large gains in RL domains
(e.g. Mini PacMan), program evaluation, and language modeling, achieving
state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord
datasets.
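A minimal sketch of one relational-memory step, assuming PyTorch's multi-head attention; the slot count and sizes are illustrative, and the paper's gating and MLP details are omitted:

```python
import torch
import torch.nn as nn

d_model, n_slots, n_heads = 64, 8, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def rmc_step(memory: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """memory: (batch, n_slots, d); x: (batch, 1, d) current input."""
    kv = torch.cat([memory, x], dim=1)  # slots see each other and the input
    update, _ = attn(memory, kv, kv)    # multi-head dot-product attention
    return memory + update              # residual update (gating omitted)

memory = torch.randn(2, n_slots, d_model)
x = torch.randn(2, 1, d_model)
print(rmc_step(memory, x).shape)  # torch.Size([2, 8, 64])
```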
Sparse Meta Networks for Sequential Adaptation and its Application to Adaptive Language Modelling
Training a deep neural network requires a large amount of single-task data
and involves a long, time-consuming optimization phase. This is not scalable to
complex, realistic environments with new, unexpected changes. Humans, in
contrast, can perform fast incremental learning on the fly, and memory systems
in the brain play a critical role in this ability. We introduce Sparse Meta
Networks -- a meta-learning approach to
learn online sequential adaptation algorithms for deep neural networks, by
using deep neural networks. We augment a deep neural network with a
layer-specific fast-weight memory. The fast weights are generated sparsely at
each time step and accumulated incrementally through time, providing a useful
inductive bias for online continual adaptation. We demonstrate strong
performance on a variety of sequential adaptation scenarios, from simple online
reinforcement learning to large-scale adaptive language modelling.
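A minimal sketch of a layer with sparsely generated, temporally accumulated fast weights; the gate, decay, and rank-one update form are assumptions standing in for the learned meta-network:

```python
import torch

d_in, d_out, decay, threshold = 32, 32, 0.95, 0.5
fast_W = torch.zeros(d_out, d_in)

def step(h: torch.Tensor, slow_W: torch.Tensor) -> torch.Tensor:
    global fast_W
    y = (slow_W + fast_W) @ h       # layer uses slow plus fast weights
    gate = torch.sigmoid(h.mean())  # stand-in for a learned write gate
    fast_W = decay * fast_W         # fast memory decays through time
    if gate > threshold:            # sparse-in-time rank-one write
        fast_W = fast_W + torch.outer(torch.tanh(y), h)
    return y

slow_W = 0.01 * torch.randn(d_out, d_in)
for _ in range(5):
    out = step(torch.randn(d_in), slow_W)
print(out.shape)  # torch.Size([32])
```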