Learning to Remember Translation History with a Continuous Cache
Existing neural machine translation (NMT) models generally translate
sentences in isolation, missing the opportunity to take advantage of
document-level information. In this work, we propose to augment NMT models with
a very light-weight cache-like memory network, which stores recent hidden
representations as translation history. The probability distribution over
generated words is updated online depending on the translation history
retrieved from the memory, endowing NMT models with the capability to
dynamically adapt over time. Experiments on multiple domains with different
topics and styles show the effectiveness of the proposed approach with
negligible impact on the computational cost.
Comment: Accepted by TACL 2018.
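To make the mechanism concrete, here is a minimal sketch (PyTorch) of a cache that stores recent decoder hidden states as keys and the words emitted at those states as values; the current state is matched against the keys and the resulting cache distribution is interpolated with the NMT distribution. The class name, cache size, and interpolation weight are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a continuous cache over translation history (assumed names).
import torch
import torch.nn.functional as F

class ContinuousCache:
    def __init__(self, cache_size=200):
        self.cache_size = cache_size
        self.keys = []    # recent decoder hidden states
        self.values = []  # ids of the words emitted at those states

    def add(self, hidden, word_id):
        self.keys.append(hidden.detach())
        self.values.append(word_id)
        if len(self.keys) > self.cache_size:
            self.keys.pop(0)
            self.values.pop(0)

    def probs(self, hidden, vocab_size):
        # Match the current state against cached states -> distribution over cached words.
        p = torch.zeros(vocab_size)
        if not self.keys:
            return p
        keys = torch.stack(self.keys)             # (n, d)
        scores = F.softmax(keys @ hidden, dim=0)  # (n,)
        for s, w in zip(scores, self.values):
            p[w] += s
        return p

def combine(nmt_probs, cache_probs, lam=0.2):
    # Interpolate the NMT distribution with the cache distribution (lam is assumed).
    return (1.0 - lam) * nmt_probs + lam * cache_probs
```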
Improved Language Modeling by Decoding the Past
Highly regularized LSTMs achieve impressive results on several benchmark
datasets in language modeling. We propose a new regularization method based on
decoding the last token in the context using the predicted distribution of the
next token. This biases the model towards retaining more contextual
information, in turn improving its ability to predict the next token. With
negligible overhead in the number of parameters and training time, our Past
Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on
the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax.
We also show gains by using PDR in combination with a mixture-of-softmaxes,
achieving a word level perplexity of 53.8 and 60.5 on these datasets. In
addition, our method achieves 1.169 bits-per-character on the Penn Treebank
Character dataset for character level language modeling. These results
constitute a new state-of-the-art in their respective settings.
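A hedged sketch of the idea behind such a past-decoding regularizer is given below (PyTorch): the predicted next-token distribution is summarized as an expected embedding, from which a small decoder tries to recover the last token of the context, and the resulting cross-entropy is added to the training loss. The `past_decoder` module and `pdr_weight` are illustrative assumptions; the paper's exact formulation may differ.

```python
# A hedged sketch of a past-decoding auxiliary loss (assumed formulation).
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim = 10000, 400
embedding = nn.Embedding(vocab_size, emb_dim)
past_decoder = nn.Linear(emb_dim, vocab_size)   # decodes the previous token (assumed)
pdr_weight = 0.1                                # regularization strength (assumed)

def pdr_loss(next_token_logits, prev_token_ids):
    # Predicted distribution over the next token ...
    next_probs = F.softmax(next_token_logits, dim=-1)   # (batch, vocab)
    # ... summarized as an expected input embedding ...
    expected_emb = next_probs @ embedding.weight        # (batch, emb_dim)
    # ... from which the last token of the context is decoded.
    prev_logits = past_decoder(expected_emb)            # (batch, vocab)
    return pdr_weight * F.cross_entropy(prev_logits, prev_token_ids)
```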
Unbounded cache model for online language modeling with open vocabulary
Recently, continuous cache models were proposed as extensions to recurrent
neural network language models, to adapt their predictions to local changes in
the data distribution. These models only capture the local context, of up to a
few thousand tokens. In this paper, we propose an extension of continuous
cache models, which can scale to larger contexts. In particular, we use a large
scale non-parametric memory component that stores all the hidden activations
seen in the past. We leverage recent advances in approximate nearest neighbor
search and quantization algorithms to store millions of representations while
searching them efficiently. We conduct extensive experiments showing that our
approach significantly improves the perplexity of pre-trained language models
on new distributions, and can scale efficiently to much larger contexts than
previously proposed local cache models.
Comment: Accepted to NIPS 2017.
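The sketch below illustrates such an unbounded cache with an approximate nearest neighbor index (faiss). It uses a flat index for clarity, whereas scaling to millions of entries would rely on quantized indexes; the kernel on distances and its bandwidth `theta` are illustrative assumptions.

```python
# A sketch of an unbounded cache with approximate nearest neighbor search (faiss).
import numpy as np
import faiss

class UnboundedCache:
    def __init__(self, dim, theta=0.3):
        self.index = faiss.IndexFlatL2(dim)   # swap for a quantized IVF/PQ index at scale
        self.word_ids = []
        self.theta = theta                    # kernel bandwidth (assumed)

    def add(self, hidden, word_id):
        # Store every hidden activation seen so far, keyed by the word it predicted.
        self.index.add(hidden.reshape(1, -1).astype(np.float32))
        self.word_ids.append(word_id)

    def probs(self, hidden, vocab_size, k=1024):
        p = np.zeros(vocab_size, dtype=np.float32)
        if self.index.ntotal == 0:
            return p
        k = min(k, self.index.ntotal)
        dist, idx = self.index.search(hidden.reshape(1, -1).astype(np.float32), k)
        weights = np.exp(-self.theta * dist[0])   # kernel on L2 distances
        weights /= weights.sum()
        for w, i in zip(weights, idx[0]):
            p[self.word_ids[i]] += w
        return p
```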
Input-to-Output Gate to Improve RNN Language Models
This paper proposes a reinforcing method that refines the output layers of
existing Recurrent Neural Network (RNN) language models. We refer to our
proposed method as Input-to-Output Gate (IOG). IOG has an extremely simple
structure, and thus can be easily combined with any RNN language model. Our
experiments on the Penn Treebank and WikiText-2 datasets demonstrate that IOG
consistently boosts the performance of several different types of current
topline RNN language models.
Comment: Accepted as a conference paper in IJCNLP 2017.
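A minimal sketch of the gate (PyTorch) is shown below: a sigmoid gate computed from the current input word's embedding modulates the RNN output element-wise before the existing softmax layer. Dimensions and the module name are assumptions for illustration.

```python
# A minimal sketch of an Input-to-Output Gate applied on top of an existing RNN LM.
import torch
import torch.nn as nn

class InputToOutputGate(nn.Module):
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(emb_dim, hidden_dim)

    def forward(self, input_embedding, rnn_output):
        # Element-wise gate derived from the current input token's embedding.
        g = torch.sigmoid(self.gate(input_embedding))
        return g * rnn_output   # gated vector is fed to the unchanged softmax layer
```

Because the gated vector simply replaces the RNN output that the base model's softmax layer already consumes, the module can be bolted onto existing models without changing them.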
Modeling Vocabulary for Big Code Machine Learning
When building machine learning models that operate on source code, several
decisions have to be made to model source-code vocabulary. These decisions can
have a large impact: some can lead to not being able to train models at all,
others significantly affect performance, particularly for Neural Language
Models. Yet, these decisions are not often fully described. This paper lists
important modeling choices for source code vocabulary, and explores their
impact on the resulting vocabulary on a large-scale corpus of 14,436 projects.
We show that a subset of these decisions has decisive characteristics, making it
possible to train accurate Neural Language Models quickly on a large corpus of
10,106 projects.
Comment: 12 pages, 1 figure.
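As a toy illustration of how such choices interact, the snippet below shows two common vocabulary decisions for source code, splitting compound identifiers and lowercasing, and how they collapse several surface forms into a handful of subtokens. The tokens and the camel-case splitter are assumptions for illustration, not the paper's pipeline.

```python
# A toy illustration (assumed tokens) of identifier splitting and lowercasing.
import re

tokens = ["getUserName", "get_user_name", "GetUserName", "userName", "username"]

def split_identifier(tok):
    # Split on underscores and lowercase-to-uppercase boundaries (camelCase).
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", tok)
    return [p for p in parts if p]

raw_vocab = set(tokens)
split_vocab = {p.lower() for t in tokens for p in split_identifier(t)}

print(len(raw_vocab))   # 5 distinct identifiers as raw tokens
print(split_vocab)      # {'get', 'user', 'name', 'username'}: 4 reusable subtokens
```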
Deep Residual Output Layers for Neural Language Generation
Many tasks, including language generation, benefit from learning the
structure of the output space, particularly when the space of output labels is
large and the data is sparse. State-of-the-art neural language models
indirectly capture the output space structure in their classifier weights since
they lack parameter sharing across output labels. Learning shared output label
mappings helps, but existing methods have limited expressivity and are prone to
overfitting. In this paper, we investigate the usefulness of more powerful
shared mappings for output labels, and propose a deep residual output mapping
with dropout between layers to better capture the structure of the output space
and avoid overfitting. Evaluations on three language generation tasks show that
our output label mapping can match or improve state-of-the-art recurrent and
self-attention architectures, and suggest that the classifier does not
necessarily need to be high-rank to better model natural language if it is
better at capturing the structure of the output space.
Comment: To appear in ICML 2019.
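A hedged sketch of such a deep residual output mapping (PyTorch) is given below: shared output label embeddings are refined by residual blocks with dropout, and logits are the dot product of the context vector with the mapped label embeddings. The block count, sizes, and dropout rate are illustrative assumptions.

```python
# A hedged sketch of a deep residual output mapping over shared label embeddings.
import torch
import torch.nn as nn

class ResidualOutputMapping(nn.Module):
    def __init__(self, vocab_size, dim, num_blocks=2, dropout=0.3):
        super().__init__()
        self.label_emb = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(dropout))
            for _ in range(num_blocks)
        )

    def forward(self, context):
        # Refine the label embeddings with residual blocks shared across all labels.
        e = self.label_emb.weight            # (vocab, dim)
        for block in self.blocks:
            e = e + block(e)                 # residual connection with dropout inside
        return context @ e.t()               # (batch, vocab) logits
```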
Maybe Deep Neural Networks are the Best Choice for Modeling Source Code
Statistical language modeling techniques have successfully been applied to
source code, yielding a variety of new software development tools, such as
tools for code suggestion and improving readability. A major issue with these
techniques is that code introduces new vocabulary at a far higher rate than
natural language, as new identifier names proliferate. But traditional language
models limit the vocabulary to a fixed set of common words. For code, this
strong assumption has been shown to have a significant negative effect on
predictive performance. However, an open-vocabulary version of neural network
language models for code has not been introduced in the literature. We present
a new open-vocabulary neural language model for code that is not limited to a
fixed vocabulary of identifier names. We employ a segmentation into subword
units, subsequences of tokens chosen based on a compression criterion,
following previous work in machine translation. Our network achieves
best-in-class performance, outperforming even the state-of-the-art methods of
Hellendoorn and Devanbu that are designed specifically to model code.
Furthermore, we present a simple method for dynamically adapting the model to a
new test project, resulting in increased performance. We showcase our
methodology on code corpora in three different languages of over a billion
tokens each, hundreds of times larger than in previous work. To our knowledge,
this is the largest neural language model for code that has been reported.
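The snippet below sketches the flavor of such subword segmentation in the style of byte-pair encoding: a token is split into characters and frequent symbol pairs are merged in a learned order, so unseen identifiers decompose into known subwords. The merge table here is a toy assumption, not one learned from data.

```python
# A toy byte-pair-encoding style segmenter for code tokens (assumed merge table).
def apply_bpe(token, merges):
    symbols = list(token)
    for a, b in merges:                      # apply merges in their learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("g", "e"), ("ge", "t"), ("N", "a"), ("Na", "m"), ("Nam", "e")]
print(apply_bpe("getName", merges))          # ['get', 'Name']
```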
Modeling Coherence for Neural Machine Translation with Dynamic and Topic Caches
Sentences in a well-formed text are connected to each other via various links
to form the cohesive structure of the text. Current neural machine translation
(NMT) systems translate a text in a conventional sentence-by-sentence fashion,
ignoring such cross-sentence links and dependencies. This may lead to the
generation of an incoherent target text for a coherent source text. In order to handle this
issue, we propose a cache-based approach to modeling coherence for neural
machine translation by capturing contextual information either from recently
translated sentences or the entire document. In particular, we explore two types
of caches: a dynamic cache, which stores words from the best translation
hypotheses of preceding sentences, and a topic cache, which maintains a set of
target-side topical words that are semantically related to the document to be
translated. On this basis, we build a new layer to score target words in these
two caches with a cache-based neural model. Here the estimated probabilities
from the cache-based neural model are combined with NMT probabilities into the
final word prediction probabilities via a gating mechanism. Finally, the
proposed cache-based neural model is trained jointly with the NMT system in an
end-to-end manner. Experiments and analysis presented in this paper demonstrate
that the proposed cache-based model achieves substantial improvements over
several state-of-the-art SMT and NMT baselines.
Comment: Accepted by COLING 2018; 11 pages, 3 figures.
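A minimal sketch of the gating step (PyTorch) is shown below: target words present in the dynamic or topic cache are scored from the decoder state, and the resulting cache distribution is mixed with the NMT distribution through a learned sigmoid gate. The scoring layer and gate parameterization are illustrative assumptions.

```python
# A hedged sketch of gating cache scores into the NMT word distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CacheGate(nn.Module):
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.score = nn.Linear(hidden_dim, vocab_size)   # scores words found in the caches
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_state, nmt_probs, cache_mask):
        # cache_mask is 1 for words in the dynamic or topic cache, 0 otherwise
        # (assumes at least one cached word is present).
        scores = self.score(decoder_state).masked_fill(cache_mask == 0, float("-inf"))
        cache_probs = F.softmax(scores, dim=-1)
        g = torch.sigmoid(self.gate(decoder_state))       # (batch, 1) gate
        return g * cache_probs + (1.0 - g) * nmt_probs
```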
FRAGE: Frequency-Agnostic Word Representation
Continuous word representation (aka word embedding) is a basic building block
in many neural network-based models used in natural language processing tasks.
Although it is widely accepted that words with similar semantics should be
close to each other in the embedding space, we find that word embeddings
learned in several tasks are biased towards word frequency: the embeddings of
high-frequency and low-frequency words lie in different subregions of the
embedding space, and the embedding of a rare word and a popular word can be far
from each other even if they are semantically similar. This makes learned word
embeddings ineffective, especially for rare words, and consequently limits the
performance of these neural network models. In this paper, we develop a neat,
simple yet effective way to learn \emph{FRequency-AGnostic word Embedding}
(FRAGE) using adversarial training. We conducted comprehensive studies on ten
datasets across four natural language processing tasks, including word
similarity, language modeling, machine translation and text classification.
Results show that with FRAGE, we achieve higher performance than the baselines
in all tasks.
Comment: To appear in NIPS 2018.
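The sketch below (PyTorch) illustrates the adversarial idea: a discriminator tries to tell high-frequency from low-frequency word embeddings, and the embeddings are trained to fool it alongside the task loss. The discriminator architecture and the two-way split into frequent and rare words are illustrative assumptions.

```python
# A hedged sketch of adversarial training toward frequency-agnostic embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim = 300
discriminator = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

def adversarial_losses(freq_emb, rare_emb):
    # Discriminator loss: classify frequent (1) vs. rare (0) word embeddings.
    logits = torch.cat([discriminator(freq_emb), discriminator(rare_emb)])
    labels = torch.cat([torch.ones(len(freq_emb), 1), torch.zeros(len(rare_emb), 1)])
    d_loss = F.binary_cross_entropy_with_logits(logits, labels)
    # Embedding loss: push rare-word embeddings to look frequent, added to the task loss.
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(rare_emb), torch.ones(len(rare_emb), 1))
    return d_loss, g_loss
```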
WeNet: Weighted Networks for Recurrent Network Architecture Search
In recent years, there has been increasing demand for automatic architecture
search in deep learning. Numerous approaches have been proposed and led to
state-of-the-art results in various applications, including image
classification and language modeling. In this paper, we propose a novel way of
architecture search by means of weighted networks (WeNet), which consist of a
number of networks, with each assigned a weight. These weights are updated with
back-propagation to reflect the importance of different networks. Such weighted
networks bear similarity to a mixture of experts. We conduct experiments on the
Penn Treebank and WikiText-2 datasets. We show that the proposed WeNet can find
recurrent architectures that result in state-of-the-art performance.
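A loose sketch of the weighted-network idea (PyTorch) is given below: several candidate recurrent cells are run in parallel and their outputs are mixed with softmax-normalized weights that are learned by back-propagation, much like a mixture of experts. Using identical GRU cells as stand-ins for the searched architectures is an assumption for illustration.

```python
# A loose sketch of mixing candidate recurrent cells with learned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedNetworks(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_candidates=4):
        super().__init__()
        # Identical GRU cells stand in for the candidate architectures being searched.
        self.candidates = nn.ModuleList(
            nn.GRUCell(input_dim, hidden_dim) for _ in range(num_candidates)
        )
        self.weights = nn.Parameter(torch.zeros(num_candidates))  # updated by backprop

    def forward(self, x, h):
        w = F.softmax(self.weights, dim=0)            # importance of each candidate
        outputs = [cell(x, h) for cell in self.candidates]
        return sum(wi * o for wi, o in zip(w, outputs))
```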