On challenges in training recurrent neural networks
In a multi-step prediction problem, the prediction at each time step can depend on the input at any previous time step, including steps far in the past. Modelling such long-term dependencies is one of the fundamental problems in machine learning. In theory, Recurrent Neural Networks (RNNs) can model any long-term dependency. In practice, they can only model short-term dependencies, because the magnitude of the gradients can grow or shrink exponentially with the length of the sequence (the exploding and vanishing gradient problem). This thesis explores the vanishing gradient problem in recurrent neural networks and proposes novel solutions to it.
Chapter 3 explores the idea of using external memory to store the hidden states of a Long Short-Term Memory (LSTM) network. By making the read and write operations of the external memory discrete, the proposed architecture reduces the rate at which gradients vanish in an LSTM. These discrete operations also enable the network to create dynamic skip connections across time. Chapter 4 attempts to characterize all the sources of vanishing gradients in a recurrent neural network and proposes a new recurrent architecture with significantly better gradient flow than state-of-the-art recurrent architectures. The proposed Non-saturating Recurrent Units (NRUs) have no saturating activation functions and use additive cell updates instead of multiplicative cell updates.
Chapter 5 discusses the challenges of using recurrent neural networks in the context of lifelong learning. In the lifelong learning setting, the network is expected to learn a series of tasks over its lifetime. The dependencies in lifelong learning exist not just within a task, but also across tasks. This chapter discusses the two fundamental problems in lifelong learning: (i) catastrophic forgetting of old tasks, and (ii) network capacity saturation. Further, it proposes a solution to both problems for training a recurrent neural network.
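The exponential growth or decay of gradients mentioned in this abstract can be illustrated with a toy calculation. The sketch below is not taken from the thesis: it assumes a scalar linear recurrence h_t = w * h_{t-1}, for which the gradient of h_T with respect to h_0 is w^T, and the function name `gradient_magnitude` is hypothetical.

```python
def gradient_magnitude(recurrent_weight: float, steps: int) -> float:
    """Return |d h_T / d h_0| for the toy recurrence h_t = w * h_{t-1}.

    Assumption: a scalar, linear recurrence with no inputs, chosen only
    because its gradient has the closed form w ** T.
    """
    return abs(recurrent_weight) ** steps


if __name__ == "__main__":
    for w in (0.9, 1.0, 1.1):
        magnitudes = [gradient_magnitude(w, t) for t in (10, 50, 100)]
        print(f"w={w}: {magnitudes}")
```

With a weight slightly below 1 the gradient shrinks towards zero as the number of steps grows, and with a weight slightly above 1 it blows up; only a weight of exactly 1 keeps it constant in this toy setting.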
MLMLM: link prediction with mean likelihood masked language model
Knowledge Bases (KBs) are easy to query, verifiable, and interpretable. However, they scale with man-hours and high-quality data. Masked Language Models (MLMs), such as BERT, scale with computing power as well as unstructured raw text data. The knowledge contained within these models is, however, not directly interpretable. We propose to perform link prediction with MLMs to address both the scalability issues of KBs and the interpretability issues of MLMs. By committing the knowledge embedded in MLMs to a KB, it becomes interpretable. To do that, we introduce MLMLM, Mean Likelihood Masked Language Model, an approach that compares the mean likelihood of generating the different entities to perform link prediction in a tractable manner. We obtain State of the Art (SotA) results on the WN18RR dataset and SotA results on the Precision@1 metric on the Wikidata5M inductive and transductive settings. We also obtain convincing results on link prediction on previously unseen entities, making MLMLM a suitable approach to introducing new entities to a KB.
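The idea of ranking candidate entities by mean likelihood can be sketched as follows. This is only an illustration under stated assumptions, not the MLMLM implementation: the per-token probabilities are made-up numbers rather than model outputs, averaging per-token log-probabilities is one possible reading of "mean likelihood", and the names `mean_log_likelihood` and `rank_entities` are hypothetical.

```python
import math


def mean_log_likelihood(token_log_probs):
    """Average the per-token log-probabilities of one candidate entity."""
    return sum(token_log_probs) / len(token_log_probs)


def rank_entities(candidates):
    """Sort candidate entity names from most to least likely.

    `candidates` maps an entity name to the list of log-probabilities
    assigned to each of its tokens (hypothetical values here).
    """
    return sorted(
        candidates,
        key=lambda name: mean_log_likelihood(candidates[name]),
        reverse=True,
    )


# Hypothetical per-token probabilities for three candidate tail entities.
candidates = {
    "dog": [math.log(0.20)],
    "domestic dog": [math.log(0.30), math.log(0.40)],
    "cat": [math.log(0.05)],
}
ranking = rank_entities(candidates)
print(ranking)
```

Averaging rather than summing keeps entities with more tokens comparable to shorter ones: with these made-up numbers, "domestic dog" has a lower summed log-probability than "dog" but a higher mean, so it ranks first.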
Local Structure Matters Most: Perturbation Study in NLU
Recent research analyzing the sensitivity of natural language understanding
models to word-order perturbations has shown that neural models are
surprisingly insensitive to the order of words. In this paper, we investigate
this phenomenon by developing perturbations that alter the order of
words, subwords, and characters and analyzing their effect on neural models'
performance on language understanding tasks. We measure the
impact of perturbations on the local neighborhood and the global
position of characters in the perturbed texts and observe that perturbation
functions found in prior literature only affect the global ordering while the
local ordering remains relatively unperturbed. We empirically show that neural
models, irrespective of their inductive biases, pretraining scheme, or choice
of tokenization, mostly rely on the local structure of text to build
understanding and make limited use of the global structure.
Comment: 11 pages, 13 figures + appendix
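The distinction between local and global ordering drawn in this abstract can be made concrete with a small sketch. It is not the paper's code or its metrics: the two perturbation functions and the bigram-overlap measure below are hypothetical stand-ins, assuming fixed-size character windows.

```python
import random


def shuffle_within_windows(text, window, rng):
    """Disturb local neighborhoods: shuffle characters inside each window,
    so every character stays close to its original position."""
    chars = []
    for start in range(0, len(text), window):
        chunk = list(text[start:start + window])
        rng.shuffle(chunk)
        chars.extend(chunk)
    return "".join(chars)


def shuffle_chunks(text, window, rng):
    """Disturb global positions: keep each window intact but reorder the
    windows, so neighborhoods inside a window are preserved."""
    chunks = [text[i:i + window] for i in range(0, len(text), window)]
    rng.shuffle(chunks)
    return "".join(chunks)


def preserved_bigrams(original, perturbed):
    """Fraction of distinct character bigrams of `original` found in `perturbed`."""
    source = {original[i:i + 2] for i in range(len(original) - 1)}
    if not source:
        return 1.0
    target = {perturbed[i:i + 2] for i in range(len(perturbed) - 1)}
    return len(source & target) / len(source)


if __name__ == "__main__":
    rng = random.Random(0)
    text = "local structure matters most"
    for perturb in (shuffle_within_windows, shuffle_chunks):
        out = perturb(text, 4, rng)
        print(perturb.__name__, repr(out), preserved_bigrams(text, out))
```

Reordering whole chunks moves characters far from their original positions while leaving most neighboring pairs intact, which mirrors the observation that perturbations from prior work mainly affect global ordering.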
A Brief Study on the Effects of Training Generative Dialogue Models with a Semantic loss
Neural models trained for next-utterance generation in dialogue tasks learn to
mimic the n-gram sequences in the training set with training objectives like
negative log-likelihood (NLL) or cross-entropy. Such commonly used training
objectives do not foster generating alternate responses to a context. However,
the effect of minimizing an alternate training objective, one that encourages a
model to generate an alternate response and scores it on semantic similarity,
has not been well studied. We hypothesize that a language generation model can
improve its diversity by learning to generate alternate text during training
and minimizing a semantic loss as an auxiliary objective. We explore this idea
on two data sets of different sizes on the task of next-utterance generation in
goal-oriented dialogues. We make two observations: (1) minimizing a semantic
objective improved diversity in responses on the smaller data set (Frames) but
was only as good as minimizing the NLL on the larger data set (MultiWoZ); (2)
large language model embeddings can be more useful as a semantic loss objective
than as initialization for token embeddings.
Comment: Accepted at SIGDial 202
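One way to picture a semantic loss used as an auxiliary objective is sketched below. The abstract does not specify the form of the loss, so everything here is an assumption: the semantic term is taken to be one minus the cosine similarity between two embedding vectors, the `weight` parameter and the function names are hypothetical, and the embeddings are toy lists rather than language model outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combined_loss(nll, response_embedding, reference_embedding, weight=0.5):
    """Add a weighted semantic term to the NLL.

    Assumption: the semantic loss is 1 - cosine similarity between the
    embedding of the generated response and that of the reference.
    """
    semantic_loss = 1.0 - cosine_similarity(response_embedding, reference_embedding)
    return nll + weight * semantic_loss


if __name__ == "__main__":
    # Identical embeddings add no penalty; orthogonal ones add `weight`.
    print(combined_loss(2.0, [1.0, 0.0], [1.0, 0.0]))
    print(combined_loss(2.0, [1.0, 0.0], [0.0, 1.0]))
```

Under this formulation a response that differs in wording from the reference but lands close to it in embedding space incurs only a small additional penalty on top of the NLL.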