42 research outputs found
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
Autoregressive Transformers are strong language models but incur O(T)
complexity during per-token generation due to the self-attention mechanism.
Recent work proposes kernel-based methods to approximate causal self-attention
by replacing it with recurrent formulations with various update rules and
feature maps to achieve O(1) time and memory complexity. We explore these
approaches and find that they are unnecessarily complex, and propose a simple
alternative - decaying fast weights - that runs fast on GPU, outperforms prior
methods, and retains 99% of attention's performance for GPT-2. We also show
competitive performance on WikiText-103 against more complex attention
substitutes.
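The recurrent idea in this abstract can be illustrated with a minimal sketch. It assumes a generic fast-weight recurrence in which a state matrix is scaled by a scalar decay, updated with the outer product of the current key and value, and then read with the current query. The function names, the scalar `decay` parameter, and the absence of feature maps or normalization are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: a generic decaying fast-weight recurrence.
# Names and the scalar decay are assumptions, not the paper's formulation.

def fast_weight_step(state, k, v, q, decay):
    """One step with cost independent of sequence length.

    state: d_k x d_v matrix (list of lists) summarizing all past tokens.
    Decays the old state, writes the outer product k v^T, reads with q.
    """
    d_k, d_v = len(k), len(v)
    new_state = [
        [decay * state[i][j] + k[i] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]
    # Read-out: out = q^T S, a vector of length d_v.
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out


def run(keys, values, queries, decay=0.9):
    """Process a sequence token by token, carrying only the fixed-size state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        state, out = fast_weight_step(state, k, v, q, decay)
        outputs.append(out)
    return outputs
```

Because the state has a fixed size, the per-token cost does not grow with the number of tokens already processed, which is the property the abstract contrasts with the O(T) cost of self-attention.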
Inductive biases for efficient information transfer in artificial networks
Despite remarkable advances in a wide variety of subjects, neural networks still struggle with simple tasks that humans excel at. As outlined in recent work, we hypothesize that the qualitative gap between current deep learning and human-level artificial intelligence is the result of missing essential inductive biases. In other words, by identifying some of these key inductive biases, we will improve information transfer in artificial networks and address some of their most important current limitations on a wide range of tasks. The limitations we will focus on in this thesis are out-of-distribution systematic generalization and the ability to learn over extremely long time scales. In the First Article, we will focus on extending spectrally constrained Recurrent Neural Networks (RNNs) and propose a novel connectivity structure based on the Schur decomposition, retaining the stability advantages and training speed of orthogonal RNNs while enhancing expressivity for short-term complex computations via transient dynamics. This serves as a first step in mitigating the Exploding Vanishing Gradient Problem (EVGP). In the Second Article, we will focus on memory-augmented self-attention RNNs as an alternative way of tackling the EVGP.
Here the main contribution will be a formal analysis of asymptotic gradient stability, and we will identify event relevancy as a key ingredient for scaling attention systems. We then leverage these theoretical results to provide a novel relevancy screening mechanism, which makes self-attention sparse and scalable while maintaining good gradient propagation over long sequences. Finally, in the Third Article, we distill a minimal set of inductive biases for purely relational cognitive tasks and identify that separating relational information from sensory input is a key inductive ingredient for OoD generalization on unseen inputs. We further discuss extensions to unseen relations as well as settings with spurious features.
BolT: Fused Window Transformers for fMRI Time Series Analysis
Deep-learning models have enabled performance leaps in analysis of
high-dimensional functional MRI (fMRI) data. Yet, many previous methods are
suboptimally sensitive for contextual representations across diverse time
scales. Here, we present BolT, a blood-oxygen-level-dependent transformer
model, for analyzing multi-variate fMRI time series. BolT leverages a cascade
of transformer encoders equipped with a novel fused window attention mechanism.
Encoding is performed on temporally-overlapped windows within the time series
to capture local representations. To integrate information temporally,
cross-window attention is computed between base tokens in each window and
fringe tokens from neighboring windows. To gradually transition from local to
global representations, the extent of window overlap and thereby number of
fringe tokens are progressively increased across the cascade. Finally, a novel
cross-window regularization is employed to align high-level classification
features across the time series. Comprehensive experiments on large-scale
public datasets demonstrate the superior performance of BolT against
state-of-the-art methods. Furthermore, explanatory analyses to identify
landmark time points and regions that contribute most significantly to model
decisions corroborate prominent neuroscientific findings in the literature.
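The windowing scheme described in this abstract can be sketched roughly as follows. The sketch assumes that each window has a block of base tokens and that its fringe tokens are the nearest time points on either side, taken from neighboring windows. The function name `fused_windows`, the parameters `window_size` and `fringe`, and the non-overlapping base blocks are hypothetical simplifications; the actual BolT layout may differ.

```python
# Illustrative sketch only: index layout for base and fringe tokens.
# Parameter names and the exact windowing are assumptions, not BolT's code.

def fused_windows(num_tokens, window_size, fringe):
    """Return a list of (base_indices, fringe_indices) pairs, one per window.

    Base tokens are the time points inside the window; fringe tokens are up
    to `fringe` time points on each side, borrowed from neighboring windows.
    """
    windows = []
    for start in range(0, num_tokens, window_size):
        end = min(start + window_size, num_tokens)
        base = list(range(start, end))
        left = list(range(max(0, start - fringe), start))
        right = list(range(end, min(num_tokens, end + fringe)))
        windows.append((base, left + right))
    return windows
```

Increasing `fringe` from one layer to the next would widen the temporal context each window can attend to, which loosely mirrors the progressive local-to-global transition the abstract describes.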
Towards better understanding and improving optimization in recurrent neural networks
Recurrent neural networks (RNNs) are known for their notorious exploding and vanishing gradient problem (EVGP). This problem becomes more evident in tasks where the information needed to solve them correctly exists over long time scales, because it prevents important gradient components from being back-propagated adequately over a large number of steps. The papers in this work formalize gradient propagation in parametric and semi-parametric RNNs to gain a better understanding of the source of this problem. The first paper introduces a simple stochastic algorithm (h-detach) that is specific to LSTM optimization and targeted at addressing the EVGP. Using this, we show significant improvements over the vanilla LSTM in terms of convergence speed, robustness to seed and learning rate, and generalization on various benchmark datasets. The next paper focuses on semi-parametric RNNs and self-attentive networks. Self-attention provides a way for a system to dynamically access past states (stored in memory), which helps mitigate vanishing gradients. Although useful, it is difficult to scale, as the size of the computational graph grows quadratically with the number of time steps involved. In the paper we describe a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence while ensuring good gradient propagation.
The Brain's Router: A Cortical Network Model of Serial Processing in the Primate Brain
The human brain efficiently solves certain operations such as object recognition and categorization through a massively parallel network of dedicated processors. However, human cognition also relies on the ability to perform an arbitrarily large set of tasks by flexibly recombining different processors into a novel chain. This flexibility comes at the cost of a severe slowing down and a seriality of operations (100–500 ms per step). A limit on parallel processing is demonstrated in experimental setups such as the psychological refractory period (PRP) and the attentional blink (AB), in which the processing of an element either significantly delays (PRP) or impedes conscious access to (AB) a second, rapidly presented element. Here we present a spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior. The precise mapping of incoming sensory stimuli onto motor representations relies on a “router” network capable of flexibly interconnecting processors and rapidly changing its configuration from one task to another. Simulations show that, when presented with dual-task stimuli, the network exhibits parallel processing at peripheral sensory levels, a memory buffer capable of keeping the result of sensory processing on hold, and a slow serial performance at the router stage, resulting in a performance bottleneck. The network captures the detailed dynamics of human behavior during dual-task performance, including both mean RTs and RT distributions, and establishes concrete predictions on neuronal dynamics during dual-task experiments in humans and non-human primates.
Inductive Biases for Deep Learning of Higher-Level Cognition
A fascinating hypothesis is that human and animal intelligence could be
explained by a few principles (rather than an encyclopedic list of heuristics).
If that hypothesis were correct, we could more easily both understand our own
intelligence and build intelligent machines. Just like in physics, the
principles themselves would not be sufficient to predict the behavior of
complex systems like brains, and substantial computation might be needed to
simulate human-like intelligence. This hypothesis would suggest that studying
the kind of inductive biases that humans and animals exploit could help both
clarify these principles and provide inspiration for AI research and
neuroscience theories. Deep learning already exploits several key inductive
biases, and this work considers a larger list, focusing on those which concern
mostly higher-level and sequential conscious processing. The objective of
clarifying these particular principles is that they could potentially help us
build AI systems benefiting from humans' abilities in terms of flexible
out-of-distribution and systematic generalization, which is currently an area
where a large gap exists between state-of-the-art machine learning and human
intelligence.
Comment: This document contains a review of the authors' research as part of the
requirement of AG's predoctoral exam, an overview of the main contributions
of the authors' few recent papers (co-authored with several other co-authors)
as well as a vision of proposed future research.
Temporal Networks
A great variety of systems in nature, society and technology -- from the web
of sexual contacts to the Internet, from the nervous system to power grids --
can be modeled as graphs of vertices coupled by edges. The network structure,
describing how the graph is wired, helps us understand, predict and optimize
the behavior of dynamical systems. In many cases, however, the edges are not
continuously active. As an example, in networks of communication via email,
text messages, or phone calls, edges represent sequences of instantaneous or
practically instantaneous contacts. In some cases, edges are active for
non-negligible periods of time: e.g., the proximity patterns of inpatients at
hospitals can be represented by a graph where an edge between two individuals
is on throughout the time they are at the same ward. Like network topology, the
temporal structure of edge activations can affect dynamics of systems
interacting through the network, from disease contagion on the network of
patients to information diffusion over an e-mail network. In this review, we
present the emergent field of temporal networks, and discuss methods for
analyzing topological and temporal structure and models for elucidating their
relation to the behavior of dynamical systems. In the light of traditional
network theory, one can see this framework as moving the information of when
things happen from the dynamical system on the network, to the network itself.
Since fundamental properties, such as the transitivity of edges, do not
necessarily hold in temporal networks, many of these methods need to be quite
different from those for static networks.
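The point about transitivity can be made concrete with a small sketch. It assumes that a temporal network is given as a list of undirected, instantaneous contacts `(u, v, t)` and that a path is time-respecting only if its contact times strictly increase. The function name `reachable` and the strict-ordering rule are illustrative assumptions rather than definitions taken from the review.

```python
# Illustrative sketch only: reachability along time-respecting paths.
# The contact format (u, v, t) and strict time ordering are assumptions.

def reachable(contacts, source, start_time=float("-inf")):
    """Return the set of nodes reachable from `source`.

    A contact (u, v, t) can be used only if the path arrived at one of its
    endpoints strictly before time t.
    """
    arrival = {source: start_time}  # earliest known arrival time per node
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        for a, b in ((u, v), (v, u)):  # contacts are undirected
            if a in arrival and arrival[a] < t < arrival.get(b, float("inf")):
                arrival[b] = t
    return set(arrival)


# B and C meet at t=1, A and B meet at t=2.
contacts = [("B", "C", 1), ("A", "B", 2)]
print(sorted(reachable(contacts, "A")))  # A reaches B, but too late to reach C
print(sorted(reachable(contacts, "C")))  # C reaches B at t=1, then A at t=2
```

In the static graph A-B-C, every node reaches every other. Here, A cannot reach C because the B-C contact has already happened by the time A meets B, while C can still reach A.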