
42 research outputs found

Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

Author: Mao Huanru Henry
Publication date: 09/10/2022

Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes
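The recurrent idea in this abstract can be illustrated with a minimal sketch. It assumes a single scalar decay factor and a plain outer-product write. The abstract does not specify the paper's feature maps, gating, or decay parameterisation, so the function names and the `decay` parameter below are hypothetical. The point is only that the state has a fixed size, so each generated token costs the same regardless of sequence length.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def decaying_fast_weight_step(
    state: Matrix,
    key: List[float],
    value: List[float],
    query: List[float],
    decay: float,
) -> Tuple[Matrix, List[float]]:
    """One recurrent step: decay the old state, write key*value^T, read with the query.

    Illustrative only: a scalar decay is an assumption, not the paper's stated rule.
    """
    d_k, d_v = len(key), len(value)
    # New state = decay * old state + outer product of key and value.
    new_state = [
        [decay * state[i][j] + key[i] * value[j] for j in range(d_v)]
        for i in range(d_k)
    ]
    # Output = query^T @ new_state (a d_v-dimensional vector).
    output = [sum(query[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, output


def run_sequence(
    keys: Matrix, values: Matrix, queries: Matrix, decay: float = 0.9
) -> Matrix:
    """Process a sequence token by token with a fixed-size (d_k x d_v) state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        state, out = decaying_fast_weight_step(state, k, v, q, decay)
        outputs.append(out)
    return outputs
```

Because the state never grows, memory and per-token time stay constant, whereas causal self-attention must revisit all T previous tokens at every step.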

arXiv.org e-Print Archive

Inductive biases for efficient information transfer in artificial networks

Author: Kerg Giancarlo
Publication date: 01/09/2022

Despite remarkable advances in a wide variety of subjects, neural networks still struggle on simple tasks that humans excel at. As outlined in recent work, we hypothesize that the qualitative gap between current deep learning and human-level artificial intelligence is the result of missing essential inductive biases. In other words, by identifying some of these key inductive biases, we will improve information transfer in artificial networks, as well as address some of their most important current limitations on a wide range of tasks. The limitations we focus on in this thesis are out-of-distribution systematic generalization and the ability to learn over extremely long time scales. In the First Article, we focus on extending spectrally constrained Recurrent Neural Networks (RNNs) and propose a novel connectivity structure based on the Schur decomposition, retaining the stability advantages and training speed of orthogonal RNNs while enhancing expressivity for short-term complex computations via transient dynamics. This serves as a first step in mitigating the Exploding Vanishing Gradient Problem (EVGP). In the Second Article, we focus on memory-augmented self-attention RNNs as an alternative way of tackling the EVGP. Here the main contribution is a formal analysis of asymptotic gradient stability, and we identify event relevancy as a key ingredient for scaling attention systems. We then leverage these theoretical results to provide a novel relevancy screening mechanism, which makes self-attention sparse and scalable while maintaining good gradient propagation over long sequences. Finally, in the Third Article, we distill a minimal set of inductive biases for purely relational cognitive tasks, and identify that separating relational information from sensory input is a key inductive ingredient for OoD generalization on unseen inputs. We further discuss extensions to unseen relations as well as settings with spurious features.

Dépôt Institutionnel Numérique

BolT: Fused Window Transformers for fMRI Time Series Analysis

Author: Bedel Hasan Atakan
Dalmaz Onat
Dar Salman Ul Hassan
Çukur Tolga
Şıvgın Irmak
Publication date: 19/09/2022

Deep-learning models have enabled performance leaps in the analysis of high-dimensional functional MRI (fMRI) data. Yet, many previous methods are suboptimally sensitive to contextual representations across diverse time scales. Here, we present BolT, a blood-oxygen-level-dependent transformer model, for analyzing multi-variate fMRI time series. BolT leverages a cascade of transformer encoders equipped with a novel fused window attention mechanism. Encoding is performed on temporally overlapped windows within the time series to capture local representations. To integrate information temporally, cross-window attention is computed between base tokens in each window and fringe tokens from neighboring windows. To gradually transition from local to global representations, the extent of window overlap, and thereby the number of fringe tokens, is progressively increased across the cascade. Finally, a novel cross-window regularization is employed to align high-level classification features across the time series. Comprehensive experiments on large-scale public datasets demonstrate the superior performance of BolT against state-of-the-art methods. Furthermore, explanatory analyses to identify landmark time points and regions that contribute most significantly to model decisions corroborate prominent neuroscientific findings in the literature.

arXiv.org e-Print Archive

Towards better understanding and improving optimization in recurrent neural networks

Author: Kanuparthi Bhargav
Publication date: 01/07/2020

Recurrent neural networks (RNNs) are notorious for the exploding and vanishing gradient problem (EVGP). This problem becomes more evident in tasks where the information needed to solve them correctly exists over long time scales, because it prevents important gradient components from being back-propagated adequately over a large number of steps. The papers in this work formalize gradient propagation in parametric and semi-parametric RNNs to gain a better understanding of the source of this problem. The first paper introduces a simple stochastic algorithm (h-detach) that is specific to LSTM optimization and targeted at addressing the EVGP. Using this, we show significant improvements over vanilla LSTM in terms of convergence speed, robustness to seed and learning rate, and generalization on various benchmark datasets. The next paper focuses on semi-parametric RNNs and self-attentive networks. Self-attention provides a way by which a system can dynamically access past states (stored in memory), which helps in mitigating vanishing gradients. Although useful, it is difficult to scale, as the size of the computational graph grows quadratically with the number of time steps involved. In the paper we describe a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence while ensuring good gradient propagation.

Dépôt Institutionnel Numérique

The Brain's Router: A Cortical Network Model of Serial Processing in the Primate Brain

Author: Ariel Zylberberg
Diego Fernández Slezak
Mariano Sigman
Pieter R. Roelfsema
Stanislas Dehaene
Publication venue: Public Library of Science
Publication date: 01/01/2010

The human brain efficiently solves certain operations such as object recognition and categorization through a massively parallel network of dedicated processors. However, human cognition also relies on the ability to perform an arbitrarily large set of tasks by flexibly recombining different processors into a novel chain. This flexibility comes at the cost of a severe slowing down and a seriality of operations (100–500 ms per step). A limit on parallel processing is demonstrated in experimental setups such as the psychological refractory period (PRP) and the attentional blink (AB), in which the processing of one element either significantly delays the processing of a second, rapidly presented element (PRP) or impedes conscious access to it (AB). Here we present a spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior. The precise mapping of incoming sensory stimuli onto motor representations relies on a “router” network capable of flexibly interconnecting processors and rapidly changing its configuration from one task to another. Simulations show that, when presented with dual-task stimuli, the network exhibits parallel processing at peripheral sensory levels, a memory buffer capable of keeping the result of sensory processing on hold, and a slow serial performance at the router stage, resulting in a performance bottleneck. The network captures the detailed dynamics of human behavior during dual-task performance, including both mean RTs and RT distributions, and establishes concrete predictions on neuronal dynamics during dual-task experiments in humans and non-human primates.

Directory of Open Access Journals

PubMed Central

HAL-CEA

Inductive Biases for Deep Learning of Higher-Level Cognition

Author: Bengio Yoshua
Goyal Anirudh
Publication date: 17/02/2021

A fascinating hypothesis is that human and animal intelligence could be explained by a few principles (rather than an encyclopedic list of heuristics). If that hypothesis were correct, we could more easily both understand our own intelligence and build intelligent machines. Just like in physics, the principles themselves would not be sufficient to predict the behavior of complex systems like brains, and substantial computation might be needed to simulate human-like intelligence. This hypothesis would suggest that studying the kind of inductive biases that humans and animals exploit could help both clarify these principles and provide inspiration for AI research and neuroscience theories. Deep learning already exploits several key inductive biases, and this work considers a larger list, focusing on those which concern mostly higher-level and sequential conscious processing. The objective of clarifying these particular principles is that they could potentially help us build AI systems benefiting from humans' abilities in terms of flexible out-of-distribution and systematic generalization, which is currently an area where a large gap exists between state-of-the-art machine learning and human intelligence.
Comment: This document contains a review of the authors' research as part of the requirement of AG's predoctoral exam, an overview of the main contributions of the authors' few recent papers (co-authored with several other co-authors), as well as a vision of proposed future research.

arXiv.org e-Print Archive

Temporal Networks

Author: Jari Saramäki
Petter Holme
Publication venue: Elsevier BV
Publication date: 15/12/2011

A great variety of systems in nature, society and technology -- from the web of sexual contacts to the Internet, from the nervous system to power grids -- can be modeled as graphs of vertices coupled by edges. The network structure, describing how the graph is wired, helps us understand, predict and optimize the behavior of dynamical systems. In many cases, however, the edges are not continuously active. As an example, in networks of communication via email, text messages, or phone calls, edges represent sequences of instantaneous or practically instantaneous contacts. In some cases, edges are active for non-negligible periods of time: e.g., the proximity patterns of inpatients at hospitals can be represented by a graph where an edge between two individuals is on throughout the time they are at the same ward. Like network topology, the temporal structure of edge activations can affect dynamics of systems interacting through the network, from disease contagion on the network of patients to information diffusion over an e-mail network. In this review, we present the emergent field of temporal networks, and discuss methods for analyzing topological and temporal structure and models for elucidating their relation to the behavior of dynamical systems. In the light of traditional network theory, one can see this framework as moving the information of when things happen from the dynamical system on the network, to the network itself. Since fundamental properties, such as the transitivity of edges, do not necessarily hold in temporal networks, many of these methods need to be quite different from those for static networks
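The abstract's remark that transitivity of edges need not hold in temporal networks can be made concrete with a small sketch. It assumes undirected, instantaneous contacts written as `(u, v, t)` triples and requires strictly increasing times along a path. The function name and the contact format are hypothetical illustrations, not notation taken from the review.

```python
from typing import Hashable, Iterable, Set, Tuple

Contact = Tuple[Hashable, Hashable, float]


def reachable(contacts: Iterable[Contact], source: Hashable) -> Set[Hashable]:
    """Return nodes reachable from `source` along time-respecting paths.

    A path is time-respecting if its contact times strictly increase.
    Contacts are treated as undirected (an assumption of this sketch).
    """
    # Earliest known arrival time at each reached node; the source is
    # available from the start.
    arrival = {source: float("-inf")}
    # Scanning contacts in time order means the first arrival recorded
    # for a node is also the earliest one.
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        for a, b in ((u, v), (v, u)):
            if a in arrival and arrival[a] < t and t < arrival.get(b, float("inf")):
                arrival[b] = t
    return set(arrival)


# A-B are in contact at t=2, B-C at t=1.
contacts = [("A", "B", 2), ("B", "C", 1)]
from_a = reachable(contacts, "A")  # C is missing: the B-C contact came too early
from_c = reachable(contacts, "C")  # C -> B at t=1, then B -> A at t=2
```

In the static graph A and C are connected through B, yet here A cannot reach C while C can reach A, which is why many static-network methods do not carry over unchanged.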

arXiv.org e-Print Archive

Crossref

Publikationer från Umeå universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

CERN Document Server