Speed/Accuracy Trade-Off between the Habitual and the Goal-Directed Processes
Instrumental responses are hypothesized to be of two kinds: habitual and goal-directed, mediated by the sensorimotor and the associative cortico-basal ganglia circuits, respectively. The existence of the two heterogeneous associative learning mechanisms can be hypothesized to arise from the comparative advantages that they have at different stages of learning. In this paper, we assume that the goal-directed system is behaviourally flexible, but slow in choice selection. The habitual system, in contrast, is fast in responding, but inflexible in adapting its behavioural strategy to new conditions. Based on these assumptions and using the computational theory of reinforcement learning, we propose a normative model for arbitration between the two processes that makes an approximately optimal balance between search-time and accuracy in decision making. Behaviourally, the model can explain experimental evidence on behavioural sensitivity to outcome at the early stages of learning, but insensitivity at the later stages. It also explains that when two choices with equal incentive values are available concurrently, the behaviour remains outcome-sensitive, even after extensive training. Moreover, the model can explain choice reaction time variations during the course of learning, as well as the experimental observation that as the number of choices increases, the reaction time also increases. Neurobiologically, by assuming that phasic and tonic activities of midbrain dopamine neurons carry the reward prediction error and the average reward signals used by the model, respectively, the model predicts that whereas phasic dopamine indirectly affects behaviour through reinforcing stimulus-response associations, tonic dopamine can directly affect behaviour through manipulating the competition between the habitual and the goal-directed systems and thus affect reaction time.
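As a rough illustration of the arbitration principle described above (the paper's own formulation differs in detail), the following Python sketch compares a value-of-perfect-information-style benefit of deliberating on each action against an opportunity cost given by the average reward rate times the deliberation time. All function names and numerical values are illustrative assumptions, not the paper's equations.

```python
import numpy as np
from scipy.stats import norm

def expected_excess(mu, sigma, c):
    """Closed-form E[max(0, X - c)] for X ~ N(mu, sigma^2)."""
    z = (mu - c) / sigma
    return (mu - c) * norm.cdf(z) + sigma * norm.pdf(z)

def benefit_of_deliberation(q_mean, q_std):
    """Per-action benefit of letting the slow, goal-directed system re-estimate
    that action's value. For the currently best action, deliberation pays off if
    its true value turns out to be below the runner-up; for any other action,
    it pays off if its true value exceeds the current best."""
    order = np.argsort(q_mean)[::-1]
    best, runner_up = order[0], order[1]
    vpi = np.empty(len(q_mean))
    for a in range(len(q_mean)):
        if a == best:
            vpi[a] = expected_excess(q_mean[runner_up], q_std[a], q_mean[a])
        else:
            vpi[a] = expected_excess(q_mean[a], q_std[a], q_mean[best])
    return vpi

def arbitrate(q_mean, q_std, avg_reward, deliberation_time):
    """Engage the goal-directed system only where the expected benefit of
    deliberating outweighs the foregone average reward during deliberation."""
    cost = avg_reward * deliberation_time
    return benefit_of_deliberation(q_mean, q_std) > cost

# Early in learning the cached values are uncertain, so deliberation is worthwhile;
# after extensive training uncertainty shrinks and the fast habitual system takes over.
print(arbitrate(np.array([1.0, 0.8]), np.array([0.5, 0.5]),
                avg_reward=0.1, deliberation_time=1.0))   # [ True  True]
print(arbitrate(np.array([1.0, 0.8]), np.array([0.02, 0.02]),
                avg_reward=0.1, deliberation_time=1.0))   # [False False]
```

Under these assumed numbers, the sketch reproduces the qualitative pattern described above: outcome-sensitive, deliberative control early in training, and fast habitual control once the cached values become precise.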
Computational processes of simultaneous learning of stochasticity and volatility in humans
Adapting to uncertain environments is crucial for survival. This study explores computational challenges in distinguishing two types of noise: volatility and stochasticity. Volatility refers to diffusion noise in latent causes, calling for a higher learning rate, while stochasticity introduces moment-to-moment observation noise and calls for a lower learning rate. For the learner, dissociating their effects is challenging because both increase the variance of observations. Previous research examined these factors separately, but it remains unclear whether and how humans dissociate them. In two large-scale experiments, through novel behavioral tasks and computational modeling, we report compelling evidence that humans dissociate volatility and stochasticity solely on the basis of their observations. We observed contrasting effects of volatility and stochasticity on learning rates, consistent with statistical principles. These results are consistent with a computational model that estimates volatility and stochasticity by balancing their dueling effects, but not with a number of other models that fail to make this distinction. This research elucidates the computational processes behind adaptive learning in uncertain environments.
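A minimal numerical illustration of the contrasting effects just described (not the study's full model): in a Kalman-filter view of tracking a drifting latent cause, the effective learning rate is the gain k = (w + v) / (w + v + s), which grows with volatility v and shrinks with stochasticity s. Variable names and values below are assumptions for illustration.

```python
def kalman_gain(posterior_variance, volatility, stochasticity):
    """Effective learning rate when tracking a drifting latent cause."""
    total = posterior_variance + volatility
    return total / (total + stochasticity)

print(kalman_gain(0.1, volatility=1.0, stochasticity=0.1))  # volatile world -> high learning rate (~0.92)
print(kalman_gain(0.1, volatility=0.1, stochasticity=1.0))  # noisy world    -> low learning rate (~0.17)
```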
A simple model for learning in volatile environments.
Sound principles of statistical inference dictate that uncertainty shapes learning. In this work, we revisit the question of learning in volatile environments, in which both the first- and second-order statistics of observations dynamically evolve over time. We propose a new model, the volatile Kalman filter (VKF), which is based on a tractable state-space model of uncertainty and extends the Kalman filter algorithm to volatile environments. The proposed model is algorithmically simple and encompasses the Kalman filter as a special case. Specifically, in addition to the error-correcting rule of the Kalman filter for learning observations, the VKF learns volatility according to a second error-correcting rule. These dual updates echo and contextualize classical psychological models of learning, in particular hybrid accounts of Pearce-Hall and Rescorla-Wagner. At the computational level, compared with existing models, the VKF gives up some flexibility in the generative model to enable a more faithful approximation to exact inference. When fit to empirical data, the VKF is better behaved than alternatives and better captures human choice data in two independent datasets of probabilistic learning tasks. The proposed model provides a coherent account of learning in stable or volatile environments and has implications for decision neuroscience research.
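A simplified Python sketch of such a dual error-correcting update follows (the paper gives the exact VKF equations; the volatility update here is deliberately condensed, and all parameter values are assumed for illustration). A Kalman-style rule updates the estimated mean, and a second error-correcting rule updates the volatility estimate, which feeds back into the learning rate.

```python
import numpy as np

def vkf_like_update(m, w, v, obs, obs_noise, vol_lr):
    """One trial of a VKF-like dual update (illustrative simplification).
    m, w      : posterior mean and variance of the latent cause
    v         : current volatility estimate
    obs       : new observation; obs_noise: observation-noise variance (stochasticity)
    vol_lr    : volatility learning rate (assumed fixed here)"""
    k = (w + v) / (w + v + obs_noise)      # learning rate grows with volatility,
                                           # shrinks with observation noise
    m_new = m + k * (obs - m)              # first error-correcting rule (Kalman-like)
    w_new = (1 - k) * (w + v)              # updated posterior variance
    # Second error-correcting rule: volatility tracks the squared change in the mean,
    # since large jumps in the estimate are evidence for a volatile environment.
    vol_error = (m_new - m) ** 2 + w_new - v
    v_new = v + vol_lr * vol_error
    return m_new, w_new, max(v_new, 1e-8)

# Example: track a signal that jumps halfway through; the volatility estimate (and
# hence the learning rate) rises after the jump and relaxes again once it is absorbed.
rng = np.random.default_rng(0)
signal = np.r_[np.zeros(50), 3 * np.ones(50)]
observations = signal + rng.normal(0, 0.5, size=100)
m, w, v = 0.0, 1.0, 0.1
for o in observations:
    m, w, v = vkf_like_update(m, w, v, o, obs_noise=0.25, vol_lr=0.1)
```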
Prediction of different dual-process accounts about the dominant process after extensive training.
Tree representation of the reversal learning task, used in [27], and the behavioural results.
(A) When each trial begins, one of the two stimuli, or , is presented at random on a screen. The subject can then choose whether to touch the screen ( action) or not ( action). The task is performed in three phases: training, reversal, and extinction. During the training phase, the subject receives a reward if the stimulus is presented and the action is performed, or if the stimulus is presented and the action is selected (). During the reversal phase, the reward function is reversed, meaning that the action must be chosen when the stimulus is presented, and vice versa (). Finally, during the extinction phase, regardless of the presented stimulus, only the action leads to a reward (). (B) During both the training and reversal phases, subjects' reaction time is high at the early stages, when they do not yet have enough experience with the new conditions. However, after some trials, the reaction time declines significantly. Error bars represent .
Simulation results for the task of Figure 4.
The results show that since the reinforcing value of the two outcomes is equal, there is a large overlap between the distribution functions over the -values of actions and , at state , even after extensive training (240 trials) (Plots and ). Accordingly, the signals (benefit of goal-directed deliberation) for these two actions remain higher than the signal (cost of deliberation) (Plot ) and thus, the goal-directed system is always engaged in value-estimation for these two choices. The behaviourally observable result is that responding remains sensitive to revaluation of outcomes, even though devaluation has happened after a prolonged training period (Plots and ).
Simulation results of the model in the reversal learning task depicted in Figure 6.
Since the signals have high values at the early stages of learning (plot ), the goal-directed system is active and thus the deliberation time is relatively high (plot ). After further training, the habitual system takes control over behaviour (plot ) and, as a result, the model's reaction time decreases (plot ). After reversal, it takes some trials for the habitual system to realize that the cached -values are no longer accurate (equivalent to an increase in the variance of ). Thus, a few trials after reversal, the signal increases again (plot ), which results in re-activation of the goal-directed system. As a result, the model's reaction time increases again (plot ). A similar explanation holds for the rest of the trials. In sum, consistent with the experimental data, the reaction time is higher during the searching period than during the applying period.
Tree representation of the devaluation experiment with two levers available concurrently.
(A) In the training phase, either pressing lever one or pressing lever two , if followed by entering the magazine , results in acquiring one unit of either of the two rewards, or , respectively. The reinforcing value of the two rewards is equal to one. Other action sequences lead to no reward. As in the task of Figure 2, this task is also assumed to be cyclic. (B) In the devaluation phase, the outcome of one of the responses () is devalued (), whereas the rewarding value of the outcome of the other response () remains unchanged. After the devaluation phase, the animal's behaviour is tested in extinction (for space considerations, this phase is not illustrated). Similar to the task of Figure 2, neither nor is delivered to the animal in the test phase.
Goal-directed and habitual decision making under stress in Gambling Disorder: an fMRI study
The development of addictive behaviors has been suggested to be related to a transition from goal-directed to habitual decision making. Stress is a factor known to prompt habitual behavior and to increase the risk for addiction and relapse. In the current study, we therefore used functional MRI to investigate the balance between goal-directed ‘model-based’ and habitual ‘model-free’ control systems and whether acute stress would differentially shift this balance in gambling disorder (GD) patients compared to healthy controls (HCs). Using a within-subject design, 22 patients with GD and 20 HCs underwent stress induction or a control condition before performing a multistep decision-making task during fMRI. Salivary cortisol levels showed that the stress induction was successful. Contrary to our hypothesis, GD patients showed intact goal-directed decision making, which remained similar to that of HCs after stress induction. Bayes factors provided substantial evidence against a difference between the groups, or a group-by-stress interaction, on the balance between model-based and model-free decision making. Similarly, neural estimates did not differ between groups and conditions. These results challenge the notion that GD is related to an increased reliance on habitual (or decreased goal-directed) control, even during stress.
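For context, the balance referred to above is commonly quantified in this literature by a weighting parameter that mixes model-based and model-free action values before a softmax choice rule. The sketch below is purely illustrative and is not the specific model fitted in this study; all names and values are assumptions.

```python
import numpy as np

def mixed_action_values(q_model_based, q_model_free, w):
    """w = 1 -> purely goal-directed control; w = 0 -> purely habitual control."""
    return w * np.asarray(q_model_based) + (1 - w) * np.asarray(q_model_free)

def choice_probabilities(q_values, inverse_temperature=3.0):
    """Softmax choice rule over the mixed action values."""
    z = inverse_temperature * np.asarray(q_values)
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

# A hypothetical participant weighted 80% toward the goal-directed system.
print(choice_probabilities(mixed_action_values([0.7, 0.3], [0.4, 0.6], w=0.8)))
```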