Reinforcement learning in intelligent control : a biologically-inspired approach to the relearning problem
The increasingly complex demands placed on control systems have resulted in a
need for intelligent control, an approach that attempts to meet these demands by emulating
the capabilities found in biological systems. The need to exploit existing knowledge is a
desirable feature of any intelligent control system, and this leads to the relearning problem.
The problem arises when a control system is required to effectively learn new knowledge
whilst exploiting still useful knowledge from past experiences. This thesis describes the
adaptive critic system using reinforcement learning, a computational framework that can
effectively address many of the demands in intelligent control, but is less effective when it
comes to addressing the relearning problem. The thesis argues that biological mechanisms
of reinforcement learning (and relearning) may provide inspiration for developing artificial
intelligent control mechanisms that can better address the relearning problem. A conceptual
model of biological reinforcement learning and relearning is presented, and the thesis
shows how inspiration derived from this model can be used to modify the adaptive critic.
The performance of the modified adaptive critic system on the relearning problem is
investigated based on simulations of the pole balancing problem, and this is compared to
the performance of the original adaptive critic system. The thesis presents an analysis of
the results from these simulations, and discusses the significance of these results in terms
of addressing the relearning problem.
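The adaptive critic idea described above can be illustrated with a minimal sketch, assuming a tabular actor-critic in which a single temporal-difference (TD) error trains both the critic (state values) and the actor (action preferences). This is not the thesis's implementation and does not include its biologically inspired modification; the class name `AdaptiveCritic`, the learning rates, and the discount factor are hypothetical choices made for illustration only.

```python
import math


class AdaptiveCritic:
    """Tabular actor-critic sketch: the critic estimates state values and
    the actor keeps one preference per (state, action) pair.
    All parameter defaults are illustrative assumptions."""

    def __init__(self, n_states, n_actions,
                 alpha_critic=0.1, alpha_actor=0.1, gamma=0.95):
        self.values = [0.0] * n_states
        self.prefs = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha_critic = alpha_critic
        self.alpha_actor = alpha_actor
        self.gamma = gamma

    def policy(self, state):
        """Softmax over the actor's preferences for this state."""
        top = max(self.prefs[state])
        exps = [math.exp(p - top) for p in self.prefs[state]]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, state, action, reward, next_state, done):
        """One learning step; returns the TD error shared by critic and actor."""
        bootstrap = 0.0 if done else self.gamma * self.values[next_state]
        td_error = reward + bootstrap - self.values[state]
        # Critic: move the value estimate toward the TD target.
        self.values[state] += self.alpha_critic * td_error
        # Actor: reinforce (or weaken) the action that was just taken.
        self.prefs[state][action] += self.alpha_actor * td_error
        return td_error
```

In a pole-balancing simulation, `state` would index a discretised cart and pole configuration and `reward` would signal failure; how the state space is discretised is left open here because the abstract does not specify it.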
Final report key contents: main results accomplished by the EU-Funded project IM-CLeVeR - Intrinsically Motivated Cumulative Learning Versatile Robots
This document presents the main scientific and technological achievements of the project IM-CLeVeR. It is organised as follows: 1. Project executive summary: a brief overview of the project vision, objectives and keywords. 2. Beneficiaries of the project and contacts: a list of the Teams (partners) of the project, Team Leaders and contacts. 3. Project context and objectives: the vision of the project and its overall objectives. 4. Overview of work performed and main results achieved: a one-page overview of the main results of the project. 5. Overview of main results per partner: a bullet-point list of the main results per partner. 6. Main achievements in detail, per partner: a thorough explanation of the main results per partner (including collaborative work), with references to the main publications supporting them.
Building Bridges between Perceptual and Economic Decision-Making: Neural and Computational Mechanisms
Investigation into the neural and computational bases of decision-making has proceeded in two parallel but distinct streams. Perceptual decision-making (PDM) is concerned with how observers detect, discriminate, and categorize noisy sensory information. Economic decision-making (EDM) explores how options are selected on the basis of their reinforcement history. Traditionally, the sub-fields of PDM and EDM have employed different paradigms, proposed different mechanistic models, explored different brain regions, and disagreed about whether decisions approach optimality. Nevertheless, we argue that there is a common framework for understanding decisions made in both tasks, under which an agent has to combine sensory information (what is the stimulus) with value information (what is it worth). We review computational models of the decision process typically used in PDM, based around the idea that decisions involve a serial integration of evidence, and assess their applicability to decisions between goods and gambles. Subsequently, we consider the contribution of three key brain regions – the parietal cortex, the basal ganglia, and the orbitofrontal cortex (OFC) – to PDM and EDM, with a focus on the mechanisms by which sensory and reward information are integrated during choice. We find that although the parietal cortex is often implicated in the integration of sensory evidence, there is also evidence for its role in encoding the expected value of a decision. Similarly, although much research has emphasized the role of the striatum and OFC in value-guided choices, they may also play an important role in categorization of perceptual information. In conclusion, we consider how findings from the two fields might be brought together, in order to move toward a general framework for understanding decision-making in humans and other primates.
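The "serial integration of evidence" mentioned above is commonly formalised as a bounded accumulator. The following is a minimal sketch, assuming a simple drift-diffusion process with symmetric decision bounds; it is not a model taken from the review, and the function names and all parameter values (`drift`, `threshold`, `noise`, `dt`) are hypothetical.

```python
import random


def simulate_trial(drift, threshold, noise=1.0, dt=0.05,
                   max_steps=100_000, rng=random):
    """Accumulate noisy evidence until it crosses +threshold or -threshold.

    A positive drift means the upper bound is the correct choice.
    Returns (correct, decision_time).
    """
    evidence = 0.0
    for step in range(1, max_steps + 1):
        evidence += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        if evidence >= threshold:
            return True, step * dt
        if evidence <= -threshold:
            return False, step * dt
    # No bound reached: fall back to the sign of the accumulated evidence.
    return evidence > 0, max_steps * dt


def summarize(drift, threshold, n_trials=1000, seed=0):
    """Return (accuracy, mean decision time) over simulated trials."""
    rng = random.Random(seed)
    results = [simulate_trial(drift, threshold, rng=rng)
               for _ in range(n_trials)]
    accuracy = sum(correct for correct, _ in results) / n_trials
    mean_time = sum(t for _, t in results) / n_trials
    return accuracy, mean_time
```

Calling `summarize(0.5, 1.0)` and `summarize(0.5, 3.0)` compares a low and a high decision bound; in this kind of model a higher bound trades longer decision times for higher accuracy. Extending the accumulator to economic choices would require an assumption about how value enters the drift, which the sketch deliberately leaves out.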
Exploring model-based and model-free reinforcement learning in obsessive-compulsive disorder
SUMMARY: Obsessive-compulsive disorder (OCD) is a common, severe and disabling
neuropsychiatric illness for which current treatments are ineffective in a large
number of cases. The instrument most widely used to assess the severity of
obsessive-compulsive symptoms is the Yale-Brown Obsessive-Compulsive Scale (Y-BOCS),
which was recently revised (Y-BOCS-II). However, its construct validity (both
divergent and convergent) has been reported as moderate, and its criterion validity
for the diagnosis of OCD has never been tested. In the first chapter of this thesis
I tested, for the first time, the criterion validity of the Y-BOCS-II and
demonstrated that a cut-off of 13 (total score) achieves the best balance between
sensitivity and specificity for the diagnosis of OCD. However, I confirmed that its
divergent validity is far from excellent. This last finding led me to look for other
potential markers of OCD.
Several abnormalities have been demonstrated in patients with OCD using
neuropsychological tasks or neuroimaging techniques. However, there is still no
consistent marker for this disorder that can effectively discriminate patients
suffering from OCD, that is sensitive to change after therapeutic interventions,
and that can be mapped onto brain circuits or function. An approach followed in
recent years considers OCD to be characterized by a dysfunction in the brain
systems responsible for action learning. Sequential decision tasks have recently
emerged as an important and sophisticated instrument for studying action learning
in humans through the reinforcement learning (RL) approach. According to the theory
underlying RL, actions can be learned in two distinct ways: a model-based system
works by building an internal model of the dynamics of the environment and uses
that model to plan future behavioral trajectories, as opposed to a model-free
system, which works by storing the estimated value of recently performed actions
and updating those estimates by trial and error. Sequential decision tasks have
been used to establish associations between dysfunction of brain RL systems and
some neuropsychiatric disorders, such as OCD, for which an imbalance between the
model-based and model-free systems has been described. Through the application of
one of these sequential decision tasks, the two-step task, there is evidence
suggesting that patients with OCD have a deficit in the model-based system.
However, in this particular paradigm, individuals receive detailed information
about the structure of the task before performing it. It is therefore unclear how
the two main RL systems interact when individuals learn exclusively through
interaction with the environment, and how explicit information affects RL
strategies. In the second chapter of this thesis, I developed a new sequential
decision task that makes it possible not only to quantify the use of model-based
and model-free RL strategies, but also to differentiate between the impact of
explicit knowledge of the task structure and the impact of experience with the
task. The results of applying the task to healthy individuals show that action
choice is initially controlled by model-free learning, with model-based learning
emerging only in a minority of individuals after significant experience with the
task, and not emerging at all in individuals with OCD, who in turn showed a
tendency to increase their use of model-free RL with experience. When explicit
information about the task structure was given, a dramatic increase in the use of
model-based learning was observed, both in healthy volunteers and in both clinical
groups. Explicit information decreased the use of the model-free learning system in
healthy volunteers and in patients with mood and anxiety disorders, but this
decrease was not statistically significant in the group of patients with OCD. In
addition, after the instructions, in all groups the updating of the value of
actions learned through the model-free system became more influenced by the value
of the states reached and less influenced by trial outcomes. Another effect of
explicit information about the task structure in healthy individuals was to make
choices more perseverative, which is consistent with a change in exploration
strategy. These results help to clarify the profile of RL strategy use in patients
with OCD, who show nonspecific deficits in model-based learning and more specific
findings of greater use of model-free learning, in both cases before obtaining
information about the task structure.
Finally, since the literature has not yet reached a consensus on the interaction
between a putative model-based RL system and a model-free RL system in human brain
circuits, I developed a functional magnetic resonance imaging protocol to assess
sequential action choice with and without instructions. Preliminary results in
healthy individuals suggest that the reduced two-step task makes it possible to
separate behavior that uses predominantly model-free learning (before instructions)
from behavior that uses predominantly model-based learning (after instructions), in
the same individual, task structure and environment. Analysis of the functional
imaging data suggests that explicit knowledge about the task structure modifies
neuronal activity in the paracingulate cortex (medial prefrontal cortex) during the
transition from the first to the second step of the task. Future objectives include
the use of multivariate analysis techniques to explore the brain's representation
of task states and the application of this functional magnetic resonance imaging
protocol in clinical populations.
ABSTRACT: Obsessive-compulsive disorder (OCD) is a common, chronic and disabling
neuropsychiatric condition for which current treatments are ineffective in a large
proportion of cases. The gold-standard instrument to assess the severity of OCD
symptoms is the Yale-Brown Obsessive-Compulsive Scale (Y-BOCS), which was
recently revised (Y-BOCS-II). However, its construct validity has been reported as
moderate and its criterion-related validity for the diagnosis of OCD has never been
tested. In the first chapter of this dissertation, I tested, for the first time, criterion-related
validity of the Y-BOCS-II and demonstrated that a cut-off of 13 (total score) attains the
best balance between sensitivity and specificity for the diagnosis of OCD. However, I
confirmed that its divergent validity is far from excellent. This last finding led me to
search for other potential markers of OCD.
Several abnormalities have been demonstrated in OCD patients in studies
using neuropsychological and neuroimaging approaches, but we still lack a consistent
marker for the disorder which is able to discriminate patients with OCD from healthy
subjects or from patients with other mental disorders, which is sensitive to treatment-induced changes, and which can be mapped to brain circuits or function. An approach
which has been followed over the last decade is considering OCD as a disorder of
action learning systems of the brain. Sequential decision tasks have recently emerged
as an influential and sophisticated tool to investigate action learning in humans through
the reinforcement learning (RL) framework. According to the RL framework, actions
can be learned in two different ways: model-based control works by learning a model
of the dynamics of the environment and later using that model to plan future behavioral
trajectories, while model-free control works by storing the estimated value of recently
taken actions and updating these estimates by trial-and-error. Sequential decision
tasks have been used to assess associations between dysfunction in RL control
systems and certain behavioral disorders, such as OCD, where an imbalance between
model-based and model-free RL has been hypothesized. In fact, using the most
commonly applied sequential decision task, the two-step task, evidence has been
produced suggesting that OCD patients have a deficit in model-based learning.
However, in this specific paradigm, subjects typically receive detailed information
about task structure prior to performing the task. Thus, it remains unclear how different
RL systems contribute when subjects learn exclusively from experience, and how
explicit information about task structure modifies RL strategy. To address these
questions, I created a sequential decision task requiring minimal prior instruction, the
reduced two-step task. I assessed performance both prior to and after delivering
explicit information on task structure, in healthy volunteers, patients with OCD and
patients with other mood and anxiety disorders. Initially model-free control dominated,
with model-based control emerging only in a minority of subjects after significant task
experience, and not at all in patients with OCD, who had instead a tendency to
increase their use of model-free control. Once explicit information about task structure
was provided, a dramatic increase in the use of model-based RL was observed, similarly across healthy volunteers and both patient groups, including OCD. The
debriefing also significantly decreased the use of model-free RL in healthy volunteers
and in patients with mood and anxiety disorders, but not in OCD patients. Additionally,
after instructions, model-free action value updates were influenced more by state
values and less by trial outcomes, in all groups, and subject choices became more
perseverative in healthy subjects, consistent with changes in exploration strategy.
These results help clarify the RL profile of patients with OCD, with nonspecific
findings of deficient model-based control, and more specific findings of enhanced
model-free control, in both cases prior to information about task structure.
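The distinction drawn above between model-free and model-based control can be made concrete with a minimal sketch, assuming a two-step structure with two first-step actions and two second-step states. This is not the reduced two-step task or the analysis model used in the dissertation; the transition probabilities in `TRANSITION`, the learning rate, and all function names are hypothetical.

```python
def model_free_update(q_mf, action, reward, alpha=0.5):
    """Trial and error: nudge the cached value of the chosen action
    toward the outcome that followed it."""
    q = list(q_mf)
    q[action] += alpha * (reward - q[action])
    return q


def state_value_update(state_values, state, reward, alpha=0.5):
    """Learn the value of the second-step state that was reached."""
    v = list(state_values)
    v[state] += alpha * (reward - v[state])
    return v


def model_based_values(transition, state_values):
    """Planning: expected second-step value of each first-step action
    under the learned transition model."""
    return [sum(p * v for p, v in zip(row, state_values))
            for row in transition]


# Hypothetical model: action 0 usually leads to state 0, action 1 to state 1.
TRANSITION = [[0.7, 0.3], [0.3, 0.7]]

# One trial: action 0 is chosen, a rare transition leads to state 1,
# and a reward is delivered there.
q_mf = model_free_update([0.0, 0.0], action=0, reward=1.0)
state_values = state_value_update([0.0, 0.0], state=1, reward=1.0)
q_mb = model_based_values(TRANSITION, state_values)
```

After this rewarded rare transition the two systems disagree: the model-free values favor repeating action 0, whereas the model-based values favor action 1, which is the more reliable route to the rewarded state. This kind of divergence is what allows sequential decision tasks to separate the two strategies.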
Finally, as the literature has not yet reached a consensus on how model-free and model-based RL systems interact in human brain circuits, I developed a functional magnetic
resonance imaging (fMRI) protocol to assess uninstructed and instructed sequential
action choice. Preliminary results in healthy subjects suggest that the fMRI version of
the reduced two-step task makes it possible to separate predominantly model-free control (before
instructions) from predominantly model-based control (after instructions), in the same
subject, task structure and environment. Across all sessions, choice events were
associated with increased blood-oxygen-level-dependent (BOLD) activity in the left
precentral gyrus and reward events were associated with increased BOLD activity in
the ventral striatum. I found that explicit knowledge about task structure modifies
BOLD activity in the paracingulate cortex (medial
prefrontal cortex) during the transition from the first to the second step of the task.
Future directions include using multivariate pattern analysis techniques to explore how
the brain represents state space in sequential decision tasks and applying the current
fMRI protocol in clinical populations.
Strategic Cognitive Sequencing: A Computational Cognitive Neuroscience Approach
We address strategic cognitive sequencing, the “outer loop” of human cognition: how the brain decides what cognitive process to apply at a given moment to solve complex, multistep cognitive tasks. We argue that this topic has been neglected relative to its importance for systematic reasons but that recent work on how individual brain systems accomplish their computations has set the stage for productively addressing how brain regions coordinate over time to accomplish our most impressive thinking. We present four preliminary neural network models. The first addresses how the prefrontal cortex (PFC) and basal ganglia (BG) cooperate to perform trial-and-error learning of short sequences; the next, how several areas of PFC learn to make predictions of likely reward, and how this contributes to the BG making decisions at the level of strategies. The third model addresses how PFC, BG, parietal cortex, and hippocampus can work together to memorize sequences of cognitive actions from instruction (or “self-instruction”). The last shows how a constraint satisfaction process can find useful plans: the PFC maintains current and goal states and associates from both of these to find a “bridging” state, an abstract plan. We discuss how these processes could work together to produce strategic cognitive sequencing and discuss future directions in this area.
Adaptive networks for robotics and the emergence of reward anticipatory circuits
Currently the central challenge facing evolutionary robotics is to determine
how best to extend the range and complexity of behaviour supported by evolved
neural systems. Implicit in the work described in this thesis is the idea that this
might best be achieved through devising neural circuits (tractable to evolutionary
exploration) that exhibit complementary functional characteristics. We concentrate
on two problem domains: locomotion and sequence learning. For locomotion
we compare the use of GasNets and other adaptive networks. For sequence learning
we introduce a novel connectionist model inspired by the role of dopamine
in the basal ganglia (commonly interpreted as a form of reinforcement learning).
This connectionist approach relies upon a new neuron model inspired by notions
of energy efficient signalling. Two reward adaptive circuit variants were investigated.
These were applied respectively to two learning problems: first, where action
sequences are required to take place in a strict order, and secondly, where action
sequences are robust to intermediate arbitrary states. We conclude the thesis
by proposing a formal model of functional integration, encompassing locomotion
and sequence learning, extending ideas proposed by W. Ross Ashby.
A general model of the adaptive replicator is presented, incorporating subsystems
that are tuned to continuous variation and discrete or conditional events.
Comparisons are made with W. Ross Ashby's model of ultrastability and his
ideas on adaptive behaviour. This model is intended to support our assertion
that GasNets (and similar networks) and reward adaptive circuits of the type
presented here are intrinsically complementary. In conclusion we present some
ideas on how the co-evolution of GasNet and reward adaptive circuits might lead
us to significant improvements in the synthesis of agents capable of exhibiting
complex adaptive behaviour.
A computational model of cortical-striatal mediation of speed-accuracy tradeoff and habit formation emerging from anatomical gradients in dopamine physiology and reinforcement learning
Decision making – committing to a single action from a plethora of viable alternatives – is a necessity for all motile creatures, each moving a single body to many possible destinations. Some decisions are better than others. For example, to a rat deciding between one path that will bring it to a piece of cheese and another that will bring it to the jaws of a cat, there is a clear reason for the rat to prefer one choice over the other. Two criteria for adjusting decision making for optimal outcome are to make decisions as accurately as possible – choose the course of action most likely to result in the preferred outcome – but also to decide as fast as possible. Because these criteria often conflict, decision making has an inherent “speed-accuracy tradeoff”.
Presented here is a computational neural model of decision making, which incorporates neurobiological design principles that optimize this tradeoff via reward-guided transfers of control between two sensory processing systems with different speed/accuracy characteristics. This model incorporates anatomical and physiological evidence that dopamine, the key neurotransmitter in reinforcement learning, has varying effects in different sub-regions of the basal ganglia, a subcortical structure that interfaces with the neocortex to control behavior. Based on the observed differences between these sub-regions, the model proposes that gradual adaptations of synaptic links by reinforcement learning signals lead to rapid changes in the speed and accuracy of decision making, by assigning control of behavior to alternative cortical representations. Chapter one draws conceptual links from experimental data to the design of the proposed model. Chapter two applies the model to speed-accuracy tradeoffs and habit formation by simulating forced-choice paradigms. Several robust behavioral phenomena are replicated.
By isolating reinforcement learning factors that control the speed and depth of habit formation, the model can help explain why all substances that strongly and synergistically affect such factors share a high potential for habit formation, or habit abatement. To illustrate such potential applications of the current model, chapter three investigates effects of varying model parameters in accord with the known neurochemical effects of some major habit-forming substances, such as cocaine and ethanol.
Attentional control in categorisation: towards a computational synthesis
This thesis develops an integrated computational model of task switching in heterogeneous
categorisation by combining theories of cognitive control and category learning. The thesis
considers the strengths and shortcomings of a range of existing computational accounts of
categorisation (ALCOVE, SUSTAIN, ATRIUM and COVIS) by reimplementing each and
applying each to human data from the categorisation literature. It is argued that most of these
models cannot account for heterogeneous categorisation, i.e., situations where the category
structure includes subsets with incompatible boundaries. Moreover, the only one of the four
computational models that can account for heterogeneous categorisation, ATRIUM, does not
completely account for the influence of top-down control during categorisation tasks. The
models are also limited because they are based purely on feedforward principles, and while they
are able to learn to categorise stimuli adequately, they do not account for categorisation response
times, or for task-switching effects observed in recent research on heterogeneous categorisation.
In order to address these limitations, the thesis presents a model that combines an interactive
activation account of task-switching with a modular architecture of categorisation. The model is
shown to successfully simulate reaction time costs and effects of preparation time on task
switching.