Action selection in modular reinforcement learning
Modular reinforcement learning is an approach to resolving the curse of dimensionality in traditional reinforcement learning. We design and implement a modular reinforcement learning algorithm based on three major components: Markov decision process decomposition, module training, and global action selection. We define and formalize the concepts of module class and module instance in the decomposition step. Under our decomposition framework, we train each module efficiently using the SARSA(lambda) algorithm. We then design, implement, test, and compare three action selection algorithms based on different heuristics: Module Combination, Module Selection, and Module Voting. For the last two algorithms, we propose a method to compute module weights efficiently, using the standard deviation of each module's Q-values. We show that the Module Combination and Module Voting algorithms produce satisfactory performance in our test domain.
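Purely as an illustration of the weighting heuristic this abstract describes, here is a minimal tabular sketch in Python. All function names are my own, and the paper's exact formulas may differ: each module's weight is taken as the normalized standard deviation of its Q-values in the current state (a module whose action values are nearly flat is indifferent, so its opinion counts for less), and the weights drive either a combined greedy choice or a weighted vote.

```python
import numpy as np

def module_weights(q_tables, state):
    """Weight each module by the standard deviation of its Q-values
    in the current state; modules with sharply differing action
    values care more about the choice than modules with flat values."""
    stds = np.array([np.std(q[state]) for q in q_tables])
    total = stds.sum()
    return stds / total if total > 0 else np.ones(len(q_tables)) / len(q_tables)

def module_combination(q_tables, state):
    """Module Combination: pick the action maximizing the weighted
    sum of the modules' Q-values."""
    w = module_weights(q_tables, state)
    combined = sum(wi * q[state] for wi, q in zip(w, q_tables))
    return int(np.argmax(combined))

def module_voting(q_tables, state):
    """Module Voting: each module casts a weighted vote for its own
    greedy action; the action with the most vote mass wins."""
    w = module_weights(q_tables, state)
    votes = np.zeros(q_tables[0].shape[1])
    for wi, q in zip(w, q_tables):
        votes[int(np.argmax(q[state]))] += wi
    return int(np.argmax(votes))
```

Here each entry of `q_tables` is one module's table of shape (n_states, n_actions); Module Selection, the third heuristic, would instead defer entirely to the single highest-weighted module.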
Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning
Temporal difference (TD) methods constitute a class of methods for learning
predictions in multi-step prediction problems, parameterized by a recency
factor lambda. Currently the most important application of these methods is to
temporal credit assignment in reinforcement learning. Well known reinforcement
learning algorithms, such as AHC or Q-learning, may be viewed as instances of
TD learning. This paper examines the issues of the efficient and general
implementation of TD(lambda) for arbitrary lambda, for use with reinforcement
learning algorithms optimizing the discounted sum of rewards. The traditional
approach, based on eligibility traces, is argued to suffer from both
inefficiency and lack of generality. The TTD (Truncated Temporal Differences)
procedure is proposed as an alternative that only approximates TD(lambda) but
requires very little computation per action and can be used with arbitrary
function representation methods. The idea from which it is derived is fairly
simple and not new, but has apparently remained unexplored so far.
Encouraging experimental results are presented, suggesting that using lambda
> 0 with the TTD procedure yields a significant learning speedup at
essentially the same cost as conventional TD(0) learning.
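The core idea is to truncate the lambda-return after a fixed horizon of m steps. A minimal prediction-only sketch follows, assuming a tabular value function V and a buffer of the m most recent transitions; this buffer-plus-backward-pass formulation is an illustrative reconstruction, not the paper's exact pseudocode.

```python
def ttd_update(V, buffer, gamma, lam, alpha):
    """One TTD(lambda, m) update, applied once `buffer` holds the m
    most recent transitions (state, reward, next_state), oldest first.
    V maps states to value estimates (e.g., a dict or array)."""
    # Seed the recursion with the current estimate beyond the horizon.
    z = V[buffer[-1][2]]
    # Backward pass over the buffer:
    #   z_k = r_k + gamma * ((1 - lam) * V(s_{k+1}) + lam * z_{k+1})
    for _, reward, next_state in reversed(buffer):
        z = reward + gamma * ((1 - lam) * V[next_state] + lam * z)
    # Move the oldest state's estimate toward its truncated lambda-return.
    s0 = buffer[0][0]
    V[s0] += alpha * (z - V[s0])
```

The per-action cost is a single length-m pass over the buffer, independent of the size of the state space, whereas classical eligibility traces require touching a trace for every state (or every parameter) on each step.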
Eligibility Traces and Plasticity on Behavioral Time Scales: Experimental Support of neoHebbian Three-Factor Learning Rules
Most elementary behaviors such as moving the arm to grasp an object or
walking into the next room to explore a museum evolve on the time scale of
seconds; in contrast, neuronal action potentials occur on the time scale of a
few milliseconds. Learning rules of the brain must therefore bridge the gap
between these two different time scales.
Modern theories of synaptic plasticity have postulated that the co-activation
of pre- and postsynaptic neurons sets a flag at the synapse, called an
eligibility trace, that leads to a weight change only if an additional factor
is present while the flag is set. This third factor, signaling reward,
punishment, surprise, or novelty, could be implemented by the phasic activity
of neuromodulators or specific neuronal inputs signaling special events. While
the theoretical framework has been developed over the last decades,
experimental evidence in support of eligibility traces on the time scale of
seconds has been collected only during the last few years.
Here we review, in the context of three-factor rules of synaptic plasticity,
four key experiments that support the role of synaptic eligibility traces in
combination with a third factor as a biological implementation of neoHebbian
three-factor learning rules.
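A schematic single-synapse simulation of the rule described above might look like the following sketch, in which Hebbian co-activation sets a decaying eligibility trace e and the weight w changes only while a third factor arrives before e has decayed. Variable names, time constants, and the exact update form are illustrative assumptions, not a specific published model.

```python
def simulate_three_factor(pre, post, modulator, tau_e=2.0, eta=0.1, dt=0.1):
    """Simulate one synapse under a schematic three-factor rule.
    pre, post: per-step binary spike indicators; modulator: third
    factor (e.g., phasic reward, punishment, surprise, or novelty)."""
    w, e = 0.5, 0.0
    for p, q, m in zip(pre, post, modulator):
        # Co-activation of pre- and postsynaptic neurons sets or
        # refreshes the eligibility flag; otherwise it decays with
        # time constant tau_e (on the order of seconds).
        e += dt * (-e / tau_e) + (1.0 if p and q else 0.0)
        # The weight changes only when the third factor is present
        # while the flag is still set.
        w += eta * m * e
    return w
```

The seconds-scale decay of e is exactly what lets a millisecond-scale spike coincidence be credited by a reward that arrives much later, bridging the two time scales the review emphasizes.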
Retrospective model-based inference guides model-free credit assignment
An extensive reinforcement learning literature shows that organisms assign credit efficiently, even under conditions of state uncertainty. However, little is known about credit assignment when state uncertainty is subsequently resolved. Here, we address this problem within the framework of an interaction between model-free (MF) and model-based (MB) control systems. We present, and support experimentally, a theory of MB retrospective inference. Within this framework, a MB system resolves uncertainty that prevailed when actions were taken, thereby guiding MF credit assignment. Using a task with initial uncertainty about which lotteries had been chosen, we found that when participants’ momentary uncertainty about which lottery had generated an outcome was resolved by subsequent information, they preferentially assigned credit within a MF system to the lottery they retrospectively inferred was responsible for the outcome. These findings extend our knowledge of the range of MB functions and the scope of system interactions.
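The paper reports behavioral evidence rather than an algorithm, but the mechanism it describes can be caricatured in a few lines. In this toy sketch (the function name and weighting scheme are my assumptions), a standard MF delta-rule update is apportioned across lotteries according to a retrospectively inferred MB posterior over which lottery generated the outcome.

```python
def retrospective_mf_update(q, outcome, posterior, alpha=0.2):
    """Assign MF credit in proportion to the MB-inferred probability
    that each lottery generated the observed outcome.
    q: per-lottery MF value estimates (mutable sequence of floats).
    posterior: per-lottery responsibility, resolved retrospectively."""
    for i, p in enumerate(posterior):
        q[i] += alpha * p * (outcome - q[i])
    return q
```

When later information fully resolves the uncertainty, the posterior collapses to one lottery and this reduces to ordinary MF credit assignment to the inferred cause.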