
    Functional Requirements for Reward-Modulated Spike-Timing-Dependent Plasticity

    Recent experiments have shown that spike-timing-dependent plasticity is influenced by neuromodulation. We derive theoretical conditions for successful learning of reward-related behavior for a large class of learning rules where Hebbian synaptic plasticity is conditioned on a global modulatory factor signaling reward. We show that all learning rules in this class can be separated into a term that captures the covariance of neuronal firing and reward and a second term that represents the influence of unsupervised learning. The unsupervised term, which is, in general, detrimental for reward-based learning, can be suppressed if the neuromodulatory signal encodes the difference between the reward and the expected reward, but only if the expected reward is calculated separately for each task and stimulus. If several tasks are to be learned simultaneously, the nervous system needs an internal critic that is able to predict the expected reward for arbitrary stimuli. We show that, with a critic, reward-modulated spike-timing-dependent plasticity is capable of learning motor trajectories with a temporal resolution of tens of milliseconds. The relation to temporal difference learning, the relevance of block-based learning paradigms, and the limitations of learning with a critic are discussed.
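
    A minimal sketch of the decomposition described above, in notation assumed here for illustration (not taken from the paper): let $e_{ij}$ denote a Hebbian eligibility term for the synapse from neuron $j$ to neuron $i$ and $M$ the global modulatory factor, so that the expected weight change splits as
    \[
    \Delta w_{ij} \;\propto\; \langle M\, e_{ij} \rangle
    \;=\; \underbrace{\operatorname{Cov}(M, e_{ij})}_{\text{reward-following term}}
    \;+\; \underbrace{\langle M \rangle\, \langle e_{ij} \rangle}_{\text{unsupervised term}} .
    \]
    Choosing $M = R - \langle R \rangle$, with the expected reward $\langle R \rangle$ estimated separately for each task and stimulus, drives $\langle M \rangle$ toward zero and suppresses the unsupervised term, which is the condition stated above.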

    Towards comprehensive observing and modeling systems for monitoring and predicting regional to coastal sea level

    A major challenge for managing impacts and implementing effective mitigation measures and adaptation strategies for coastal zones affected by future sea level (SL) rise is our limited capacity to predict SL change at the coast on relevant spatial and temporal scales. Predicting coastal SL requires the ability to monitor and simulate a multitude of physical processes affecting SL, from local effects of wind waves and river runoff to remote influences of the large-scale ocean circulation on the coast. Here we assess our current understanding of the causes of coastal SL variability on monthly to multi-decadal timescales, including geodetic, oceanographic and atmospheric aspects of the problem, and review available observing systems informing on coastal SL. We also review the ability of existing models and data assimilation systems to estimate coastal SL variations and of atmosphere-ocean global coupled models and related regional downscaling efforts to project future SL changes. We discuss (1) observational gaps and uncertainties, and priorities for the development of an optimal and integrated coastal SL observing system, (2) strategies for advancing model capabilities in forecasting short-term processes and projecting long-term changes affecting coastal SL, and (3) possible future developments of sea level services enabling better connection of scientists and user communities and facilitating assessment and decision making for adaptation to future coastal SL change.

    RP was funded by NASA grant NNH16CT00C. CD was supported by the Australian Research Council (FT130101532 and DP 160103130), the Scientific Committee on Oceanic Research (SCOR) Working Group 148, funded by national SCOR committees and a grant to SCOR from the U.S. National Science Foundation (Grant OCE-1546580), and the Intergovernmental Oceanographic Commission of UNESCO/International Oceanographic Data and Information Exchange (IOC/IODE) IQuOD Steering Group. SJ was supported by the Natural Environment Research Council under Grant Agreement No. NE/P01517/1 and by the EPSRC NEWTON Fund Sustainable Deltas Programme, Grant Number EP/R024537/1. RvdW received funding from NWO, Grant 866.13.001. WH was supported by NASA (NNX17AI63G and NNX17AH25G). CL was supported by NASA Grant NNH16CT01C. This work is a contribution to the PIRATE project funded by CNES (to TP). PT was supported by the NOAA Research Global Ocean Monitoring and Observing Program through its sponsorship of UHSLC (NA16NMF4320058). JS was supported by EU contract 730030 (call H2020-EO-2016, “CEASELESS”). JW was supported by EU Horizon 2020 Grant 633211, AtlantOS.

    Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

    Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward-modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong Mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris water maze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties.
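
    As a concrete illustration of the family of rules studied here, the following Python sketch implements a generic three-factor update (all names, constants, and the Euler discretization are illustrative assumptions, not the paper's implementation): a Hebbian coincidence term feeds a decaying eligibility trace, and the weights only move when a reward-related factor gates them.

    import numpy as np

    def three_factor_step(w, pre, post, elig, reward, baseline,
                          lr=0.01, tau_e=0.5, dt=0.001):
        """One Euler step of a generic reward-modulated Hebbian rule.

        pre, post : filtered pre-/postsynaptic activity (vectors)
        elig      : eligibility trace, same shape as w
        reward    : instantaneous reward signal
        baseline  : running estimate of the expected reward
        """
        # Hebbian coincidences accumulate in a decaying eligibility trace.
        elig += dt * (-elig / tau_e + np.outer(post, pre))
        # The weight change is gated by the reward prediction error.
        w += lr * dt * (reward - baseline) * elig
        return w, elig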

    Models of Reward-Modulated Spike-Timing-Dependent Plasticity

    How do animals learn to repeat behaviors that lead to the acquisition of food or other “rewarding” objects? As a biologically plausible paradigm for learning in spiking neural networks, spike-timing-dependent plasticity (STDP) has been shown to perform well in unsupervised learning tasks such as receptive field development. However, STDP fails to take behavioral relevance into account, and as such is inadequate to explain a vast range of learning tasks in which the final outcome, conditioned on the prior execution of a series of actions, is signaled to an animal through sparse rewards. In this thesis, I show that the addition of a third, global, reward-based factor to the pre- and postsynaptic factors of STDP is a promising solution to this problem, consistent with experimental findings. On the one hand, dopamine is a neuromodulator which has been shown to encode reward signals in the brain. On the other hand, STDP has been shown to be affected by dopamine, even though the precise nature of the interaction is unclear. Moreover, the theoretical framework of reinforcement learning provides a strong foundation for the analysis of these learning rules. After studying existing examples of such rules in a navigation task, I derive simple functional requirements for reward-modulated learning rules, and illustrate these in a motor learning task. One of those functional requirements is the existence of a “critic” structure that constantly evaluates the potential for rewarding events. The implications of the existence of such a critic for the interpretation of psychophysical experiments are also discussed. Finally, I propose a biologically plausible implementation of such a structure that performs motor or navigational tasks. This is based on a generalization of temporal difference learning, a well-known reinforcement learning framework, to continuous time, which is well suited to an implementation with spiking neurons. These results provide a unified picture of reward-modulated learning rules: even though different rules have been proposed, they can be reduced to a single model at the synaptic level, with variations in the computation of the neuromodulatory signal enabling switching between different learning rules.

    Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

    Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible…
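
    A discretized sketch of the continuous-time TD error underlying this approach (following the form in Doya, 2000; the function name and the backward-difference discretization are assumptions for illustration):

    def continuous_td_error(r_t, v_t, v_prev, dt, tau_r=1.0):
        """Continuous-time TD error  delta(t) = r(t) - V(t)/tau_r + dV/dt,
        with the value derivative approximated by a backward difference."""
        dv_dt = (v_t - v_prev) / dt
        return r_t - v_t / tau_r + dv_dt

    # In the actor-critic architecture described above, this error both nudges
    # the critic toward better value predictions and serves as the
    # neuromodulatory signal delivered to the actor.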

    Maze navigation learning task.

    A: The maze consists of a square enclosure, with a circular goal area (green) in the center. A U-shaped obstacle (red) makes the task harder by forcing turns on trajectories from three out of the four possible starting locations (crosses). B: Color-coded trajectories of an example TD-LTP agent during the first 75 simulated trials. Early trials (blue) are spent exploring the maze and the obstacles, while later trials (green to red) exploit stereotypical behavior. C: Value map (color map) and policy (vector field) represented by the synaptic weights of the agent of panel B after 2000 simulated seconds. D: Goal-reaching latency of agents using different learning rules. Latencies of simulated agents per learning rule are binned by 5 trials (trials 1–5, trials 6–10, etc.). The solid line shows the median of the latencies for each trial bin and the shaded area represents the 25th to 75th percentiles. For the R-max rule, these all fall at the time limit after which a trial was interrupted if the goal was not reached. The R-max agents were simulated without a critic (see main text).

    Cartpole task.

    A: Cartpole swing-up problem (schematic). The cart slides on a rail of length 5, while the pole of length 1 rotates around its axis, subject to gravity. The state of the system is characterized by the cart position and velocity and the pole angle and angular velocity, while the control variable is the force exerted on the cart. The agent receives a reward proportional to the height of the pole's tip. B: Cumulative number of “successful” trials as a function of total trials. A successful trial is defined as a trial where the pole was maintained upright for more than 10 s, out of the maximum trial length. The black line shows the median, and the shaded area represents the quartiles of 20 TD-LTP agents' performance, pooled in bins of 10 trials. The blue line shows the number of successful trials for a single agent. C: Average reward in a given trial. The average reward rate obtained during each trial is shown versus the trial number. After a rapid rise (inset, vertical axis same as main plot), the reward rises on a much slower timescale as the agents learn the finer control needed to keep the pole upright. The line and the shaded area represent the median and the quartiles, as in B. D: Example agent behavior after 4000 trials. The three diagrams show three examples of the same agent recovering from unstable initial conditions (top: pole sideways, center: rightward speed near the rail edge, bottom: small angle near the rail edge).
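
    A small Python sketch of the reward described in panel A (the angle convention and scaling are assumptions for illustration; the paper's exact constants are not reproduced): the agent is rewarded in proportion to the height of the pole's tip.

    import math

    def cartpole_reward(theta, pole_length=1.0):
        """Reward proportional to pole tip height; theta = 0 taken as upright."""
        return pole_length * math.cos(theta)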

    Alternative learning rule and nuisance term.

    A: Schematic comparison of the squared TD gradient learning rule of Eq. 46 and TD-LTP, similar to Figure 2A. B: Linear track task using the squared TD gradient rule. Same conventions as in Figure 2C. C: Linear track task using the TD-LTP rule (reprint of Figure 2C for comparison). D: Integrands of the disturbance term for Poisson spike train statistics. Top: squared TD gradient rule. Bottom: TD-LTP rule. In each plot the numerical value under the curve is given; this corresponds to the contribution of each presynaptic spike to the nuisance term. E: Dependence of the disturbance term on the number of critic neurons for the squared TD gradient rule. The mean weight change under initial conditions on an unrewarded linear track task with frozen weights, using the squared TD gradient learning rule, is plotted versus the number of neurons composing the critic. Each cross corresponds to the mean over a 200 s simulation, with multiple crosses shown per condition. The line shows a fit of the data with the dependence form suggested by Eq. 50. F: Same as E, for critic neurons using the TD-LTP learning rule. G, H: Same experiment as E and F, but using a rate neuron model with zero-mean Gaussian noise. The line shows a fit with the dependence form suggested by Eq. 50.

    Acrobot task.

    A: The acrobot swing-up task features a double pendulum, weakly actuated by a torque at the joint. The state of the pendulum is represented by the two joint angles and the corresponding angular velocities. The goal is to lift the tip above a certain height above the fixed axis of the pendulum, corresponding to the length of the segments. B: Goal-reaching latency of TD-LTP agents. The solid line shows the median of the latencies for each trial number and the shaded area represents the 25th to 75th percentiles of the agents' performance. The red line represents a near-optimal strategy, obtained by the direct search method (see Models). The blue line shows the trajectory of one of the best among the 100 agents. The dotted line shows the limit after which a trial was interrupted if the agent did not reach the goal. C: Example trajectory of an agent successfully reaching the goal height (green line).
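
    A sketch of the goal test implied by panel A (the unit segment lengths and the downward-vertical angle convention are assumptions for illustration): the trial succeeds once the tip of the second link rises above the fixed pivot by one segment length.

    import math

    def acrobot_goal_reached(theta1, theta2, l1=1.0, l2=1.0):
        """Tip height above the fixed pivot, with angles measured from the
        downward vertical; the goal is a tip height exceeding one link length."""
        tip_height = -l1 * math.cos(theta1) - l2 * math.cos(theta1 + theta2)
        return tip_height > l1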