
    A neural network model for the orbitofrontal cortex and task space acquisition during reinforcement learning

    Reinforcement learning has been widely used to explain animal behavior. In reinforcement learning, the agent learns the values of the states in the task, which collectively constitute the task state space, and uses this knowledge to choose actions and acquire desired outcomes. It has been proposed that the orbitofrontal cortex (OFC) encodes the task state space during reinforcement learning. However, it is not well understood how the OFC acquires and stores task state information. Here, we propose a neural network model based on reservoir computing. Reservoir networks exhibit heterogeneous and dynamic activity patterns that are suitable for encoding task states. The information can be extracted by a linear readout trained with reinforcement learning. We demonstrate how the network acquires and stores task structures. The network exhibits reinforcement learning behavior, and aspects of it resemble experimental findings on the OFC. Our study provides a theoretical explanation of how the OFC may contribute to reinforcement learning and a new approach to understanding the neural mechanism underlying reinforcement learning.
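    The architecture described in the abstract is a fixed random recurrent reservoir whose activity is decoded by a linear readout trained with reinforcement learning. The sketch below illustrates that idea only; the network sizes, time constants, softmax policy, and simple reward-prediction-error update rule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a reservoir network with a plastic linear readout.
# All parameter values and the learning rule are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

N_RES, N_IN, N_OUT = 200, 4, 2       # reservoir, input, and readout sizes (assumed)
TAU, DT, G = 50.0, 1.0, 1.2          # time constant (ms), step (ms), recurrent gain

W_rec = G * rng.normal(0, 1 / np.sqrt(N_RES), (N_RES, N_RES))  # fixed random recurrence
W_in = rng.normal(0, 1.0, (N_RES, N_IN))                        # fixed input weights
W_out = np.zeros((N_OUT, N_RES))                                # plastic readout, trained by RL

def run_trial(inputs, x=None):
    """Integrate reservoir dynamics over a sequence of input vectors; return final rates."""
    x = np.zeros(N_RES) if x is None else x
    for u in inputs:
        r = np.tanh(x)
        x += (DT / TAU) * (-x + W_rec @ r + W_in @ u)
    return np.tanh(x)

def choose(r, beta=3.0):
    """Softmax policy over the linear readout of the reservoir state."""
    q = W_out @ r
    p = np.exp(beta * q - np.max(beta * q))
    p /= p.sum()
    return rng.choice(N_OUT, p=p), q

def update_readout(r, action, reward, q, lr=0.05):
    """Reward-prediction-error update of the readout row for the chosen action."""
    W_out[action] += lr * (reward - q[action]) * r

# Tiny usage example (hypothetical two-alternative trial with a 4-unit cue):
r = run_trial([np.array([1.0, 0.0, 0.0, 0.0])] * 50)
action, q = choose(r)
update_readout(r, action, reward=1.0, q=q)
```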

    Network analyses for the Two-stage Markov decision task.

    A. Factorial analysis of choice behavior. The network is more likely to repeat its choice under the common-rewarded (CR) and rare-unrewarded (RN) conditions than under the common-unrewarded (CU) and rare-rewarded (RR) conditions. B. The task structure index keeps growing in the intact network (blue line) but stays at a low level when the reward input is missing (red line). Stars indicate a significant difference (one-way ANOVA, p < 0.05). C. Fitting the behavioral performance with a mixture of task-agnostic and task-aware algorithms. The weight parameter w for learning with knowledge of the task structure is significantly larger for the intact network (blue data points) than for the network without the reward input (red data points). Each data point represents a simulation run. A one-way ANOVA is used to determine significance (p < 0.05). D. PCA on the network population activity. The network states are plotted in the space spanned by the first 3 PCA components. The network can distinguish all 8 different states. E. The differences between the SEL neurons' connection weights to DML unit A1 and to DML unit A2. The gray and white areas indicate the blocks in which intermediate outcome B1 is more likely to lead to a reward and the blocks in which B2 is more likely to lead to a reward, respectively. F. Logistic regression shows that only the last trial's state affects the choice. The regression includes four different states (intermediate outcome x reward outcome) for each trial, up to 10 trials before the current trial. Error bars show s.e.m. across simulation runs. G. Logistic regression reveals that only the combination of the intermediate state and the reward outcome in the last trial affects the decision. The factors evaluated are: Correct, a tendency to choose the better option in the current block; Reward, a tendency to repeat the previous choice if it was rewarded; Stay, a tendency to repeat the previous choice; Transition, a tendency to repeat the same choice following common intermediate outcomes and switch following rare intermediate outcomes; Trans x Out, a tendency to repeat the same choice if a common intermediate outcome is rewarded or a rare intermediate outcome is unrewarded, and to switch if a common intermediate outcome is unrewarded or a rare intermediate outcome is rewarded.
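    The factorial analysis in panel A amounts to computing the probability of repeating the previous choice, split by transition type (common or rare) and reward outcome. The sketch below shows one way to compute such stay probabilities from trial-by-trial records; the array names and boolean encoding are assumptions, not the paper's analysis code.

```python
# Sketch of a stay-probability analysis for a two-stage task.
import numpy as np

def stay_probabilities(choices, common, rewards):
    """choices, common, rewards: 1-D arrays over trials (choice id, bool, 0/1).
    Returns P(repeat previous choice) for each transition-by-outcome condition."""
    choices, common, rewards = map(np.asarray, (choices, common, rewards))
    stay = choices[1:] == choices[:-1]                    # did the agent repeat its choice?
    cond = {
        "common-rewarded":   common[:-1] & (rewards[:-1] == 1),
        "rare-rewarded":     ~common[:-1] & (rewards[:-1] == 1),
        "common-unrewarded": common[:-1] & (rewards[:-1] == 0),
        "rare-unrewarded":   ~common[:-1] & (rewards[:-1] == 0),
    }
    return {name: stay[mask].mean() for name, mask in cond.items()}
```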

    Two-stage Markov decision task.

    A. Task structure of the two-stage Markov decision task. Two options, A1 and A2, are available; they lead to two intermediate outcomes, B1 and B2, with different probabilities. The width of the arrows indicates the transition probability. Intermediate outcomes B1 and B2 lead to rewards with different probabilities, and the reward contingency of the intermediate outcomes is reversed between blocks. B. Schematic diagram of the model. It is similar to the model in Fig 1A; the only difference is that there are more input units. C. The event sequence. Units in the input layer are activated sequentially. In the example trial, option A1 is chosen, B1 is presented, and a reward is obtained.
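    For concreteness, the sketch below implements a generic two-stage Markov task environment of the kind described in panel A: each option leads to one of two intermediate outcomes with common or rare transitions, and the reward contingency of the intermediate outcomes reverses between blocks. The specific probabilities are assumptions for illustration, not the values used in the paper.

```python
# Illustrative two-stage Markov decision task environment.
import numpy as np

class TwoStageTask:
    def __init__(self, p_common=0.8, p_reward_good=0.8, p_reward_bad=0.2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.p_common = p_common
        self.p_rew = [p_reward_good, p_reward_bad]   # reward probabilities for B1, B2

    def reverse_block(self):
        """Swap which intermediate outcome is more likely to be rewarded."""
        self.p_rew.reverse()

    def step(self, action):
        """action 0 -> A1, 1 -> A2; returns (intermediate outcome index, reward)."""
        common = self.rng.random() < self.p_common
        outcome = action if common else 1 - action    # A1 commonly leads to B1, A2 to B2
        reward = int(self.rng.random() < self.p_rew[outcome])
        return outcome, reward

# Usage: outcome, reward = TwoStageTask().step(action=0)   # choose A1
```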

    Value-based decision-making task.

    A. Schematic diagram of the model. B. The event sequence. The stimuli are presented between 300 ms and 1300 ms after the trial onset. The decision is computed from the neural activity at 1400 ms after the trial onset. The input neurons' activity profiles mimic those of real neurons (see Methods). C. Choice pattern. The relative value preference calculated from the network's behavior is indicated at the top left; the actual relative value preference used in the simulation is 1A = 2B.
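    A relative value preference such as 1A = 2B can be read out from a choice pattern by fitting a psychometric curve to the log offer ratio and taking the indifference point. The sketch below is a generic version of that fit, not necessarily the procedure used in the paper; the offer quantities and choice fractions are synthetic.

```python
# Generic psychometric fit for recovering a relative value preference.
import numpy as np
from scipy.optimize import curve_fit

def choice_curve(log_ratio, bias, slope):
    """P(choose B) as a logistic function of log(#B offered / #A offered)."""
    return 1.0 / (1.0 + np.exp(-(bias + slope * log_ratio)))

# Synthetic example: quantities of juice B offered against 2 units of juice A,
# and the fraction of trials on which B was chosen.
log_ratio = np.log(np.array([1, 2, 3, 4, 6]) / 2.0)
p_choose_b = np.array([0.02, 0.10, 0.30, 0.50, 0.90])

(bias, slope), _ = curve_fit(choice_curve, log_ratio, p_choose_b, p0=[0.0, 1.0])
relative_value = np.exp(-bias / slope)    # units of B worth one A at indifference
print(f"1A = {relative_value:.2f}B")      # close to 2 for this synthetic data
```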

    Network analyses for the reversal learning task.

    A. Selectivity of three example neurons in the reservoir network. Input units are set to 1 from 200 ms to 700 ms. Left panel: an example neuron that encodes choice options; middle panel: an example neuron that encodes reward outcomes; right panel: an example neuron with mixed selectivity. B. PCA on the network population activity. The network states are plotted in the space spanned by the first 3 PCA components. The activities in different conditions become differentiated after the cue onset. C. The difference between the SEL neurons' connection weights to DML unit A and DML unit B. The SEL neurons are grouped according to their selectivities. For example, AR represents the group of neurons that respond most strongly when input units A and R are both activated. The gray and white areas indicate the blocks in which option A and option B lead to the reward, respectively. D. Left: the proportion of blocks in which the network does not reach the performance criterion within a block after we remove 50 neurons that are randomly chosen (control), A-selective, or AR-selective. Right: the number of errors the network makes before reaching the criterion with the same 3 types of inactivation. Only the data from the A-rewarding blocks are analyzed. The error bars are s.e.m. based on 10 simulation runs. A one-way ANOVA is used to determine significance (p < 0.05). E. The number of errors needed to reach the performance criterion is maintained after training stops at the 50th reversal. The error bars are s.e.m. calculated based on 10 simulation runs.
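    Panel B projects the population activity onto its first three principal components. The sketch below shows a standard way to compute such a projection from a firing-rate matrix; the data layout is an assumption, and the paper works with its own simulated activity.

```python
# Sketch of PCA on population activity for visualizing state trajectories.
import numpy as np

def population_pca(rates, n_components=3):
    """rates: (n_samples, n_neurons) firing-rate matrix, where samples are
    time points and/or task conditions. Returns the top-PC projections."""
    centered = rates - rates.mean(axis=0, keepdims=True)
    # Principal axes from the SVD of the mean-centered activity matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T    # (n_samples, n_components) trajectories
```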

    Value selectivity of the network neurons.

    A. Three example neurons in the SEL. Left panel: a neuron that encodes chosen value; middle panel: a neuron that encodes offer value; right panel: a neuron that encodes chosen juice. B. The proportions of neurons with different selectivities from a previous experimental study [11]. C. The proportions of neurons in the reservoir network with different selectivities.
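    Classifying neurons as chosen-value, offer-value, or chosen-juice coding is commonly done by regressing each neuron's firing rate on each candidate variable and labeling the neuron by the best-fitting one. The sketch below is a generic version of such a selectivity analysis; the R-squared threshold is an assumption, and the criteria in the cited study [11] may differ.

```python
# Generic selectivity classification by linear regression.
import numpy as np

def classify_neuron(rate, variables):
    """rate: (n_trials,) firing rates; variables: dict name -> (n_trials,) regressor.
    Returns the name of the best-fitting variable, or None if all fits are poor."""
    best_name, best_r2 = None, 0.1          # minimum R^2 threshold (assumed)
    for name, x in variables.items():
        X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
        beta, *_ = np.linalg.lstsq(X, rate, rcond=None)
        resid = rate - X @ beta
        r2 = 1.0 - resid.var() / rate.var()
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name
```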

    Reversal learning task.

    A. Schematic diagram of the model. The network is composed of three parts: the input layer (IL), the state encoding layer (SEL), and the decision-making output layer (DML). B. The event sequence. The stimulus and reward inputs are given concurrently at 200 ms after the trial onset and last for 500 ms. After a 200 ms delay, the decision is computed from the neural activity at 900 ms after the trial onset. C. The number of error trials made before the network reaches the performance threshold. The dark line indicates the performance of the network with the reward input; the light line indicates the performance of the network without the reward input, as a model of animals with OFC lesions. Stars indicate a significant difference (one-way ANOVA, p < 0.05).
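    The sketch below implements a generic reversal-learning environment of the kind described here: one of two options is rewarded within a block, errors are counted until a performance criterion is met, and the contingency then reverses. The criterion of 8 correct in the last 10 trials is an assumption for illustration, not the paper's threshold.

```python
# Illustrative reversal-learning task environment with error counting.
import numpy as np
from collections import deque

class ReversalTask:
    def __init__(self, criterion=(8, 10), seed=0):
        self.rng = np.random.default_rng(seed)
        self.rewarded_option = 0                      # option A rewarded first
        self.n_correct, self.window = criterion
        self.recent = deque(maxlen=self.window)
        self.errors_this_block = 0

    def step(self, choice):
        """Deliver the reward, track errors, and reverse the block at criterion."""
        reward = int(choice == self.rewarded_option)
        self.recent.append(reward)
        self.errors_this_block += 1 - reward
        if len(self.recent) == self.window and sum(self.recent) >= self.n_correct:
            self.rewarded_option = 1 - self.rewarded_option   # block reversal
            self.recent.clear()
            self.errors_this_block = 0
        return reward
```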