Prioritized Sweeping Neural DynaQ with Multiple Predecessors, and Hippocampal Replays
During sleep and awake rest, the hippocampus replays sequences of place cells
that have been activated during prior experiences. These replays have been interpreted
as a memory consolidation process, but recent results suggest a possible
interpretation in terms of reinforcement learning. The Dyna reinforcement
learning algorithms use off-line replays to improve learning. Under a limited
replay budget, a prioritized sweeping approach, which requires a model of the
transitions to the predecessors, can be used to improve performance. We
investigate whether such algorithms can explain the experimentally observed
replays. We propose a neural network version of prioritized sweeping
Q-learning, for which we developed a growing multiple-expert algorithm able to
cope with multiple predecessors. The resulting architecture improves the
learning of simulated agents confronted with a navigation task. We predict
that, in animals, learning the world model should occur during rest periods,
and that the corresponding replays should be shuffled.
Comment: Living Machines 2018 (Paris, France)
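For context, the classical tabular prioritized-sweeping Dyna-Q that the paper's neural network version builds on can be sketched as follows (after Sutton and Barto); the environment interface, deterministic one-step model, and hyperparameter values are illustrative assumptions, not the paper's implementation. Note how the predecessor sets naturally hold several predecessors per state, the case the growing multiple-expert algorithm is designed to handle.

```python
import heapq
from collections import defaultdict

ALPHA, GAMMA, THETA = 0.1, 0.95, 1e-4   # step size, discount, priority cutoff
BUDGET = 10                              # replay budget per environment step

Q = defaultdict(float)            # Q[(s, a)] -> action-value estimate
model = {}                        # (s, a) -> (r, s'): deterministic world model
predecessors = defaultdict(set)   # s' -> set of (s, a) pairs leading into s'
queue = []                        # max-priority queue (negated priorities)

def priority(s, a, r, s2, actions):
    """Absolute TD error of the transition, used as replay priority."""
    return abs(r + GAMMA * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

def learn_step(s, a, r, s2, actions):
    model[(s, a)] = (r, s2)
    predecessors[s2].add((s, a))
    p = priority(s, a, r, s2, actions)
    if p > THETA:
        heapq.heappush(queue, (-p, (s, a)))
    # Replay: sweep backwards through predecessors, highest priority first.
    for _ in range(BUDGET):
        if not queue:
            break
        _, (ps, pa) = heapq.heappop(queue)
        pr, ps2 = model[(ps, pa)]
        target = pr + GAMMA * max(Q[(ps2, b)] for b in actions)
        Q[(ps, pa)] += ALPHA * (target - Q[(ps, pa)])
        # Queue every known predecessor of the state we just backed up.
        for (qs, qa) in predecessors[ps]:
            qr, _ = model[(qs, qa)]
            qp = priority(qs, qa, qr, ps, actions)
            if qp > THETA:
                heapq.heappush(queue, (-qp, (qs, qa)))
```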
MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning
Reinforcement learning has become one of the best approaches for training a
computer game emulator capable of human-level performance. In a reinforcement
learning approach, an optimal value function is learned across a set of
actions, or decisions, that leads to a set of states giving different rewards,
with the objective of maximizing the overall reward. A policy assigns an
expected return to each state-action pair. We call a policy optimal when its
value function is optimal. QLBS, the Q-Learner in the
Black-Scholes(-Merton) Worlds, applies reinforcement learning concepts,
notably the popular Q-learning algorithm, to the financial stochastic model
of Black, Scholes and Merton. It is, however, specifically optimized for
geometric Brownian motion and vanilla options. Its range of application is,
therefore, limited to vanilla option pricing within financial markets. We
propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement
learning approach that determines the optimal policy of money management based
on the aggregated financial transactions of the clients. It unlocks new
possibilities for establishing personalized credit card limits or processing
bank loan applications in the retail banking industry. MQLV extends the
simulation to mean-reverting stochastic diffusion processes and uses a
digital function, a Heaviside step function expressed in its discrete form, to
estimate the probability of a future event such as a payment default. In our
experiments, we first show the similarities between a set of historical
financial transactions and Vasicek-generated transactions and, then, we
underline the potential of MQLV on generated Monte Carlo simulations. Finally,
MQLV is the first Vasicek-based Q-learning methodology addressing transparent
decision-making processes in retail banking.
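The two ingredients named above, a mean-reverting Vasicek diffusion and a discrete Heaviside step function for event probabilities, can be illustrated with a short Monte Carlo sketch; the parameter values, the Euler-Maruyama discretization, and the default threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def vasicek_paths(x0, kappa, theta, sigma, dt, n_steps, n_paths, rng):
    """Euler-Maruyama simulation of dX = kappa*(theta - X)*dt + sigma*dW."""
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)
        x += kappa * (theta - x) * dt + sigma * dw
    return x  # terminal values X_T across all simulated paths

def event_probability(x_terminal, threshold):
    """Discrete Heaviside: average of the indicator that X_T fell below
    the threshold (a toy proxy for an event such as a payment default)."""
    return np.mean(np.heaviside(threshold - x_terminal, 1.0))

rng = np.random.default_rng(0)
x_T = vasicek_paths(x0=1.0, kappa=0.5, theta=0.8, sigma=0.2,
                    dt=1 / 252, n_steps=252, n_paths=100_000, rng=rng)
print(event_probability(x_T, threshold=0.5))  # estimated event probability
```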
Deep Residual Reinforcement Learning
We revisit residual algorithms in both model-free and model-based
reinforcement learning settings. We propose the bidirectional target network
technique to stabilize residual algorithms, yielding a residual version of DDPG
that significantly outperforms vanilla DDPG in the DeepMind Control Suite
benchmark. Moreover, we find the residual algorithm an effective approach to
the distribution mismatch problem in model-based planning. Compared with the
existing TD(k) method, our residual-based method makes weaker assumptions
about the model and yields a greater performance boost.
Comment: AAMAS 2020
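As background, the residual-gradient idea revisited here (due to Baird) differs from standard semi-gradient TD in whether the gradient flows through the bootstrap target; below is a toy linear-function-approximation sketch, with illustrative features and step size rather than the paper's DDPG variant.

```python
import numpy as np

GAMMA, ALPHA = 0.99, 0.01

def td_updates(w, phi_s, phi_s2, r):
    """Return the semi-gradient and residual-gradient weight updates for a
    linear value function v(s) = w . phi(s) on one transition (s, r, s')."""
    delta = r + GAMMA * phi_s2 @ w - phi_s @ w           # TD error
    semi = ALPHA * delta * phi_s                         # target held fixed
    residual = ALPHA * delta * (phi_s - GAMMA * phi_s2)  # grad through both sides
    return semi, residual

w = np.zeros(4)
phi_s, phi_s2 = np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.])
semi, residual = td_updates(w, phi_s, phi_s2, r=1.0)
```

The residual update descends the squared Bellman error directly, which is known to converge under broader conditions than semi-gradient TD but can learn slowly; the bidirectional target network technique is the paper's proposal for stabilizing residual algorithms in the deep setting.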
Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in MuJoCo robot simulation tasks, the first empirical
success of emphatic algorithms in prevailing deep RL benchmarks.
Comment: NeurIPS 2019
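As background, the emphatic approach used to obtain unbiased gradient samples builds on the follow-on trace of Emphatic TD (Sutton et al., 2016), which reweights states by how much the target policy would have generated the behavior trajectory; a minimal sketch, where the interest values, ratios, and discount are illustrative and this is not the paper's full Geoff-PAC estimator:

```python
import numpy as np

def follow_on_trace(rhos, interests, gamma):
    """F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1} along one trajectory,
    where rho_t = pi(A_t|S_t) / mu(A_t|S_t) is the importance-sampling
    ratio of the target policy pi against the behavior policy mu."""
    F = np.empty(len(interests))
    F[0] = interests[0]
    for t in range(1, len(interests)):
        F[t] = interests[t] + gamma * rhos[t - 1] * F[t - 1]
    return F

rng = np.random.default_rng(0)
rhos = rng.uniform(0.5, 1.5, size=99)   # toy importance-sampling ratios
interests = np.ones(100)                # uniform interest i(s) = 1
print(follow_on_trace(rhos, interests, gamma=0.99)[:5])
```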