Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in MuJoCo robot simulation tasks, the first empirical
success of emphatic algorithms in prevailing deep RL benchmarks. Comment: NeurIPS 2019
Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and incremental, and its
per-time-step complexity scales linearly with the number of learned weights.
Previous work on actor-critic algorithms is limited to the on-policy setting
and does not take advantage of the recent advances in off-policy gradient
temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable
a target policy to be learned while following and obtaining data from another
(behavior) policy. For many problems, however, actor-critic methods are more
practical than action value methods (like Greedy-GQ) because they explicitly
represent the policy; consequently, the policy can be stochastic and utilize a
large action space. In this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning with the flexibility
in action selection given by actor-critic methods. We derive an incremental,
linear time and space complexity algorithm that includes eligibility traces,
prove convergence under assumptions similar to previous off-policy algorithms,
and empirically show better or comparable performance to existing algorithms on
standard reinforcement-learning benchmark problems. Comment: Full version of the paper, appendix and errata included; Proceedings
of the 2012 International Conference on Machine Learning
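The core mechanism this abstract describes — correcting updates from a behavior policy b with a per-step importance ratio pi(a|s)/b(a|s) — can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the softmax parameterization, learning rate, and variable names are assumptions, and eligibility traces are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
theta = np.zeros((n_actions, n_features))  # actor weights (linear softmax policy)

def softmax_probs(theta, x):
    """Action probabilities of the target policy pi(.|x)."""
    z = theta @ x
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_update(theta, x, a, td_error, b_prob, lr=0.1):
    """One incremental off-policy actor step: scale the policy-gradient
    term by the importance ratio rho = pi(a|x) / b(a|x)."""
    pi = softmax_probs(theta, x)
    rho = pi[a] / b_prob
    grad_log = -np.outer(pi, x)   # d log pi(a|x) / d theta, softmax form
    grad_log[a] += x              # = (one_hot(a) - pi) outer x
    return theta + lr * rho * td_error * grad_log

# one update on a random feature vector, action sampled by a uniform b
x = rng.normal(size=n_features)
theta = actor_update(theta, x, a=1, td_error=0.5, b_prob=1 / 3)
```

With a positive TD error, the update raises the probability of the taken action under the target policy, with its magnitude reweighted by how likely the behavior policy was to take it.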
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Value-based reinforcement-learning algorithms provide state-of-the-art
results in model-free discrete-action settings, and tend to outperform
actor-critic algorithms. We argue that actor-critic algorithms are limited by
their need for an on-policy critic. We propose Bootstrapped Dual Policy
Iteration (BDPI), a novel model-free reinforcement-learning algorithm for
continuous states and discrete actions, with an actor and several off-policy
critics. Off-policy critics are compatible with experience replay, ensuring
high sample-efficiency, without the need for off-policy corrections. The actor,
by slowly imitating the average greedy policy of the critics, leads to
high-quality and state-specific exploration, which we compare to Thompson
sampling. Because the actor and critics are fully decoupled, BDPI is remarkably
stable, and unusually robust to its hyper-parameters. BDPI is significantly
more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete,
continuous and pixel-based tasks. Source code:
https://github.com/vub-ai-lab/bdpi. Comment: Accepted at the European Conference on Machine Learning 2019 (ECML)
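The actor update the abstract describes — slowly imitating the average greedy policy of several off-policy critics — can be sketched in a few lines. The mixing rate `lam` and the tabular Q-value critics below are illustrative assumptions, not BDPI's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_critics = 5, 3, 4

# stochastic actor: one probability distribution over actions per state
actor = np.full((n_states, n_actions), 1.0 / n_actions)
# several independent critics, here as random tabular Q estimates
critics = rng.normal(size=(n_critics, n_states, n_actions))

def bdpi_actor_update(actor, critics, lam=0.05):
    """Pull the actor toward the average greedy policy of the critics."""
    greedy = np.zeros_like(actor)
    for q in critics:
        # each critic votes 1/n_critics for its greedy action in every state
        greedy[np.arange(len(actor)), q.argmax(axis=1)] += 1.0 / len(critics)
    return (1 - lam) * actor + lam * greedy

actor = bdpi_actor_update(actor, critics)
```

Because the critics need not agree, the averaged greedy target is itself stochastic, which is what gives the state-specific, Thompson-sampling-like exploration the abstract mentions.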
Noisy Importance Sampling Actor-Critic: An Off-Policy Actor-Critic With Experience Replay
This paper presents Noisy Importance Sampling Actor-Critic (NISAC), a set of empirically validated modifications to the advantage actor-critic algorithm (A2C), allowing off-policy reinforcement learning and increased performance. NISAC uses additive action space noise, aggressive truncation of importance sample weights, and large batch sizes. We see that additive noise drastically changes how off-sample experience is weighted for policy updates. The modified algorithm achieves an increase in convergence speed and sample efficiency compared to both the on-policy actor-critic A2C and the importance weighted off-policy actor-critic algorithm. In comparison to state-of-the-art (SOTA) methods, such as actor-critic with experience replay (ACER), NISAC approaches their performance on several of the tested environments while training 40% faster and being significantly easier to implement. The effectiveness of NISAC is demonstrated against existing on-policy and off-policy actor-critic algorithms on a subset of the Atari domain.
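The "aggressive truncation of importance sample weights" mentioned above is a standard variance-reduction device: each ratio pi(a|s)/b(a|s) is clipped at a cap before it scales the update. A minimal sketch, with an assumed cap value and illustrative probabilities:

```python
import numpy as np

def truncated_is_weights(pi_probs, b_probs, cap=1.0):
    """Clip each importance ratio pi/b at `cap` before it scales an update.

    Capping bounds the variance of off-policy updates at the cost of some
    bias for ratios above the cap.
    """
    return np.minimum(cap, np.asarray(pi_probs) / np.asarray(b_probs))

# raw ratios are 3.0, 0.25 and 1.0; the first is truncated to the cap
w = truncated_is_weights([0.9, 0.1, 0.5], [0.3, 0.4, 0.5], cap=1.0)
```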
Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning
Off-policy learning is more unstable compared to on-policy learning in
reinforcement learning (RL). One reason for the instability of off-policy
learning is a discrepancy between the target (π) and behavior (b) policy
distributions. The discrepancy between the π and b distributions can be
alleviated by employing a smooth variant of importance sampling (IS), such
as relative importance sampling (RIS). RIS has a parameter which controls
smoothness. To cope with this instability, we present the first relative
importance sampling off-policy actor-critic (RIS-Off-PAC) model-free
algorithms in RL. In our method, the network yields a target policy (the
actor) and a value function (the critic) assessing the current policy (π)
using samples drawn from the behavior policy. We use the action value
generated from the behavior policy, rather than from the target policy, in
the reward function to train our algorithm. We also use deep neural
networks to train both the actor and the critic. We evaluate our algorithm
on a number of OpenAI Gym benchmark problems and demonstrate better or
comparable performance to several state-of-the-art RL baselines.
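One common smooth variant of the importance ratio is the relative (β-relative) density ratio, w_β = π / (β·π + (1 − β)·b), which recovers the ordinary ratio π/b at β = 0 and is bounded above by 1/β for β > 0. Whether RIS-Off-PAC uses exactly this form is an assumption here; the abstract only says the parameter controls smoothness.

```python
def ris_weight(pi_prob, b_prob, beta):
    """Beta-relative importance weight: pi / (beta*pi + (1-beta)*b).

    beta = 0 gives the ordinary importance ratio pi/b; larger beta
    shrinks the weight toward 1 and bounds it by 1/beta.
    """
    return pi_prob / (beta * pi_prob + (1 - beta) * b_prob)

# a large policy mismatch: pi = 0.9 where b = 0.1
w0 = ris_weight(0.9, 0.1, beta=0.0)   # ordinary ratio: 9.0
w5 = ris_weight(0.9, 0.1, beta=0.5)   # smoothed: 0.9 / 0.5 = 1.8, <= 1/beta = 2
```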
Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety
of continuous control tasks. Normally, the critic's action-value function is
updated using temporal-difference, and the critic in turn provides a loss for
the actor that trains it to take actions with higher expected return. In this
paper, we introduce a novel and flexible meta-critic that observes the learning
process and meta-learns an additional loss for the actor that accelerates and
improves actor-critic learning. Compared to the vanilla critic, the meta-critic
network is explicitly trained to accelerate the learning process; and compared
to existing meta-learning algorithms, meta-critic is rapidly learned online for
a single task, rather than slowly over a family of tasks. Crucially, our
meta-critic framework is designed for off-policy based learners, which
currently provide state-of-the-art reinforcement learning sample efficiency. We
demonstrate that online meta-critic learning leads to improvements in a variety
of continuous control environments when combined with contemporary Off-PAC
methods DDPG, TD3 and the state-of-the-art SAC. Comment: NeurIPS 2020
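The meta-gradient idea above — adjust an auxiliary actor loss so that the *post-update* actor scores better on the main objective, by differentiating through one actor step — can be shown in a one-dimensional toy. Everything here (the quadratic main loss, the linear auxiliary loss, the step sizes) is an illustrative assumption, not the paper's architecture.

```python
lr, meta_lr = 0.1, 0.5
theta, w = 5.0, 0.0            # actor parameter and meta-learned aux weight

def main_grad(th):
    return 2.0 * th            # gradient of the main loss L(th) = th**2

def aux_grad(th):
    return 1.0                 # gradient of the auxiliary loss g(th) = th

for _ in range(50):
    # actor step on main loss plus the meta-weighted auxiliary loss
    theta_new = theta - lr * (main_grad(theta) + w * aux_grad(theta))
    # meta step: dL(theta_new)/dw = L'(theta_new) * d theta_new/dw
    #                             = L'(theta_new) * (-lr * g'(theta))
    w -= meta_lr * main_grad(theta_new) * (-lr * aux_grad(theta))
    theta = theta_new
```

The meta update is computed online from the same single task, mirroring the contrast the abstract draws with meta-learning over a family of tasks.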