
    Generalized Off-Policy Actor-Critic

    We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. Comment: NeurIPS 201
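
    The emphatic approach mentioned in this abstract reweights updates with a followon trace. Below is a minimal sketch of that idea, assuming the standard recursion F_t = γ · ρ_{t-1} · F_{t-1} + i(S_t) with per-state interest i and importance ratios ρ; the function name, interest values, and discount are illustrative, not the paper's code.

```python
import numpy as np

def followon_trace(rhos, interests, gamma):
    """Illustrative emphatic followon trace: F_t = gamma * rho_{t-1} * F_{t-1} + i(S_t).

    rhos:      importance ratios pi(a_t|s_t) / b(a_t|s_t) along the behavior trajectory
    interests: per-state interest i(S_t) (often all ones)
    gamma:     discount factor
    """
    F = np.zeros(len(interests))
    F[0] = interests[0]
    for t in range(1, len(interests)):
        F[t] = gamma * rhos[t - 1] * F[t - 1] + interests[t]
    return F

# Example with uniform interest: later states accumulate larger (emphatic) weights.
print(followon_trace(rhos=np.array([1.2, 0.8, 1.0, 0.5]), interests=np.ones(4), gamma=0.99))
```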

    Off-Policy Actor-Critic

    This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action-value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental algorithm with linear time and space complexity that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems. Comment: Full version of the paper, appendix and errata included; Proceedings of the 2012 International Conference on Machine Learning
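
    As an illustration of the core update this line of work builds on, here is a minimal sketch of an importance-weighted actor step: data comes from the behavior policy, and the ratio π(a|s)/b(a|s) corrects the actor gradient. The function, its arguments, and the learning rate are illustrative assumptions, not the paper's Off-PAC implementation (which also uses eligibility traces and a gradient-TD critic).

```python
import numpy as np

def off_policy_actor_step(theta, grad_log_pi, pi_prob, b_prob, q_value, lr=0.01):
    """One illustrative off-policy actor update for the policy parameters.

    theta:        current actor parameters (any ndarray)
    grad_log_pi:  gradient of log pi(a|s; theta) for the sampled state-action pair
    pi_prob:      target-policy probability pi(a|s)
    b_prob:       behavior-policy probability b(a|s) (must be > 0 wherever pi is)
    q_value:      critic's estimate of the action value Q(s, a)
    """
    rho = pi_prob / b_prob                     # importance sampling ratio
    return theta + lr * rho * q_value * grad_log_pi

# Example: a single update with made-up numbers.
theta = np.zeros(3)
theta = off_policy_actor_step(theta, grad_log_pi=np.array([0.2, -0.1, 0.4]),
                              pi_prob=0.5, b_prob=0.25, q_value=1.3)
print(theta)
```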

    Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

    Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample efficiency without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR on discrete, continuous, and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Comment: Accepted at the European Conference on Machine Learning 2019 (ECML)
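
    A minimal sketch of the actor-imitation step described above, assuming the actor keeps an explicit action distribution per state and is moved a small step toward the average greedy policy of the off-policy critics; the step size and array shapes are illustrative, and the full BDPI algorithm (see the linked source code) also specifies how the critics themselves are trained.

```python
import numpy as np

def bdpi_actor_imitation(actor_probs, critic_qs, step=0.05):
    """Move the actor toward the average greedy policy of several off-policy critics.

    actor_probs: (n_actions,) current actor distribution for one state
    critic_qs:   (n_critics, n_actions) Q-value estimates, one row per critic
    step:        how far the actor moves toward the critics' consensus
    """
    n_critics, n_actions = critic_qs.shape
    avg_greedy = np.zeros(n_actions)
    for q in critic_qs:
        avg_greedy[np.argmax(q)] += 1.0 / n_critics   # each critic votes for its greedy action
    new_probs = (1.0 - step) * actor_probs + step * avg_greedy
    return new_probs / new_probs.sum()                # renormalize to a valid distribution

# Example: three critics, four actions.
probs = np.full(4, 0.25)
qs = np.array([[1.0, 0.2, 0.1, 0.0],
               [0.9, 1.1, 0.0, 0.3],
               [1.2, 0.4, 0.2, 0.1]])
print(bdpi_actor_imitation(probs, qs))
```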

    Noisy Importance Sampling Actor-Critic: An Off-Policy Actor-Critic With Experience Replay

    This paper presents Noisy Importance Sampling Actor-Critic (NISAC), a set of empirically validated modifications to the advantage actor-critic algorithm (A2C) that allow off-policy reinforcement learning and increased performance. NISAC uses additive action-space noise, aggressive truncation of importance sampling weights, and large batch sizes. We see that additive noise drastically changes how off-sample experience is weighted for policy updates. The modified algorithm achieves an increase in convergence speed and sample efficiency compared to both the on-policy actor-critic A2C and the importance-weighted off-policy actor-critic algorithm. In comparison to state-of-the-art (SOTA) methods, such as actor-critic with experience replay (ACER), NISAC approaches their performance on several of the tested environments while training 40% faster and being significantly easier to implement. The effectiveness of NISAC is demonstrated against existing on-policy and off-policy actor-critic algorithms on a subset of the Atari domain.
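
    A minimal sketch of the two weighting ingredients named above, additive action-space noise and aggressively truncated importance weights; the clip value, noise scale, and function names are illustrative assumptions rather than NISAC's exact settings.

```python
import numpy as np

def noisy_action(action, noise_std=0.1, rng=None):
    """Add Gaussian action-space noise before the action is executed (scale is illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    return action + rng.normal(0.0, noise_std, size=np.shape(action))

def truncated_is_weight(log_pi, log_b, clip=1.0):
    """Importance weight min(pi/b, clip), computed from log-probabilities for stability."""
    return min(np.exp(log_pi - log_b), clip)

# Example: a weight of 3.0 is aggressively truncated to the clip value of 1.0.
print(truncated_is_weight(log_pi=np.log(0.6), log_b=np.log(0.2)))
```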

    Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning

    Off-policy learning is more unstable than on-policy learning in reinforcement learning (RL). One reason for the instability of off-policy learning is the discrepancy between the target (π) and behavior (b) policy distributions. The discrepancy between the π and b distributions can be alleviated by employing a smooth variant of importance sampling (IS), such as relative importance sampling (RIS). RIS has a parameter β ∈ [0, 1] which controls smoothness. To cope with instability, we present the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithm in RL. In our method, the network yields a target policy (the actor) and a value function (the critic) that assesses the current policy π using samples drawn from the behavior policy. We use the action value generated from the behavior policy, rather than from the target policy, in the reward function to train our algorithm. We also use deep neural networks to train both the actor and the critic. We evaluate our algorithm on a number of OpenAI Gym benchmark problems and demonstrate better or comparable performance to several state-of-the-art RL baselines.
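
    A minimal sketch of a smoothed weight in the relative importance sampling style described above, assuming the relative density-ratio form π / (β·π + (1 − β)·b): with β = 0 it recovers the ordinary ratio π/b, and with β = 1 it is constant at 1, so intermediate β trades variance for bias. The exact form and β schedule used in the paper may differ.

```python
def relative_is_weight(pi_prob, b_prob, beta):
    """Smoothed importance weight (relative density-ratio form, assumed here for illustration)."""
    return pi_prob / (beta * pi_prob + (1.0 - beta) * b_prob)

# Example: the same off-policy sample under different amounts of smoothing.
print(relative_is_weight(0.6, 0.2, beta=0.0))  # 3.0 -> ordinary importance weight
print(relative_is_weight(0.6, 0.2, beta=0.5))  # 1.5 -> smoothed
print(relative_is_weight(0.6, 0.2, beta=1.0))  # 1.0 -> fully smoothed (uniform weight)
```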

    Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

    Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference learning, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, the meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with the contemporary Off-PAC methods DDPG, TD3, and the state-of-the-art SAC. Comment: NeurIPS 202
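
    A minimal sketch of the kind of meta-gradient that could drive such a meta-critic, assuming a DDPG-style vanilla actor loss, an auxiliary loss network that scores state-action pairs, and PyTorch's torch.func.functional_call for a differentiable virtual actor update; the module interfaces and the single inner step are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch.func import functional_call

def meta_critic_loss(actor, critic, meta_critic, states, held_out_states, actor_lr=1e-3):
    """Illustrative meta-loss for the auxiliary (meta-critic) loss network.

    The actor's training loss is the vanilla DDPG-style loss plus the meta-critic's
    auxiliary loss. We take one differentiable "virtual" actor step under that combined
    loss, then score the updated actor with the vanilla loss on held-out states; the
    returned scalar can be backpropagated into the meta-critic's parameters.
    """
    names = [n for n, _ in actor.named_parameters()]
    params = [p for _, p in actor.named_parameters()]

    actions = actor(states)
    combined = -critic(states, actions).mean() + meta_critic(states, actions).mean()
    grads = torch.autograd.grad(combined, params, create_graph=True)

    # Virtual one-step actor update, kept differentiable w.r.t. the meta-critic.
    updated = {n: p - actor_lr * g for n, p, g in zip(names, params, grads)}
    new_actions = functional_call(actor, updated, (held_out_states,))
    return -critic(held_out_states, new_actions).mean()
```

    In a training loop, this scalar would be minimized by the meta-critic's own optimizer alongside the ordinary actor and critic updates.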