Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and incremental, and its
per-time-step complexity scales linearly with the number of learned weights.
Previous work on actor-critic algorithms is limited to the on-policy setting
and does not take advantage of the recent advances in off-policy gradient
temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable
a target policy to be learned while following and obtaining data from another
(behavior) policy. For many problems, however, actor-critic methods are more
practical than action value methods (like Greedy-GQ) because they explicitly
represent the policy; consequently, the policy can be stochastic and utilize a
large action space. In this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning with the flexibility
in action selection given by actor-critic methods. We derive an incremental,
linear time and space complexity algorithm that includes eligibility traces,
prove convergence under assumptions similar to previous off-policy algorithms,
and empirically show better or comparable performance to existing algorithms on
standard reinforcement-learning benchmark problems.
Comment: Full version of the paper, appendix and errata included; Proceedings of the 2012 International Conference on Machine Learning
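The abstract above describes an online, incremental actor-critic that learns a target policy from data generated by a behavior policy. A minimal illustrative sketch of such an update is below: a linear critic, a softmax actor, and a per-step importance-sampling ratio. All names and step sizes here are assumptions for illustration; the paper's actual algorithm additionally uses eligibility traces and a gradient-TD critic, which are omitted for brevity.

```python
# Hedged sketch of a one-step off-policy actor-critic update with an
# importance-sampling ratio (illustrative only; the paper's algorithm also
# uses eligibility traces and gradient temporal-difference learning).
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 4, 3
w = np.zeros(n_features)                   # linear critic: V(s) ~= w . x(s)
theta = np.zeros((n_actions, n_features))  # softmax actor parameters

def target_probs(x):
    """Softmax target policy pi(. | s) over action preferences theta @ x."""
    prefs = theta @ x
    prefs -= prefs.max()                   # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def update(x, a, r, x_next, b_prob, alpha_v=0.1, alpha_pi=0.01, gamma=0.99):
    """One incremental update from a transition drawn from the behavior policy.

    b_prob is the probability the behavior policy assigned to the taken action.
    """
    global w, theta
    pi = target_probs(x)
    rho = pi[a] / b_prob                       # importance-sampling ratio pi/b
    delta = r + gamma * (w @ x_next) - w @ x   # TD error
    w += alpha_v * rho * delta * x             # critic step (plain TD here)
    grad_log = -np.outer(pi, x)                # grad of log pi(a|s) for softmax:
    grad_log[a] += x                           # (1{b=a} - pi(b)) * x
    theta += alpha_pi * rho * delta * grad_log # actor step
    return delta

# One transition sampled under a uniform behavior policy.
x = rng.normal(size=n_features)
x_next = rng.normal(size=n_features)
delta = update(x, a=1, r=1.0, x_next=x_next, b_prob=1.0 / n_actions)
```

Note the rho factor: it is what lets the actor and critic be updated from behavior-policy data while still estimating quantities under the target policy. Per-step cost is linear in the number of weights, matching the complexity claim in the abstract.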
Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
We present the first provably convergent two-timescale off-policy
actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is
the introduction of a new critic, the emphasis critic, which is trained via
Gradient Emphasis Learning (GEM), a novel combination of the key ideas of
Gradient Temporal Difference Learning and Emphatic Temporal Difference
Learning. With the help of the emphasis critic and the canonical value function
critic, we show convergence for COF-PAC, where the critics are linear and the
actor can be nonlinear.
Comment: ICML 202
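"Two-timescale" in the title refers to the standard device of running the critics on a faster step-size schedule than the actor, so the critics effectively equilibrate between actor moves. A small sketch of such schedules is below; the decay exponents are illustrative assumptions, not the ones used in the paper.

```python
# Hedged sketch of two-timescale step-size schedules: the critic step size
# alpha_t decays more slowly than the actor step size beta_t, so the ratio
# beta_t / alpha_t tends to zero. Exponents in (0.5, 1] keep each schedule
# square-summable but not summable; the specific values are illustrative.

def alpha(t):   # critic (fast) timescale
    return 1.0 / (t + 1) ** 0.6

def beta(t):    # actor (slow) timescale
    return 1.0 / (t + 1) ** 0.9

# beta(t) / alpha(t) = (t + 1) ** (0.6 - 0.9), which shrinks toward zero.
ratios = [beta(t) / alpha(t) for t in (10, 1_000, 100_000)]
```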
Finite Sample Analysis of Mean-Volatility Actor-Critic for Risk-Averse Reinforcement Learning
The goal in the standard reinforcement learning problem is to find a policy that optimizes the expected return. However, such an objective is not adequate in many real-life applications, such as finance, where controlling the uncertainty of the outcome is imperative. The mean-volatility objective penalizes, through a tunable parameter, policies with high variance of the per-step reward. An interesting property of this objective is that it admits simple linear Bellman equations that resemble, up to a reward transformation, those of the risk-neutral case. However, the required reward transformation is policy-dependent and requires the (usually unknown) expected return of the policy in use. In this work, we propose two general methods for policy evaluation under the mean-volatility objective: the direct method and the factored method. We then extend recent results for finite sample analysis in the risk-neutral actor-critic setting to the mean-volatility case. Our analysis shows that the sample complexity to attain an ϵ-accurate stationary point is the same as that of the risk-neutral version, using either policy evaluation method for training the critic. Finally, we carry out experiments to test the proposed methods in a simple environment that exhibits some trade-off between optimality, in expectation, and uncertainty of the outcome.
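To make the policy-dependent reward transformation mentioned in the abstract concrete, here is a minimal sketch. The specific form used below, r_tilde = r − λ(r − ĵ)² with ĵ an estimate of the policy's mean per-step reward, is an assumption for illustration; the paper's exact transformation may differ, but it shares the key features the abstract names: it penalizes per-step reward variance via a tunable λ and depends on a (usually unknown) quantity of the current policy.

```python
import numpy as np

# Hedged sketch of a policy-dependent reward transformation of the kind
# described in the abstract. ASSUMED form (may differ from the paper's):
#   r_tilde = r - lam * (r - j_hat) ** 2
# where j_hat estimates the mean per-step reward of the current policy.

def transformed_rewards(rewards, lam, j_hat):
    rewards = np.asarray(rewards, dtype=float)
    return rewards - lam * (rewards - j_hat) ** 2

rewards = [1.0, -1.0, 1.0, 1.0]      # per-step rewards under some policy
j_hat = float(np.mean(rewards))      # plug-in estimate of the mean reward
r_tilde = transformed_rewards(rewards, lam=0.2, j_hat=j_hat)

# Averaging the transformed rewards yields (mean reward) minus
# lam times the sample variance of the per-step reward:
objective = float(r_tilde.mean())
```

The policy dependence the abstract highlights is visible here: j_hat must be re-estimated whenever the policy changes, which is precisely why the transformation complicates evaluation relative to the risk-neutral case.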