Addressing Function Approximation Error in Actor-Critic Methods
In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and the critic. Our algorithm builds on Double Q-learning, by
taking the minimum value between a pair of critics to limit overestimation. We
draw the connection between target networks and overestimation bias, and
suggest delaying policy updates to reduce per-update error and further improve
performance. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested.
Comment: Accepted at ICML 2018
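To make the mechanism concrete, here is a minimal numerical sketch (not the authors' reference implementation) of the clipped double-Q target the abstract describes: the bootstrap target takes the minimum of two target-critic estimates, and the actor is updated less often than the critics. All array names and values below are illustrative only.

```python
import numpy as np

def clipped_double_q_target(rewards, next_q1, next_q2, dones, gamma=0.99):
    """y = r + gamma * min(Q1', Q2') on non-terminal transitions; taking the
    minimum over the two target-critic estimates limits overestimation."""
    min_next_q = np.minimum(next_q1, next_q2)
    return rewards + gamma * (1.0 - dones) * min_next_q

# Toy batch, just to show the shapes involved (values are made up).
rewards = np.array([1.0, 0.0, -1.0])
next_q1 = np.array([10.0, 5.0, 2.0])   # estimates from target critic 1
next_q2 = np.array([ 9.0, 6.0, 3.0])   # estimates from target critic 2
dones   = np.array([0.0, 0.0, 1.0])
print(clipped_double_q_target(rewards, next_q1, next_q2, dones))
# Both critics are regressed toward this target every step, while the actor and
# the target networks are updated only every few critic steps (delayed updates).
```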
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision
Processes, for which it is known that there exists a stationary optimal policy.
Using Value and Policy Iteration with some error $\epsilon$ at each iteration,
it is well-known that one can compute stationary policies that are
$\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this
guarantee is tight, we develop variations of Value and Policy Iteration for
computing non-stationary policies that can be up to
$\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant
improvement in the usual situation when $\gamma$ is close to 1. Surprisingly,
this shows that the problem of "computing near-optimal non-stationary policies"
is much simpler than that of "computing near-optimal stationary policies".
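As a rough illustration of how such a non-stationary policy can be obtained, the sketch below runs tabular value iteration and keeps the greedy policies of the last few iterations, to be executed cyclically. This is a toy reading of the idea, not the paper's algorithm, and the MDP is made up.

```python
import numpy as np

def nonstationary_greedy_policies(P, R, gamma, iters, cycle_len):
    """Run tabular value iteration and return the greedy policies of the last
    `cycle_len` iterations; executing them in a cycle gives a non-stationary
    policy. P[s, a, s'] are transition probabilities, R[s, a] are rewards."""
    v = np.zeros(R.shape[0])
    policies = []
    for _ in range(iters):
        q = R + gamma * np.einsum('san,n->sa', P, v)   # one Bellman backup
        policies.append(np.argmax(q, axis=1))
        v = np.max(q, axis=1)
    return policies[-cycle_len:]

# Toy 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(nonstationary_greedy_policies(P, R, gamma=0.95, iters=50, cycle_len=4))
```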
Beyond the One Step Greedy Approach in Reinforcement Learning
The famous Policy Iteration algorithm alternates between policy improvement
and policy evaluation. Implementations of this algorithm with several variants
of the latter evaluation stage, e.g., $n$-step and trace-based returns, have
been analyzed in previous works. However, the case of multiple-step lookahead
policy improvement, despite the recent increase in empirical evidence of its
strength, has to our knowledge not been carefully analyzed yet. In this work,
we introduce the first such analysis. Namely, we formulate variants of
multiple-step policy improvement, derive new algorithms using these definitions
and prove their convergence. Moreover, we show that recent prominent
Reinforcement Learning algorithms are, in fact, instances of our framework. We
thus shed light on their empirical success and give a recipe for deriving new
algorithms for future study.
Comment: ICML 2018
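A minimal tabular sketch of what multiple-step lookahead improvement means (assumed notation, not the paper's code): the improved policy's action at each state maximizes an h-step optimal lookahead backed up from the current value estimate, and h = 1 recovers the usual one-step greedy step of Policy Iteration.

```python
import numpy as np

def h_step_greedy_policy(P, R, v, gamma, h):
    """Return the h-step greedy policy w.r.t. the value estimate v: perform
    h-1 optimal Bellman backups starting from v, then take the argmax of the
    resulting one-step Q-values. P[s, a, s'] transitions, R[s, a] rewards."""
    v_h = v.copy()
    for _ in range(h - 1):
        v_h = np.max(R + gamma * np.einsum('san,n->sa', P, v_h), axis=1)
    q = R + gamma * np.einsum('san,n->sa', P, v_h)
    return np.argmax(q, axis=1)

# With h = 1 this is ordinary one-step greedy improvement.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(h_step_greedy_policy(P, R, v=np.zeros(2), gamma=0.95, h=3))
```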
Reinforcement Learning: A Survey
This paper surveys the field of reinforcement learning from a
computer-science perspective. It is written to be accessible to researchers
familiar with machine learning. Both the historical basis of the field and a
broad selection of current work are summarized. Reinforcement learning is the
problem faced by an agent that learns behavior through trial-and-error
interactions with a dynamic environment. The work described here has a
resemblance to work in psychology, but differs considerably in the details and
in the use of the word ``reinforcement.'' The paper discusses central issues of
reinforcement learning, including trading off exploration and exploitation,
establishing the foundations of the field via Markov decision theory, learning
from delayed reinforcement, constructing empirical models to accelerate
learning, making use of generalization and hierarchy, and coping with hidden
state. It concludes with a survey of some implemented systems and an assessment
of the practical utility of current methods for reinforcement learning.
Comment: See http://www.jair.org/ for any accompanying files
Examining average and discounted reward optimality criteria in reinforcement learning
In reinforcement learning (RL), the goal is to obtain an optimal policy, for
which the optimality criterion is fundamentally important. Two major optimality
criteria are average and discounted rewards, where the latter is typically
considered as an approximation to the former. While the discounted reward is
more popular, it is problematic to apply in environments that have no natural
notion of discounting. This motivates us to revisit a) the progression of
optimality criteria in dynamic programming, b) justification for and
complication of an artificial discount factor, and c) benefits of directly
maximizing the average reward. Our contributions include a thorough examination
of the relationship between average and discounted rewards, as well as a
discussion of their pros and cons in RL. We emphasize that average-reward RL
methods possess the ingredient and mechanism for developing the general
discounting-free optimality criterion (Veinott, 1969) in RL.
Comment: 14 pages, 3 figures, 10-page main content
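For reference, the two criteria being contrasted can be written as follows (standard definitions, with $r_t$ the reward at time $t$ under policy $\pi$; the limit is taken as a lim inf since it need not exist in general):

```latex
\[
  \text{average reward:}\quad
    g^{\pi}(s) \;=\; \liminf_{T \to \infty} \frac{1}{T}\,
      \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1} r_t \,\middle|\, s_0 = s\right],
  \qquad
  \text{discounted reward:}\quad
    v_{\gamma}^{\pi}(s) \;=\;
      \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right],
  \quad 0 \le \gamma < 1.
\]
```

The familiar relation $v_{\gamma}^{\pi}(s) \approx g^{\pi}(s)/(1-\gamma)$ as $\gamma \to 1$ (the leading term of the Laurent expansion of $v_{\gamma}^{\pi}$) is what underlies treating the discounted criterion as an approximation to the average one.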