Planning with Expectation Models
Distribution and sample models are two popular model choices in model-based
reinforcement learning (MBRL). However, learning these models can be
intractable, particularly when the state and action spaces are large.
Expectation models, on the other hand, are easier to learn due to
their compactness and have also been widely used for deterministic
environments. For stochastic environments, it is not obvious how expectation
models can be used for planning as they only partially characterize a
distribution. In this paper, we propose a sound way of using approximate
expectation models for MBRL. In particular, we 1) show that planning with an
expectation model is equivalent to planning with a distribution model if the
state value function is linear in state features, 2) analyze two common
parametrization choices for approximating the expectation: linear and
non-linear expectation models, 3) propose a sound model-based policy evaluation
algorithm and present its convergence results, and 4) empirically demonstrate
the effectiveness of the proposed planning algorithm.
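To see why claim 1) holds, note that a linear value estimate commutes with expectation; a minimal sketch, in notation the abstract does not fix ($\phi(s)$ the feature vector, $w$ the weight vector, $p$ the transition distribution):

\[
\mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[\hat{v}(s')\big]
= \mathbb{E}_{s'}\big[w^\top \phi(s')\big]
= w^\top \mathbb{E}_{s'}\big[\phi(s')\big],
\]

so a model that predicts only the expected next feature vector supports exactly the same value backups as a full distribution model.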
Recurrent Value Functions
Despite recent successes in Reinforcement Learning, value-based methods often
suffer from high variance, which hinders performance. In this paper, we illustrate
this in a continuous control setting where state-of-the-art methods perform
poorly whenever sensor noise is introduced. To overcome this issue, we
introduce Recurrent Value Functions (RVFs) as an alternative to estimate the
value function of a state. We propose to estimate the value function of the
current state using the value function of past states visited along the
trajectory. Due to the nature of their formulation, RVFs have a natural way of
learning an emphasis function that selectively emphasizes important states.
First, we establish the asymptotic convergence properties of RVFs in tabular
settings. We then demonstrate their robustness on a partially observable domain
and continuous control tasks. Finally, we provide a qualitative interpretation
of the learned emphasis function.
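As a rough illustration of the recurrence (a sketch of the idea, not necessarily the paper's exact formulation), the estimate at each step can be written as a convex combination of the current value prediction and the running estimate carried along the trajectory, gated by an emphasis weight:

```python
import numpy as np

def recurrent_value_estimates(values, betas):
    """Blend each step's value prediction with the running estimate
    from past states along the trajectory.

    values: per-step value predictions V(s_t), shape (T,)
    betas:  per-step emphasis weights in [0, 1], shape (T,)
    """
    v_rvf = np.empty_like(values, dtype=float)
    v_rvf[0] = values[0]  # no history at the first step
    for t in range(1, len(values)):
        # High beta: trust the current prediction (an "important" state).
        # Low beta: lean on the smoothed estimate of past states.
        v_rvf[t] = betas[t] * values[t] + (1.0 - betas[t]) * v_rvf[t - 1]
    return v_rvf

# A noisy prediction at t=1 is damped when its emphasis weight is low.
print(recurrent_value_estimates(np.array([1.0, 5.0, 1.2, 1.1]),
                                np.array([1.0, 0.2, 0.9, 0.9])))
```

This smoothing is what buys robustness to sensor noise: a low-emphasis state contributes little of its own (noisy) prediction.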
Geometric Insights into the Convergence of Nonlinear TD Learning
While there are convergence guarantees for temporal difference (TD) learning
when using linear function approximators, the situation for nonlinear models is
far less understood, and divergent examples are known. Here we take a first
step towards extending theoretical convergence guarantees to TD learning with
nonlinear function approximation. More precisely, we consider the expected
learning dynamics of the TD(0) algorithm for value estimation. As the step-size
converges to zero, these dynamics are defined by a nonlinear ODE which depends
on the geometry of the space of function approximators, the structure of the
underlying Markov chain, and their interaction. We find a set of function
approximators that includes ReLU networks and has geometry amenable to TD
learning regardless of environment, so that the solution performs about as well
as linear TD in the worst case. Then, we show how environments that are more
reversible induce dynamics that are better for TD learning and prove global
convergence to the true value function for well-conditioned function
approximators. Finally, we generalize a divergent counterexample to a family of
divergent problems to demonstrate how the interaction between approximator and
environment can go wrong and to motivate the assumptions needed to prove
convergence.
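For reference, the expected TD(0) dynamics studied here take a standard form; a sketch in assumed notation ($v_\theta$ the function approximator, $d$ the stationary distribution of the chain, $P$ the transition kernel, $\gamma$ the discount, $r$ the reward):

\[
\dot{\theta} = \mathbb{E}_{s \sim d,\; s' \sim P(\cdot \mid s)}
\Big[\big(r(s, s') + \gamma\, v_\theta(s') - v_\theta(s)\big)\, \nabla_\theta v_\theta(s)\Big].
\]

Both ingredients named in the abstract are visible here: the geometry of the approximator enters through $\nabla_\theta v_\theta$, and the Markov chain enters through $d$ and $P$.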
A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms
We investigate the discounting mismatch in actor-critic algorithm
implementations from a representation learning perspective. Theoretically,
actor-critic algorithms usually have discounting for both actor and critic,
i.e., there is a $\gamma^t$ term in the actor update for the transition
observed at time $t$ in a trajectory, and the critic is a discounted value
function. Practitioners, however, usually ignore the discounting ($\gamma^t$)
for the actor while using a discounted critic. We investigate this mismatch in
two scenarios. In the first scenario, we consider optimizing an undiscounted
objective ($\gamma = 1$) where $\gamma^t$ disappears naturally ($1^t = 1$). We
then propose to interpret the discounting in the critic in terms of a
bias-variance-representation trade-off and provide supporting empirical
results. In the second scenario, we consider optimizing a discounted objective
($\gamma < 1$) and propose to interpret the omission of the discounting in the
actor update from an auxiliary task perspective and provide supporting
empirical results.
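A schematic Monte Carlo policy-gradient surrogate makes the mismatch concrete; a sketch, not the authors' code, with hypothetical inputs `logps` (per-step log-probabilities), `advs` (advantage estimates from a discounted critic), and discount `gamma`:

```python
def actor_surrogate_theory(logps, advs, gamma):
    # Theoretical update: the transition at time t carries a gamma**t weight;
    # differentiating this w.r.t. the policy parameters gives the actor update.
    return sum((gamma ** t) * logps[t] * advs[t] for t in range(len(logps)))

def actor_surrogate_practice(logps, advs, gamma):
    # Common implementation: the gamma**t weight is dropped for the actor,
    # even though advs still comes from a discounted critic; this is the
    # mismatch the paper studies.
    return sum(logps[t] * advs[t] for t in range(len(logps)))
```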
A Geometric Perspective on Optimal Representations for Reinforcement Learning
We propose a new perspective on representation learning in reinforcement
learning based on geometric properties of the space of value functions. We
leverage this perspective to provide formal evidence regarding the usefulness
of value functions as auxiliary tasks. Our formulation considers adapting the
representation to minimize the (linear) approximation error of the value function of
all stationary policies for a given environment. We show that this optimization
reduces to making accurate predictions regarding a special class of value
functions which we call adversarial value functions (AVFs). We demonstrate that
using value functions as auxiliary tasks corresponds to an expected-error
relaxation of our formulation, with AVFs a natural candidate, and identify a
close relationship with proto-value functions (Mahadevan, 2005). We highlight
characteristics of AVFs and their usefulness as auxiliary tasks in a series of
experiments on the four-room domain.
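In the paper's notation as we read it ($V^\pi \in \mathbb{R}^{|S|}$ the vector of state values under policy $\pi$, $\delta$ an arbitrary direction), an adversarial value function is the value function of a policy that maximizes a linear functional over the space of value functions:

\[
\mathrm{AVF}(\delta) = V^{\pi_\delta}, \qquad
\pi_\delta \in \operatorname*{arg\,max}_{\pi}\; \delta^\top V^\pi, \qquad
\delta \in \mathbb{R}^{|S|},
\]

and predicting a collection of such vectors as auxiliary tasks is what shapes the representation in the four-room experiments.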
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning
Temporal-Difference (TD) learning with nonlinear smooth function
approximation for policy evaluation has achieved great success in modern
reinforcement learning. It is shown that such a problem can be reformulated as
a stochastic nonconvex-strongly-concave optimization problem, which is
challenging because the naive stochastic gradient descent-ascent algorithm suffers from
slow convergence. Existing approaches for this problem are based on
two-timescale or double-loop stochastic gradient algorithms, which may also
require sampling large-batch data. However, in practice, a single-timescale
single-loop stochastic algorithm is preferred due to its simplicity and also
because its step-size is easier to tune. In this paper, we propose two
single-timescale single-loop algorithms which require only one data point each
step. Our first algorithm implements momentum updates on both primal and dual
variables, achieving an $\mathcal{O}(\varepsilon^{-4})$ sample complexity, which shows the
important role of momentum in obtaining a single-timescale algorithm. Our
second algorithm improves upon the first one by applying variance reduction on
top of momentum, which matches the best known $\mathcal{O}(\varepsilon^{-3})$ sample
complexity in existing works. Furthermore, our variance-reduction algorithm
does not require a large-batch checkpoint. Moreover, our theoretical results
for both algorithms are expressed in a tighter form of simultaneous primal- and
dual-side convergence.
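A generic sketch of the first algorithm's structure, as we read it (the gradient oracles `grad_primal` and `grad_dual`, and the specific constants, are our assumptions, not the paper's exact update):

```python
import numpy as np

def momentum_gda(grad_primal, grad_dual, theta, omega, sampler,
                 eta=1e-2, beta=0.9, steps=10_000):
    """Single-timescale, single-loop stochastic gradient descent-ascent with
    momentum on both the primal (theta) and dual (omega) variables.
    One data point is drawn per step; both sides share the step-size eta."""
    m_theta = np.zeros_like(theta)
    m_omega = np.zeros_like(omega)
    for _ in range(steps):
        xi = sampler()  # a single stochastic sample per iteration
        # Momentum: moving averages of the primal and dual gradient estimates.
        m_theta = (1 - beta) * m_theta + beta * grad_primal(theta, omega, xi)
        m_omega = (1 - beta) * m_omega + beta * grad_dual(theta, omega, xi)
        theta = theta - eta * m_theta  # descent on the primal variable
        omega = omega + eta * m_omega  # ascent on the dual variable
    return theta, omega
```

The second algorithm adds variance reduction on top of these momentum estimates; we omit that correction from the sketch.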