GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
We present GradientDICE for estimating the density ratio between the state
distribution of the target policy and the sampling distribution in off-policy
reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang
et al., 2020), the state-of-the-art for estimating such density ratios. Namely,
the optimization problem in GenDICE is not a convex-concave saddle-point
problem once nonlinearity in optimization variable parameterization is
introduced to ensure positivity, so no primal-dual algorithm is guaranteed
to converge to the desired solution. However, such nonlinearity is
essential to ensure the consistency of GenDICE even with a tabular
representation. This is a fundamental contradiction, resulting from GenDICE's
original formulation of the optimization problem. In GradientDICE, we optimize
a different objective from GenDICE by using the Perron-Frobenius theorem and
eliminating GenDICE's use of divergence. Consequently, nonlinearity in
parameterization is not necessary for GradientDICE, which is provably
convergent under linear function approximation.
Comment: ICML 202
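As a concrete illustration of the quantity being estimated, the toy below computes both stationary distributions exactly by power iteration (which converges by the Perron-Frobenius theorem, the same tool invoked above) and forms their ratio; GradientDICE's contribution is estimating this ratio from off-policy samples without ever forming either distribution. The transition matrices are hypothetical numbers chosen only for illustration:

```python
import numpy as np

# Toy 3-state Markov chains induced by a target policy and a sampling
# (behavior) policy. P[s, s'] is the transition probability s -> s'.
# All numbers are hypothetical.
P_target = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.5, 0.5],
                     [0.3, 0.0, 0.7]])
P_behavior = np.array([[0.5, 0.5, 0.0],
                       [0.2, 0.3, 0.5],
                       [0.4, 0.1, 0.5]])

def stationary(P, iters=10_000):
    # Power iteration: an irreducible aperiodic chain has a unique
    # stationary distribution d with d = d P (Perron-Frobenius).
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d

d_pi = stationary(P_target)
d_mu = stationary(P_behavior)
ratio = d_pi / d_mu  # the density ratio tau(s) that GradientDICE estimates
print(ratio)
```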
Learning Retrospective Knowledge with Reverse Reinforcement Learning
We present a Reverse Reinforcement Learning (Reverse RL) approach for
representing retrospective knowledge. General Value Functions (GVFs) have
enjoyed great success in representing predictive knowledge, i.e., answering
questions about possible future outcomes such as "how much fuel will be
consumed in expectation if we drive from A to B?". GVFs, however, cannot answer
questions like "how much fuel do we expect a car to have given it is at B at
time t?". To answer this question, we need to know when that car had a full
tank and how that car came to B. Since such questions emphasize the influence
of possible past events on the present, we refer to their answers as
retrospective knowledge. In this paper, we show how to represent retrospective
knowledge with Reverse GVFs, which are trained via Reverse RL. We demonstrate
empirically the utility of Reverse GVFs in both representation learning and
anomaly detection.
Comment: NeurIPS 202
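A minimal, hypothetical sketch of the reverse-TD idea (not the paper's full algorithm): whereas an ordinary value function bootstraps from the successor state, a Reverse GVF accumulates a discounted sum of past cumulants and bootstraps from the predecessor. The chain, cumulant, and step size below are all illustrative assumptions:

```python
import numpy as np

# Deterministic episodic chain 0 -> 1 -> 2; a cumulant c = 1 is emitted
# on every transition, gamma = 0.5. V[s] estimates the discounted sum of
# *past* cumulants along the trajectory that led into s.
gamma, alpha = 0.5, 0.5
V = np.zeros(3)  # V[0] stays 0: episodes start at state 0 with no past
for _ in range(200):  # episodes
    for s, s_next in [(0, 1), (1, 2)]:
        c = 1.0  # cumulant observed on this transition
        # Reverse TD(0): the target bootstraps from the predecessor state.
        V[s_next] += alpha * (c + gamma * V[s] - V[s_next])

print(V)  # approaches [0.0, 1.0, 1.5]
```

Here the analytic reverse values are V[1] = 1 and V[2] = 1 + gamma * V[1] = 1.5, which the update recovers.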
Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
We present the first provably convergent two-timescale off-policy
actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is
the introduction of a new critic, the emphasis critic, which is trained via
Gradient Emphasis Learning (GEM), a novel combination of the key ideas of
Gradient Temporal Difference Learning and Emphatic Temporal Difference
Learning. With the help of the emphasis critic and the canonical value function
critic, we show convergence for COF-PAC, where the critics are linear and the
actor can be nonlinear.
Comment: ICML 202
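The two-timescale structure can be illustrated in isolation (this is a generic stochastic-approximation toy, not COF-PAC itself): the fast iterate plays the role of a critic tracking its equilibrium given the slow iterate, so the slow "actor" update effectively sees a converged fast component. Step-size schedules and noise scales are assumptions:

```python
import numpy as np

# Two-timescale stochastic approximation: y (fast, "critic") tracks x;
# x (slow, "actor") descends using the tracked y. With b_n = o(a_n) and
# Robbins-Monro step sizes, both iterates converge to the fixed point 0.
rng = np.random.default_rng(0)
x, y = 1.0, 0.0
for n in range(1, 200_001):
    a_n = 1.0 / n ** 0.6   # fast step size
    b_n = 1.0 / n          # slow step size, decays faster than a_n
    noise = rng.normal(scale=0.1, size=2)
    y += a_n * (x - y + noise[0])  # fast: y tracks the current x
    x += b_n * (-y + noise[1])     # slow: x moves using the tracked y
print(x, y)
```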
A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation
Marginalized importance sampling (MIS), which measures the density ratio
between the state-action occupancy of a target policy and that of a sampling
distribution, is a promising approach for off-policy evaluation. However,
current state-of-the-art MIS methods rely on complex optimization tricks and
succeed mostly on simple toy problems. We bridge the gap between MIS and deep
reinforcement learning by observing that the density ratio can be computed from
the successor representation of the target policy. The successor representation
can be trained through deep reinforcement learning methodology and decouples
the reward optimization from the dynamics of the environment, making the
resulting algorithm stable and applicable to high-dimensional domains. We
evaluate the empirical performance of our approach on a variety of challenging
Atari and MuJoCo environments.
Comment: ICML 202
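The observation above has a closed form in the tabular case, sketched below: the successor representation of the target policy yields its discounted state occupancy, and dividing by the sampling distribution gives the marginalized importance weights. The transition matrix and sampling distribution are hypothetical, and the paper learns the SR with deep RL rather than by matrix inversion:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.5, 0.0, 0.5]])  # target-policy transition matrix (toy)
d0 = np.array([1.0, 0.0, 0.0])     # initial state distribution

# Successor representation: Psi[s, s'] = E[sum_t gamma^t 1{S_t = s'} | S_0 = s]
Psi = np.linalg.inv(np.eye(3) - gamma * P_pi)
d_pi = (1 - gamma) * d0 @ Psi      # discounted occupancy of the target policy

d_mu = np.array([0.4, 0.3, 0.3])   # sampling distribution (assumed known here)
ratio = d_pi / d_mu                # marginalized importance weights
print(ratio)
```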
Combining local and global optimization for planning and control in information space
Thesis (S.M.)--Massachusetts Institute of Technology, Computation for Design and Optimization Program, 2008. Includes bibliographical references (leaves 99-102).
This thesis presents a novel algorithm, the parametric optimized belief roadmap (POBRM), for planning a trajectory to control a robot with imperfect state information under uncertainty. The problem is formulated abstractly as a partially observable stochastic shortest path (POSSP) problem. We assume that a feature-based map of the region is available to assist the robot's decision-making. POBRM is a two-phase algorithm that combines local and global optimization. In an offline phase, we construct a belief graph by probabilistically sampling points around the features that potentially provide the robot with valuable information. Each edge of the belief graph stores two transfer functions that predict the cost and the conditional covariance matrix of the final state estimate if the robot follows that edge, given an initial mean and covariance. In an online phase, a sub-optimal trajectory is found by a global Dijkstra search, which balances exploration and exploitation. Moreover, we use the iterative linear quadratic Gaussian (iLQG) algorithm to find a local feedback control policy in continuous state and control spaces for traversing the sub-optimal trajectory. We show that, under suitable technical assumptions, an error bound on the sub-optimal cost relative to the globally optimal cost can be obtained. The POBRM algorithm is not only robust to imperfect state information but also scalable, finding trajectories quickly in high-dimensional systems and environments. In addition, it can answer multiple queries efficiently. We demonstrate performance in a 2D simulation of a planar car and a 3D simulation of an autonomous helicopter.
by Vu Anh Huynh.
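The online phase's global search is a standard shortest-path pass over the belief graph; a minimal sketch of that step alone is below. Node names and edge costs are hypothetical placeholders, whereas in POBRM each edge cost would come from the transfer functions computed in the offline phase:

```python
import heapq

# Toy belief graph: node -> list of (neighbor, edge cost).
graph = {
    "start": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("goal", 6.0)],
    "b": [("goal", 2.0)],
    "goal": [],
}

def dijkstra(graph, source):
    # Classic Dijkstra with a binary heap; returns shortest costs
    # from source to every reachable node.
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

print(dijkstra(graph, "start")["goal"])  # 5.0 via start -> a -> b -> goal
```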