GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
We present GradientDICE for estimating the density ratio between the state
distribution of the target policy and the sampling distribution in off-policy
reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang
et al., 2020), the state-of-the-art for estimating such density ratios. Namely,
the optimization problem in GenDICE is not a convex-concave saddle-point
problem once nonlinearity in optimization variable parameterization is
introduced to ensure positivity, so no primal-dual algorithm is guaranteed
to converge to the desired solution. However, such nonlinearity is
essential to ensure the consistency of GenDICE even with a tabular
representation. This is a fundamental contradiction, resulting from GenDICE's
original formulation of the optimization problem. In GradientDICE, we optimize
a different objective from GenDICE by using the Perron-Frobenius theorem and
eliminating GenDICE's use of divergence. Consequently, nonlinearity in
parameterization is not necessary for GradientDICE, which is provably
convergent under linear function approximation.
Comment: ICML 202
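As a concrete illustration of the quantity being estimated, the toy below computes both stationary distributions exactly by power iteration (which converges by the Perron-Frobenius theorem, the same tool invoked above) and forms their ratio; GradientDICE's contribution is estimating this ratio from off-policy samples without ever forming either distribution. The transition matrices are hypothetical numbers chosen only for illustration:

```python
import numpy as np

# Toy 3-state Markov chains induced by a target policy and a sampling
# (behavior) policy. P[s, s'] is the transition probability s -> s'.
# All numbers are hypothetical.
P_target = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.5, 0.5],
                     [0.3, 0.0, 0.7]])
P_behavior = np.array([[0.5, 0.5, 0.0],
                       [0.2, 0.3, 0.5],
                       [0.4, 0.1, 0.5]])

def stationary(P, iters=10_000):
    # Power iteration: an irreducible aperiodic chain has a unique
    # stationary distribution d with d = d P (Perron-Frobenius).
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d

d_pi = stationary(P_target)
d_mu = stationary(P_behavior)
ratio = d_pi / d_mu  # the density ratio tau(s) that GradientDICE estimates
print(ratio)
```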
Learning Retrospective Knowledge with Reverse Reinforcement Learning
We present a Reverse Reinforcement Learning (Reverse RL) approach for
representing retrospective knowledge. General Value Functions (GVFs) have
enjoyed great success in representing predictive knowledge, i.e., answering
questions about possible future outcomes such as "how much fuel will be
consumed in expectation if we drive from A to B?". GVFs, however, cannot answer
questions like "how much fuel do we expect a car to have given it is at B at
time t?". To answer this question, we need to know when that car had a full
tank and how that car came to B. Since such questions emphasize the influence
of possible past events on the present, we refer to their answers as
retrospective knowledge. In this paper, we show how to represent retrospective
knowledge with Reverse GVFs, which are trained via Reverse RL. We demonstrate
empirically the utility of Reverse GVFs in both representation learning and
anomaly detection.
Comment: NeurIPS 202
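A minimal, hypothetical sketch of the reverse-TD idea (not the paper's full algorithm): whereas an ordinary value function bootstraps from the successor state, a Reverse GVF accumulates a discounted sum of past cumulants and bootstraps from the predecessor. The chain, cumulant, and step size below are all illustrative assumptions:

```python
import numpy as np

# Deterministic episodic chain 0 -> 1 -> 2; a cumulant c = 1 is emitted
# on every transition, gamma = 0.5. V[s] estimates the discounted sum of
# *past* cumulants along the trajectory that led into s.
gamma, alpha = 0.5, 0.5
V = np.zeros(3)  # V[0] stays 0: episodes start at state 0 with no past
for _ in range(200):  # episodes
    for s, s_next in [(0, 1), (1, 2)]:
        c = 1.0  # cumulant observed on this transition
        # Reverse TD(0): the target bootstraps from the predecessor state.
        V[s_next] += alpha * (c + gamma * V[s] - V[s_next])

print(V)  # approaches [0.0, 1.0, 1.5]
```

Here the analytic reverse values are V[1] = 1 and V[2] = 1 + gamma * V[1] = 1.5, which the update recovers.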
Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
We present the first provably convergent two-timescale off-policy
actor-critic algorithm (COF-PAC) with function approximation. Key to COF-PAC is
the introduction of a new critic, the emphasis critic, which is trained via
Gradient Emphasis Learning (GEM), a novel combination of the key ideas of
Gradient Temporal Difference Learning and Emphatic Temporal Difference
Learning. With the help of the emphasis critic and the canonical value function
critic, we show convergence for COF-PAC, where the critics are linear and the
actor can be nonlinear.
Comment: ICML 202
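The two-timescale structure can be illustrated in isolation (this is a generic stochastic-approximation toy, not COF-PAC itself): the fast iterate plays the role of a critic tracking its equilibrium given the slow iterate, so the slow "actor" update effectively sees a converged fast component. Step-size schedules and noise scales are assumptions:

```python
import numpy as np

# Two-timescale stochastic approximation: y (fast, "critic") tracks x;
# x (slow, "actor") descends using the tracked y. With b_n = o(a_n) and
# Robbins-Monro step sizes, both iterates converge to the fixed point 0.
rng = np.random.default_rng(0)
x, y = 1.0, 0.0
for n in range(1, 200_001):
    a_n = 1.0 / n ** 0.6   # fast step size
    b_n = 1.0 / n          # slow step size, decays faster than a_n
    noise = rng.normal(scale=0.1, size=2)
    y += a_n * (x - y + noise[0])  # fast: y tracks the current x
    x += b_n * (-y + noise[1])     # slow: x moves using the tracked y
print(x, y)
```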
A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation
Marginalized importance sampling (MIS), which measures the density ratio
between the state-action occupancy of a target policy and that of a sampling
distribution, is a promising approach for off-policy evaluation. However,
current state-of-the-art MIS methods rely on complex optimization tricks and
succeed mostly on simple toy problems. We bridge the gap between MIS and deep
reinforcement learning by observing that the density ratio can be computed from
the successor representation of the target policy. The successor representation
can be trained through deep reinforcement learning methodology and decouples
the reward optimization from the dynamics of the environment, making the
resulting algorithm stable and applicable to high-dimensional domains. We
evaluate the empirical performance of our approach on a variety of challenging
Atari and MuJoCo environments.
Comment: ICML 202
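The observation above has a closed form in the tabular case, sketched below: the successor representation of the target policy yields its discounted state occupancy, and dividing by the sampling distribution gives the marginalized importance weights. The transition matrix and sampling distribution are hypothetical, and the paper learns the SR with deep RL rather than by matrix inversion:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.5, 0.0, 0.5]])  # target-policy transition matrix (toy)
d0 = np.array([1.0, 0.0, 0.0])     # initial state distribution

# Successor representation: Psi[s, s'] = E[sum_t gamma^t 1{S_t = s'} | S_0 = s]
Psi = np.linalg.inv(np.eye(3) - gamma * P_pi)
d_pi = (1 - gamma) * d0 @ Psi      # discounted occupancy of the target policy

d_mu = np.array([0.4, 0.3, 0.3])   # sampling distribution (assumed known here)
ratio = d_pi / d_mu                # marginalized importance weights
print(ratio)
```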
Combining local and global optimization for planning and control in information space
Thesis (S.M.)--Massachusetts Institute of Technology, Computation for Design and Optimization Program, 2008. Includes bibliographical references (leaves 99-102).
This thesis presents a novel algorithm, the parametric optimized belief roadmap (POBRM), for planning a trajectory to control a robot with imperfect state information under uncertainty. The problem is formulated abstractly as a partially observable stochastic shortest path (POSSP) problem. We assume that a feature-based map of the region is available to assist the robot's decision-making. POBRM is a two-phase algorithm that combines local and global optimization. In an offline phase, we construct a belief graph by probabilistically sampling points around the features that potentially provide the robot with valuable information. Each edge of the belief graph stores two transfer functions that predict the cost and the conditional covariance matrix of the final state estimate if the robot follows that edge, given an initial mean and covariance. In an online phase, a sub-optimal trajectory is found by a global Dijkstra search, which balances exploration and exploitation. Moreover, we use the iterative linear quadratic Gaussian (iLQG) algorithm to find a local feedback control policy in continuous state and control spaces for traversing the sub-optimal trajectory. We show that, under suitable technical assumptions, an error bound on the sub-optimal cost relative to the globally optimal cost can be obtained. The POBRM algorithm is not only robust to imperfect state information but also scalable, finding trajectories quickly in high-dimensional systems and environments. In addition, it can answer multiple queries efficiently. We demonstrate performance in a 2D simulation of a planar car and a 3D simulation of an autonomous helicopter.
by Vu Anh Huynh.
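The online phase's global search is a standard shortest-path pass over the belief graph; a minimal sketch of that step alone is below. Node names and edge costs are hypothetical placeholders, whereas in POBRM each edge cost would come from the transfer functions computed in the offline phase:

```python
import heapq

# Toy belief graph: node -> list of (neighbor, edge cost).
graph = {
    "start": [("a", 2.0), ("b", 5.0)],
    "a": [("b", 1.0), ("goal", 6.0)],
    "b": [("goal", 2.0)],
    "goal": [],
}

def dijkstra(graph, source):
    # Classic Dijkstra with a binary heap; returns shortest costs
    # from source to every reachable node.
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

print(dijkstra(graph, "start")["goal"])  # 5.0 via start -> a -> b -> goal
```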