5,358 research outputs found
A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation
Marginalized importance sampling (MIS), which measures the density ratio
between the state-action occupancy of a target policy and that of a sampling
distribution, is a promising approach for off-policy evaluation. However,
current state-of-the-art MIS methods rely on complex optimization tricks and
succeed mostly on simple toy problems. We bridge the gap between MIS and deep
reinforcement learning by observing that the density ratio can be computed from
the successor representation of the target policy. The successor representation
can be trained through deep reinforcement learning methodology and decouples
the reward optimization from the dynamics of the environment, making the
resulting algorithm stable and applicable to high-dimensional domains. We
evaluate the empirical performance of our approach on a variety of challenging
Atari and MuJoCo environments.
Comment: ICML 2021
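The connection the abstract draws is easiest to see in the tabular case, where the successor representation has a closed form. The sketch below is a minimal illustration of that identity, not the paper's deep-RL algorithm; the matrix P_pi, the distributions d0 and d_D, and the 2x2 example are hypothetical.

```python
import numpy as np

# Tabular sketch (illustrative, not the paper's method): over state-action pairs,
# the successor representation of a policy pi is M_pi = (I - gamma * P_pi)^{-1},
# the discounted occupancy is d_pi = (1 - gamma) * d0 @ M_pi, and the MIS
# density ratio is d_pi / d_D elementwise.

def occupancy_from_sr(P_pi, d0, gamma):
    """Discounted state-action occupancy of pi, read off its successor representation.

    P_pi : (n, n) transition matrix over state-action pairs induced by pi.
    d0   : (n,) initial state-action distribution (initial states paired with pi).
    """
    n = P_pi.shape[0]
    sr = np.linalg.inv(np.eye(n) - gamma * P_pi)  # successor representation M_pi
    return (1.0 - gamma) * d0 @ sr                # d_pi(s, a)

def mis_ratio(P_pi, d0, d_D, gamma):
    """Marginalized importance weights w(s, a) = d_pi(s, a) / d_D(s, a)."""
    d_pi = occupancy_from_sr(P_pi, d0, gamma)
    return d_pi / np.clip(d_D, 1e-12, None)

# Tiny usage example on a two-element state-action space.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
d0 = np.array([1.0, 0.0])
d_D = np.array([0.5, 0.5])
print(mis_ratio(P_pi, d0, d_D, gamma=0.95))  # -> roughly [1.43, 0.57]
```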
Marginalized Importance Sampling for Off-Environment Policy Evaluation
Reinforcement Learning (RL) methods are typically sample-inefficient, making
it challenging to train and deploy RL policies on real-world robots. Even a
robust policy trained in simulation requires real-world deployment to assess
its performance. This paper proposes a new approach to evaluate the
real-world performance of agent policies without deploying them in the real
world. The proposed approach incorporates a simulator along with real-world
offline data to evaluate the performance of any policy using the framework of
Marginalized Importance Sampling (MIS). Existing MIS methods face two
challenges: (1) large density ratios that deviate from a reasonable range and
(2) indirect supervision, where the ratio needs to be inferred indirectly, thus
exacerbating estimation error. Our approach addresses these challenges by
introducing the target policy's occupancy in the simulator as an intermediate
variable and learning the density ratio as the product of two terms that can be
learned separately. The first term is learned with direct supervision and the
second term has a small magnitude, which makes it easier to estimate. We
analyze the sample complexity as well as the error propagation of our two-step
procedure.
Furthermore, we empirically evaluate our approach on Sim2Sim environments such
as Cartpole, Reacher and Half-Cheetah. Our results show that our method
generalizes well across a variety of Sim2Sim gaps, target policies and offline
data collection policies. We also demonstrate the performance of our algorithm
on a Sim2Real task of validating the performance of a 7-DOF robotic arm using
offline data along with a Gazebo-based arm simulator.
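Spelled out, the two-term factorization described above can be read as follows; the notation (d^pi_real, d^pi_sim, d_D) and the assignment of "direct supervision" and "small magnitude" to the two factors are assumptions based on the abstract, not taken verbatim from the paper:

\[
w(s,a) \;=\; \frac{d^{\pi}_{\text{real}}(s,a)}{d_{\mathcal{D}}(s,a)}
\;=\; \frac{d^{\pi}_{\text{sim}}(s,a)}{d_{\mathcal{D}}(s,a)}
\;\times\; \frac{d^{\pi}_{\text{real}}(s,a)}{d^{\pi}_{\text{sim}}(s,a)},
\]

where the first factor admits direct supervision (the target policy can be rolled out in the simulator to sample from d^pi_sim), and the second stays close to 1 whenever the Sim2Real gap is small.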
Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning
We propose A-Crab (Actor-Critic Regularized by Average Bellman error), a new
practical algorithm for offline reinforcement learning (RL) in complex
environments with insufficient data coverage. Our algorithm combines the
marginalized importance sampling framework with the actor-critic paradigm,
where the critic returns evaluations of the actor (policy) that are pessimistic
relative to the offline data and have a small average (importance-weighted)
Bellman error. Compared to existing methods, our algorithm simultaneously
offers a number of advantages: (1) It achieves the optimal statistical rate of
1/√N -- where N is the size of the offline dataset -- in converging to
the best policy covered in the offline dataset, even when combined with general
function approximators. (2) It relies on a weaker average notion of policy
coverage (compared to the ℓ∞ single-policy concentrability) that
exploits the structure of policy visitations. (3) It outperforms the
data-collection behavior policy over a wide range of specific hyperparameters.
We provide both theoretical analysis and experimental results to validate the
effectiveness of our proposed algorithm.
Comment: 24 pages, 3 figures
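As a rough illustration of the quantity described above, the snippet below computes an average importance-weighted Bellman residual over an offline batch. It is a hedged sketch of the evaluation criterion, not the A-Crab training procedure; the array names, shapes, and the random example are assumptions.

```python
import numpy as np

def avg_iw_bellman_error(w, rewards, f_sa, f_next_pi, gamma, dones):
    """Average (importance-weighted) Bellman residual of a candidate critic f.

    w         : (B,) marginalized importance weights, roughly d^pi / d^mu.
    rewards   : (B,) observed rewards from the offline data.
    f_sa      : (B,) critic values f(s, a) at the sampled pairs.
    f_next_pi : (B,) critic values f(s', a') with a' drawn from the actor pi.
    dones     : (B,) 1.0 at terminal transitions, else 0.0.
    """
    td_residual = rewards + gamma * (1.0 - dones) * f_next_pi - f_sa
    return np.mean(w * td_residual)  # signed average, not a squared error

# A pessimistic critic would evaluate the actor as low as possible while keeping
# the magnitude of this average residual small on the offline dataset.
rng = np.random.default_rng(0)
B = 4
print(avg_iw_bellman_error(
    w=rng.uniform(0.5, 2.0, B),
    rewards=rng.normal(size=B),
    f_sa=rng.normal(size=B),
    f_next_pi=rng.normal(size=B),
    gamma=0.99,
    dones=np.zeros(B),
))
```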
- …