Policy Optimization with Model-based Explorations
Model-free reinforcement learning methods such as the Proximal Policy
Optimization algorithm (PPO) have been successfully applied to complex
decision-making problems such as Atari games. However, these methods suffer
from high variance and high sample complexity. On the other hand, model-based
reinforcement learning methods that learn the transition dynamics are more
sample efficient, but they often suffer from bias in the estimated
transitions. How to make use of both model-based and model-free learning is a
central problem in reinforcement learning. In this paper, we present a new
technique to address the trade-off between exploration and exploitation, which
regards the difference between model-free and model-based estimations as a
measure of exploration value. We apply this new technique to the PPO algorithm
and arrive at a new policy optimization method, named Policy Optimization with
Model-based Explorations (POME). POME uses two components to estimate the
actions' target values: a model-free one obtained by Monte-Carlo sampling and
a model-based one which learns a transition model and predicts the value of the
next state. POME adds the discrepancy between these two target estimates as an
additional exploration value for each state-action pair, i.e., it encourages
the algorithm to explore states with larger target errors, which are hard to
estimate. We compare POME with PPO on Atari 2600 games, and the results show
that POME outperforms PPO on 33 out of 49 games.
Comment: Accepted at AAAI-1
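To make the idea concrete, a minimal sketch of such an exploration bonus follows. It assumes 1-D PyTorch tensors from a single rollout; the function and parameter names (e.g. pome_advantages, beta) are illustrative rather than the authors' released code, and POME's exact bonus shaping (such as clipping or centering) is omitted.

import torch

def pome_advantages(rewards, values, model_next_values, gamma=0.99, beta=0.1):
    # Illustrative sketch (not the authors' code) of a POME-style bonus.
    # rewards, values:   shape [T] tensors collected along one trajectory.
    # model_next_values: V(s'_t) where s'_t comes from a learned transition
    #                    model, i.e. the model-based next-state prediction.
    T = rewards.shape[0]

    # Model-free target: discounted Monte-Carlo return at each step.
    mc_return = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        mc_return[t] = running

    # Model-based target: one-step bootstrap through the learned model.
    model_target = rewards + gamma * model_next_values

    # Exploration value: discrepancy between the two target estimates.
    exploration_bonus = (mc_return - model_target).abs()

    # Advantage fed to the PPO surrogate objective, augmented by the bonus.
    advantage = mc_return - values + beta * exploration_bonus
    return advantage, exploration_bonus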
Learn to Interpret Atari Agents
Deep Reinforcement Learning (DeepRL) agents surpass human-level performance
in a multitude of tasks. However, the direct mapping from states to actions
makes it hard to interpret the rationale behind agents' decision making.
In contrast to previous a posteriori methods of visualizing DeepRL policies, we
propose an end-to-end trainable framework based on Rainbow, a representative
Deep Q-Network (DQN) agent. Our method automatically learns important regions
in the input domain, which enables characterization of the decision making and
interpretation of non-intuitive behaviors. Hence we name it Region Sensitive
Rainbow (RS-Rainbow). RS-Rainbow utilizes a simple yet effective mechanism to
incorporate visualization ability into the learning model, not only improving
model interpretability but also leading to improved performance. Extensive
experiments on the challenging platform of Atari 2600 demonstrate the
superiority of RS-Rainbow. In particular, our agent achieves state-of-the-art
performance using just 25% of the training frames. Demonstrations and code are
available at
https://github.com/yz93/Learn-to-Interpret-Atari-Agents
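As a rough illustration of the mechanism (not the released RS-Rainbow code; module and parameter names are assumed), a region-sensitive layer can be sketched as a learned soft spatial mask over the convolutional features, which both gates the features fed to the Q-network head and can be rendered as a saliency map:

import torch
import torch.nn as nn

class RegionSensitiveGate(nn.Module):
    # Hypothetical sketch: learn a soft spatial mask over conv features.
    def __init__(self, channels):
        super().__init__()
        # A 1x1 conv scores each spatial location; softmax over H*W
        # turns the scores into an attention mask.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, features):
        # features: [B, C, H, W] from the convolutional torso of the agent.
        b, c, h, w = features.shape
        logits = self.score(features).view(b, -1)              # [B, H*W]
        mask = torch.softmax(logits, dim=-1).view(b, 1, h, w)
        gated = features * mask                                # gate the features
        # `mask` can be upsampled and overlaid on the input frame to
        # visualize which regions drove the decision.
        return gated, mask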
Feature reinforcement learning: state of the art
Feature reinforcement learning was introduced five years ago as a principled and practical approach to history-based learning. This paper examines the progress since its inception. We now have both model-based and model-free cost functions, most recently extended to the function approximation setting. Our current work is geared towards playing Atari games using imitation learning, where we use Feature RL as a feature selection method for high-dimensional domains.
IGN: Implicit Generative Networks
In this work, we build on recent advances in distributional reinforcement
learning to give a state-of-the-art distributional variant of the model based
on IQN. We achieve this by combining the generator and discriminator of a GAN
with quantile regression to approximate the full quantile function of the
state-action return distribution. We demonstrate improved performance on our
baseline benchmark of 57 Atari 2600 games in the ALE. We also use our
algorithm to show state-of-the-art training performance of risk-sensitive
policies in Atari games with policy optimization and evaluation.
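For context, the quantile regression (Huber pinball) loss that this family of distributional methods fits to the return distribution can be sketched as follows; tensor shapes and names are assumed for illustration and this is not the authors' implementation:

import torch

def quantile_huber_loss(pred_quantiles, target_samples, taus, kappa=1.0):
    # pred_quantiles: [B, N] predicted quantile values of the return.
    # target_samples: [B, M] target return samples (or target quantiles).
    # taus:           [B, N] quantile fractions in (0, 1) for the predictions.

    # Pairwise TD-style errors between targets and predicted quantiles.
    diff = target_samples.unsqueeze(1) - pred_quantiles.unsqueeze(2)  # [B, N, M]
    # Huber smoothing of the errors.
    huber = torch.where(diff.abs() <= kappa,
                        0.5 * diff.pow(2),
                        kappa * (diff.abs() - 0.5 * kappa))
    # Asymmetric weighting by |tau - 1{diff < 0}| gives the pinball loss.
    weight = (taus.unsqueeze(2) - (diff.detach() < 0).float()).abs()
    return (weight * huber / kappa).sum(dim=1).mean()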
Verifiable Reinforcement Learning via Policy Extraction
While deep reinforcement learning has successfully solved many challenging
control tasks, its real-world applicability has been limited by the inability
to ensure the safety of learned policies. We propose an approach to verifiable
reinforcement learning by training decision tree policies, which can represent
complex policies (since they are nonparametric), yet can be efficiently
verified using existing techniques (since they are highly structured). The
challenge is that decision tree policies are difficult to train. We propose
VIPER, an algorithm that combines ideas from model compression and imitation
learning to learn decision tree policies guided by a DNN policy (called the
oracle) and its Q-function, and show that it substantially outperforms two
baselines. We use VIPER to (i) learn a provably robust decision tree policy for
a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree
policy for a toy game based on Pong that provably never loses, and (iii) learn
a provably stable decision tree policy for cart-pole. In each case, the
decision tree policy achieves performance equal to that of the original DNN
policy.
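A minimal sketch of the extraction loop appears below, assuming a Gymnasium-style environment and hypothetical oracle_policy / oracle_q callables for the DNN oracle; VIPER's actual algorithm additionally resamples the aggregated dataset according to these Q-value weights and keeps the best tree found across iterations.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_tree_policy(env, oracle_policy, oracle_q, n_iters=10, horizon=1000):
    # Illustrative DAgger-style distillation of a DNN oracle into a tree.
    # oracle_policy(state) -> action chosen by the trained DNN oracle.
    # oracle_q(state)      -> vector of oracle Q-values over actions.
    xs, ys, ws = [], [], []
    tree = None
    for _ in range(n_iters):
        state, _ = env.reset()
        for _ in range(horizon):
            # Visit states under the current tree (oracle on the first
            # iteration), but always label them with the oracle's action.
            if tree is None:
                action = oracle_policy(state)
            else:
                action = int(tree.predict(np.asarray(state).reshape(1, -1))[0])
            q = np.asarray(oracle_q(state))
            xs.append(np.asarray(state).ravel())
            ys.append(oracle_policy(state))
            # Weight each state by how costly a wrong action would be there.
            ws.append(float(q.max() - q.min()))
            state, _, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
        tree = DecisionTreeClassifier(max_depth=10)
        tree.fit(np.array(xs), np.array(ys), sample_weight=np.array(ws))
    return tree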