Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning
Off-policy learning is less stable than on-policy learning in reinforcement
learning (RL). One source of this instability is the discrepancy between the
target (π) and behavior (b) policy distributions. The discrepancy between the
π and b distributions can be alleviated by employing a smooth variant of
importance sampling (IS), such as relative importance sampling (RIS), which
has a parameter that controls the degree of smoothing. To cope with this
instability, we present the first relative importance sampling off-policy
actor-critic (RIS-Off-PAC) model-free algorithms in RL. In our method, the
network yields a target policy (the actor) and a value function (the critic)
that assesses the current policy (π) using samples drawn from the behavior
policy. We use the action value generated by the behavior policy, rather than
by the target policy, in the reward function to train our algorithm. We also
use deep neural networks to train both the actor and the critic. We evaluate
our algorithm on a number of OpenAI Gym benchmark problems and demonstrate
performance that is better than or comparable to several state-of-the-art RL
baselines.
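A minimal sketch of what such a smoothed weight could look like, assuming the
relative density-ratio form w = pi / (beta * pi + (1 - beta) * b) with a
smoothing parameter beta in [0, 1]; the symbol beta and this exact form are
assumptions for illustration, not taken from the abstract:

import numpy as np

def relative_is_weight(pi_prob, b_prob, beta):
    """Relative importance sampling (RIS) weight -- a sketch.

    Assumes the relative density-ratio form
        w = pi / (beta * pi + (1 - beta) * b),  beta in [0, 1].
    beta = 0 recovers the ordinary IS ratio pi / b, which can
    explode when b rarely takes the action; beta > 0 smooths the
    weight and bounds it above by 1 / beta.
    """
    return pi_prob / (beta * pi_prob + (1.0 - beta) * b_prob)

# Action that the target policy favors but the behavior policy rarely takes:
pi_prob, b_prob = 0.9, 0.01
print(relative_is_weight(pi_prob, b_prob, beta=0.0))  # 90.0  (plain IS, high variance)
print(relative_is_weight(pi_prob, b_prob, beta=0.5))  # ~1.98 (smoothed)

Under this form, beta trades variance for bias: the weight can never exceed
1 / beta, at the cost of no longer being an unbiased correction.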
Sampled Policy Gradient for Learning to Play the Game Agar.io
In this paper, a new offline actor-critic learning algorithm is introduced:
Sampled Policy Gradient (SPG). SPG samples in the action space to calculate an
approximated policy gradient by using the critic to evaluate the samples. This
sampling allows SPG to search the action-Q-value space more globally than
deterministic policy gradient (DPG), enabling it to theoretically avoid more
local optima. SPG is compared to Q-learning and the actor-critic algorithms
CACLA and DPG in a pellet-collection task and a self-play environment in the
game Agar.io. The online game Agar.io has become massively popular on the
internet due to its intuitive game design and the ability to instantly compete
against players around the world. From the point of view of artificial
intelligence, the game is also intriguing: it has continuous input and action
spaces and lets diverse agents with complex strategies compete against each
other. The experimental results show that Q-learning and CACLA outperform a
pre-programmed greedy bot in the pellet-collection task, but all algorithms
fail to outperform this bot in a fighting scenario. Analysis shows that SPG is
readily extensible through offline exploration, and that it matches DPG in
performance even in its basic form, without extensive sampling.
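A rough illustration of the sampling idea, as a sketch under assumptions
rather than the paper's exact procedure (the helper spg_action_target, the
Gaussian perturbation scale, and the stand-in actor/critic are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def spg_action_target(state, actor, critic, n_samples=16, sigma=0.3):
    """Sampled-policy-gradient-style target (sketch): perturb the
    actor's action with Gaussian noise, score every candidate with
    the critic, and return the best candidate as a regression
    target for the actor."""
    base = actor(state)
    candidates = [base] + [
        np.clip(base + sigma * rng.standard_normal(base.shape), -1.0, 1.0)
        for _ in range(n_samples)
    ]
    scores = [critic(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy 1-D check with stand-in actor/critic (both hypothetical):
actor = lambda s: np.zeros(1)                  # always proposes a = 0
critic = lambda s, a: -((a - 0.5) ** 2).sum()  # Q peaks at a = 0.5
print(spg_action_target(np.zeros(3), actor, critic))  # close to 0.5

Because the search is driven by samples rather than by the critic's gradient
at a single point, it can step over poor local optima that would trap a
DPG-style update.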
CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning
In open-ended environments, autonomous learning agents must set their own
goals and build their own curriculum through intrinsically motivated
exploration. They may consider a large diversity of goals, aiming to discover
what is controllable in their environments, and what is not. Because some goals
might prove easy and some impossible, agents must actively select which goal to
practice at any moment, to maximize their overall mastery on the set of
learnable goals. This paper proposes CURIOUS, an algorithm that leverages 1) a
modular Universal Value Function Approximator with hindsight learning to
achieve a diversity of goals of different kinds within a unique policy and 2)
an automated curriculum learning mechanism that biases the attention of the
agent towards goals maximizing the absolute learning progress. Agents focus
sequentially on goals of increasing complexity, and focus back on goals that
are being forgotten. Experiments conducted in a new modular-goal robotic
environment show the resulting developmental self-organization of a learning
curriculum, and demonstrate properties of robustness to distracting goals,
forgetting, and changes in body properties. Comment: Accepted at ICML 2019.
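A toy sketch of the curriculum bias via absolute learning progress (ALP), in
the spirit of the mechanism described above; the windowed success-rate
estimator and the uniform exploration mix are illustrative assumptions, not
the paper's exact rule:

import numpy as np

rng = np.random.default_rng(0)

def select_module(success_history, window=50, eps=0.2):
    """Goal-module selection biased by absolute learning progress
    (ALP) -- a sketch.  ALP per module is estimated as
    |recent success rate - older success rate|; modules are then
    sampled proportionally to ALP, mixed with an eps share of
    uniform exploration so forgotten modules get revisited."""
    alp = []
    for h in success_history:  # one list of 0/1 outcomes per goal module
        recent = np.mean(h[-window:]) if h else 0.0
        older = np.mean(h[-2 * window:-window]) if len(h) > window else 0.0
        alp.append(abs(recent - older))
    alp = np.asarray(alp)
    prop = alp / alp.sum() if alp.sum() > 0 else np.full(len(alp), 1.0 / len(alp))
    probs = eps / len(alp) + (1.0 - eps) * prop
    return int(rng.choice(len(alp), p=probs))

# Module 0 is mastered, module 1 is improving, module 2 seems impossible:
history = [[1] * 100, [0] * 50 + [1] * 50, [0] * 100]
print(select_module(history))  # usually 1: highest learning progress

Using the absolute value of progress means both improving and regressing
modules attract attention, which is what lets the agent return to goals it is
forgetting.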