Reinforcement Learning with Parameterized Actions
We introduce a model-free algorithm for learning in Markov decision processes
with parameterized actions: discrete actions with continuous parameters. At each
step the agent must select both which action to use and which parameters to use
with that action. We introduce the Q-PAMDP algorithm for learning in these
domains, show that it converges to a local optimum, and compare it to direct
policy search in the goal-scoring and Platform domains. Comment: Accepted for AAAI 201
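To make the idea of a parameterized action concrete, here is a minimal sketch of how an agent might pick both a discrete action and its continuous parameters. It is not the Q-PAMDP algorithm from the abstract; the names `ParameterizedAction`, `select_action`, `q_values`, and `param_policies` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class ParameterizedAction:
    """A discrete action paired with its continuous parameters (illustrative type)."""
    name: str
    params: Tuple[float, ...]


def select_action(
    state: Sequence[float],
    q_values: Dict[str, float],
    param_policies: Dict[str, Callable[[Sequence[float]], Sequence[float]]],
) -> ParameterizedAction:
    """Pick the discrete action greedily, then ask its parameter policy for parameters.

    Assumption: q_values already holds one value estimate per discrete action
    for the current state; how those estimates are learned is out of scope here.
    """
    name = max(q_values, key=q_values.get)       # which action to use
    params = tuple(param_policies[name](state))  # which parameters to use with it
    return ParameterizedAction(name, params)


# Toy usage with made-up actions and parameter policies.
state = (0.5,)
q_values = {"kick": 1.0, "run": 2.0}
param_policies = {
    "kick": lambda s: (s[0] * 2.0,),
    "run": lambda s: (s[0] + 1.0,),
}
chosen = select_action(state, q_values, param_policies)
```

In this toy setup the greedy choice is `run`, and its parameter policy supplies the single continuous parameter.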
CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning
In open-ended environments, autonomous learning agents must set their own
goals and build their own curriculum through an intrinsically motivated
exploration. They may consider a large diversity of goals, aiming to discover
what is controllable in their environments, and what is not. Because some goals
might prove easy and some impossible, agents must actively select which goal to
practice at any moment, to maximize their overall mastery on the set of
learnable goals. This paper proposes CURIOUS, an algorithm that leverages 1) a
modular Universal Value Function Approximator with hindsight learning to
achieve a diversity of goals of different kinds within a unique policy and 2)
an automated curriculum learning mechanism that biases the attention of the
agent towards goals maximizing the absolute learning progress. Agents focus
sequentially on goals of increasing complexity, and focus back on goals that
are being forgotten. Experiments conducted in a new modular-goal robotic
environment show the resulting developmental self-organization of a learning
curriculum, and demonstrate properties of robustness to distracting goals,
forgetting and changes in body properties. Comment: Accepted at ICML 201
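The curriculum mechanism described above, which biases attention towards goals with high absolute learning progress, can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the windowed estimate of learning progress, the `epsilon` uniform-mixing term, and the function names are all assumptions.

```python
from typing import List, Sequence


def absolute_learning_progress(successes: Sequence[float], window: int = 4) -> float:
    """|mean of the latest window - mean of the window before it| (assumed estimator)."""
    if len(successes) < 2 * window:
        return 0.0  # not enough history to estimate progress
    recent = successes[-window:]
    older = successes[-2 * window:-window]
    return abs(sum(recent) / window - sum(older) / window)


def goal_probabilities(
    histories: Sequence[Sequence[float]], epsilon: float = 0.2, window: int = 4
) -> List[float]:
    """Mix a uniform distribution with one proportional to absolute learning progress."""
    lps = [absolute_learning_progress(h, window) for h in histories]
    n = len(lps)
    total = sum(lps)
    if total == 0:
        return [1.0 / n] * n  # no measurable progress anywhere: sample uniformly
    return [epsilon / n + (1.0 - epsilon) * lp / total for lp in lps]


# Toy success histories (1 = goal reached) for three hypothetical goals.
histories = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # being learned: competence rising
    [1, 1, 1, 1, 1, 1, 1, 1],  # already mastered: no progress
    [1, 1, 1, 1, 0, 0, 0, 0],  # being forgotten: competence falling
]
probs = goal_probabilities(histories)
```

Because the absolute value of the progress is used, the goal being forgotten receives as much attention as the goal being learned, while the mastered goal is sampled only through the uniform term.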
Model Learning for Look-ahead Exploration in Continuous Control
We propose an exploration method that incorporates look-ahead search over
basic learnt skills and their dynamics, and use it for reinforcement learning
(RL) of manipulation policies. Our skills are multi-goal policies learned in
isolation in simpler environments using existing multigoal RL formulations,
analogous to options or macroactions. Coarse skill dynamics, i.e., the state
transition caused by a (complete) skill execution, are learnt and are unrolled
forward during lookahead search. Policy search benefits from temporal
abstraction during exploration, though itself operates over low-level primitive
actions, and thus the resulting policies do not suffer from suboptimality and
inflexibility caused by coarse skill chaining. We show that the proposed
exploration strategy results in effective learning of complex manipulation
policies faster than current state-of-the-art RL methods, and converges to
better policies than methods that use options or parametrized skills as
building blocks of the policy itself, as opposed to guiding exploration.
Comment: This is a pre-print of our paper which is accepted at AAAI 201
DAC: The Double Actor-Critic Architecture for Learning Options
We reformulate the option framework as two parallel augmented MDPs. Under
this novel formulation, all policy optimization algorithms can be used off the
shelf to learn intra-option policies, option termination conditions, and a
master policy over options. We apply an actor-critic algorithm on each
augmented MDP, yielding the Double Actor-Critic (DAC) architecture.
Furthermore, we show that, when state-value functions are used as critics, one
critic can be expressed in terms of the other, and hence only one critic is
necessary. We conduct an empirical study on challenging robot simulation tasks.
In a transfer learning setting, DAC outperforms both its hierarchy-free
counterpart and previous gradient-based option learning algorithms. Comment: NeurIPS 201