Transfer of Deep Reactive Policies for MDP Planning
Domain-independent probabilistic planners input an MDP description in a
factored representation language such as PPDDL or RDDL, and exploit the
specifics of the representation for faster planning. Traditional algorithms
operate on each problem instance independently, and good methods for
transferring experience from policies of other instances of a domain to a new
instance do not exist. Recently, researchers have begun exploring the use of
deep reactive policies, trained via deep reinforcement learning (RL), for MDP
planning domains. One advantage of deep reactive policies is that they are more
amenable to transfer learning.
In this paper, we present the first domain-independent transfer algorithm for
MDP planning domains expressed in an RDDL representation. Our architecture
exploits the symbolic state configuration and transition function of the domain
(available via RDDL) to learn a shared embedding space for states and
state-action pairs for all problem instances of a domain. We then learn an RL
agent in the embedding space, making a near zero-shot transfer possible, i.e.,
without much training on the new instance, and without using the domain
simulator at all. Experiments on three different benchmark domains underscore
the value of our transfer algorithm. Compared against planning from scratch,
and a state-of-the-art RL transfer algorithm, our transfer solution has
significantly superior learning curves.
Comment: To appear at NIPS 201
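The core idea lends itself to a compact illustration. Below is a minimal
PyTorch sketch of a policy that acts through a shared embedding space, so the
policy head can be reused across instances of a domain; the module names,
layer sizes, and freezing strategy are illustrative assumptions, not the
paper's actual architecture.

```python
# Hedged sketch: a shared state encoder plus a policy head, both reused
# across instances of one domain. Names and sizes are assumptions.
import torch
import torch.nn as nn

class SharedEmbeddingPolicy(nn.Module):
    def __init__(self, state_dim: int, embed_dim: int, num_actions: int):
        super().__init__()
        # Encoder maps a factored (symbolic) state into a domain-level
        # embedding space shared by every instance of the domain.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )
        # The policy operates only on embeddings, so it transfers as-is.
        self.policy = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.policy(self.encoder(state))

# Near zero-shot transfer: keep the trained policy head fixed and, if
# needed, fine-tune only the cheap encoder on the new instance.
net = SharedEmbeddingPolicy(state_dim=20, embed_dim=32, num_actions=5)
for p in net.policy.parameters():
    p.requires_grad = False  # freeze what should transfer unchanged
```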
Action Schema Networks: Generalised Policies with Deep Learning
In this paper, we introduce the Action Schema Network (ASNet): a neural
network architecture for learning generalised policies for probabilistic
planning problems. By mimicking the relational structure of planning problems,
ASNets are able to adopt a weight-sharing scheme which allows the network to be
applied to any problem from a given planning domain. This allows the cost of
training the network to be amortised over all problems in that domain. Further,
we propose a training method which balances exploration and supervised training
on small problems to produce a policy which remains robust when evaluated on
larger problems. In experiments, we show that ASNet's learning capability
allows it to significantly outperform traditional non-learning planners in
several challenging domains.
Comment: Accepted to AAAI 201
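A minimal sketch of the weight-sharing scheme: one small scorer per action
schema, applied to every ground action of that schema, so the same parameters
cover any instance size. The flat feature encoding and layer sizes are
assumptions, not the ASNet architecture itself.

```python
# Hedged sketch of schema-level weight sharing: every ground action of
# the same schema is scored by the same small network.
import torch
import torch.nn as nn

class SchemaModule(nn.Module):
    """One shared scorer per action schema (e.g. 'move(?x ?y)')."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, ground_feats: torch.Tensor) -> torch.Tensor:
        # ground_feats: (num_ground_actions, feat_dim); the weights are
        # shared across all ground actions of this schema.
        return self.score(ground_feats).squeeze(-1)

schemas = nn.ModuleDict({"move": SchemaModule(8), "pick": SchemaModule(8)})
# A larger instance just contributes more rows; parameters are unchanged.
logits = torch.cat([schemas["move"](torch.randn(12, 8)),
                    schemas["pick"](torch.randn(30, 8))])
policy = torch.softmax(logits, dim=0)  # distribution over ground actions
```

Because a larger instance only adds rows to the input, the parameter count is
independent of instance size, which is what lets training cost be amortised
over all problems in a domain.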
Autonomous Extracting a Hierarchical Structure of Tasks in Reinforcement Learning and Multi-task Reinforcement Learning
Reinforcement learning (RL), while often powerful, can suffer from slow
learning speeds, particularly in high dimensional spaces. The autonomous
decomposition of tasks and use of hierarchical methods hold the potential to
significantly speed up learning in such domains. This paper proposes a novel
practical method that can autonomously decompose tasks by leveraging
association rule mining, a data-mining technique that discovers hidden
relationships among entities. We introduce a novel method called ARM-HSTRL
(Association Rule Mining to extract Hierarchical Structure of Tasks in
Reinforcement Learning). It extracts temporal and structural relationships
among sub-goals in RL and multi-task RL; in particular, it finds sub-goals
and the relationships among them. We demonstrate the significant efficiency
and performance of the proposed method in two main areas of RL.
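As a rough illustration of the association-rule-mining step, the sketch below
mines frequent sub-goal pairs from successful trajectories and keeps
high-confidence rules. The transaction encoding, thresholds, and rule form
are assumptions, not ARM-HSTRL's exact procedure.

```python
# Hedged sketch: mine frequent sub-goal co-occurrences from successful
# trajectories, in the spirit of association rule mining.
from itertools import combinations
from collections import Counter

def mine_subgoal_rules(trajectories, min_support=0.6, min_conf=0.8):
    """Each trajectory is the set of sub-goals reached before success."""
    n = len(trajectories)
    pair_counts, item_counts = Counter(), Counter()
    for t in trajectories:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n < min_support:
            continue  # the pair is not frequent enough to trust
        for head, body in ((a, b), (b, a)):
            conf = c / item_counts[head]
            if conf >= min_conf:
                rules.append((head, body, c / n, conf))
    return rules

trajs = [{"get_key", "open_door", "reach_exit"},
         {"get_key", "open_door"},
         {"get_key", "reach_exit"}]
print(mine_subgoal_rules(trajs))  # e.g. ('open_door', 'get_key', ...)
```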
Reinforcement Learning for Heterogeneous Teams with PALO Bounds
We introduce reinforcement learning for heterogeneous teams in which rewards
for an agent are additively factored into local costs, stimuli unique to each
agent, and global rewards, which are shared by all agents in the domain. Motivating
domains include coordination of varied robotic platforms, which incur different
costs for the same action, but share an overall goal. We present two templates
for learning in this setting with factored rewards: a generalization of
Perkins' Monte Carlo exploring starts for POMDPs to canonical MPOMDPs, with a
single policy mapping joint observations of all agents to joint actions
(MCES-MP); and another with each agent individually mapping joint observations
to their own action (MCES-FMP). We use probably approximately local optimal
(PALO) bounds to analyze sample complexity, instantiating these templates to
PALO learning. We promote sample efficiency by including a policy space pruning
technique, and evaluate the approaches on three domains of heterogeneous agents
demonstrating that MCES-FMP yields improved policies with fewer samples than
MCES-MP and a previous benchmark.
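For context, here is a minimal tabular sketch of the Monte Carlo
exploring-starts template that the paper generalises. The multi-agent
MCES-MP/MCES-FMP variants, the factored rewards, and the PALO-bound stopping
rule are not reproduced; `env.step(s, a) -> (s2, r, done)` is an assumed
interface.

```python
# Hedged sketch of single-agent Monte Carlo exploring starts.
import random
from collections import defaultdict

def mces(env, states, actions, episodes=10_000, gamma=0.95):
    Q, counts = defaultdict(float), defaultdict(int)
    pi = {s: random.choice(actions) for s in states}
    for _ in range(episodes):
        # Exploring starts: any (state, action) pair may begin an
        # episode, so every pair keeps being evaluated.
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            s2, r, done = env.step(s, a)
            episode.append((s, a, r))
            if not done:
                s, a = s2, pi[s2]
        g = 0.0
        for s, a, r in reversed(episode):   # every-visit return estimates
            g = r + gamma * g
            counts[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]
            pi[s] = max(actions, key=lambda b: Q[(s, b)])  # greedy improve
    return pi, Q
```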
Learning to Cooperate via Policy Search
Cooperative games are those in which both agents share the same payoff
structure. Value-based reinforcement-learning algorithms, such as variants of
Q-learning, have been applied to learning cooperative games, but they only
apply when the game state is completely observable to both agents. Policy
search methods are a reasonable alternative to value-based methods for
partially observable environments. In this paper, we provide a gradient-based
distributed policy-search method for cooperative games and compare the notion
of local optimum to that of Nash equilibrium. We demonstrate the effectiveness
of this method experimentally in a small, partially observable simulated soccer
domain.
Comment: 8 pages, 5 figures
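A minimal sketch of distributed policy search in this spirit: each agent
holds its own softmax policy over local observations and performs an
independent REINFORCE-style update driven by the single shared team reward.
The parameterisation, environment interface, and learning rate are
assumptions, not the paper's exact algorithm.

```python
# Hedged sketch: per-agent policy-gradient updates on a shared payoff.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def cooperative_episode(thetas, observe, step, alpha=0.05, gamma=0.99):
    """thetas[i]: (num_obs, num_actions) logits for agent i; `observe`
    and `step` form an assumed environment interface."""
    grads = [np.zeros_like(t) for t in thetas]
    traj, done = [], False
    while not done:
        obs = observe()                 # obs[i]: agent i's local observation id
        choices = []
        for i, t in enumerate(thetas):
            p = softmax(t[obs[i]])
            a = rng.choice(len(p), p=p)
            choices.append((obs[i], a, p))
        r, done = step([a for _, a, _ in choices])
        traj.append((choices, r))       # one shared team reward
    g = 0.0
    for choices, r in reversed(traj):
        g = r + gamma * g
        for i, (o, a, p) in enumerate(choices):
            grad_log = -p
            grad_log[a] += 1.0          # d log pi(a|o) / d logits
            grads[i][o] += g * grad_log # same return drives every agent
    for t, gr in zip(thetas, grads):
        t += alpha * gr                 # each agent updates locally
```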
Factored Contextual Policy Search with Bayesian Optimization
Scarce data is a major challenge to scaling robot learning to truly complex
tasks, as we need to generalize locally learned policies over different task
contexts. Contextual policy search offers data-efficient learning and
generalization by explicitly conditioning the policy on a parametric context
space. In this paper, we further structure the contextual policy
representation. We propose to factor contexts into two components: target
contexts that describe the task objectives, e.g. target position for throwing a
ball; and environment contexts that characterize the environment, e.g. initial
position or mass of the ball. Our key observation is that experience can be
directly generalized over target contexts. We show that this can be easily
exploited in contextual policy search algorithms. In particular, we apply
factorization to a Bayesian optimization approach to contextual policy search
both in sampling-based and active learning settings. Our simulation results
show faster learning and better generalization in various robotic domains. See
our supplementary video: https://youtu.be/MNTbBAOufDY.
Comment: To appear in ICRA 201
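The key observation admits a small worked example: because a rollout's
recorded outcome can be re-scored against any target context after the fact,
one real rollout yields virtual evaluations for many targets. The quadratic
reward and toy numbers below are assumptions for illustration.

```python
# Hedged sketch of the factorisation idea: outcomes generalise over
# target contexts without new rollouts.
import numpy as np

def reward(outcome, target):
    """Closeness of the achieved landing point to a target position."""
    return -np.sum((outcome - target) ** 2)

# One real rollout: environment context + policy params -> landing point.
env_ctx = {"ball_mass": 0.2}           # held fixed for this rollout
params = np.array([1.3, 0.7])
landing = np.array([2.1, 0.4])         # observed once, from the simulator

# Virtual evaluations: no new rollouts needed for new target contexts.
for target in (np.array([2.0, 0.5]), np.array([1.0, 1.0])):
    print(target, reward(landing, target))
```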
Urban Driving with Multi-Objective Deep Reinforcement Learning
Autonomous driving is a challenging domain that entails multiple aspects: a
vehicle should be able to drive to its destination as fast as possible while
avoiding collision, obeying traffic rules and ensuring the comfort of
passengers. In this paper, we present a deep learning variant of thresholded
lexicographic Q-learning for the task of urban driving. Our multi-objective DQN
agent learns to drive on multi-lane roads and intersections, yielding and
changing lanes according to traffic rules. We also propose an extension for
factored Markov Decision Processes to the DQN architecture that provides
auxiliary features for the Q function. This is shown to significantly improve
data efficiency. We then show that the learned policy is able to zero-shot
transfer to a ring road without sacrificing performance.
Comment: Accepted at AAMAS 201
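A minimal sketch of thresholded lexicographic action selection, assuming one
learned Q-vector per objective in priority order (e.g. safety before rule
compliance); the thresholds and values are illustrative, not the paper's
trained DQN heads.

```python
# Hedged sketch: filter actions objective by objective, highest
# priority first, keeping near-optimal actions at each level.
import numpy as np

def lex_select(q_per_objective, thresholds):
    """q_per_objective: list of (num_actions,) arrays, highest first."""
    candidates = np.arange(len(q_per_objective[0]))
    for q, tau in zip(q_per_objective, thresholds):
        best = q[candidates].max()
        # Keep every action within tau of the best, then let the next
        # (lower-priority) objective break the remaining ties.
        candidates = candidates[q[candidates] >= best - tau]
    return candidates[0]

q_safety = np.array([0.9, 0.88, 0.2])
q_rules  = np.array([0.5, 0.7, 0.9])
print(lex_select([q_safety, q_rules], thresholds=[0.05, 0.0]))  # -> 1
```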
Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
In this work, we propose to apply trust region optimization to deep
reinforcement learning using a recently proposed Kronecker-factored
approximation to the curvature. We extend the framework of natural policy
gradient and propose to optimize both the actor and the critic using
Kronecker-factored approximate curvature (K-FAC) with trust region; hence we
call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To
the best of our knowledge, this is the first scalable trust region natural
gradient method for actor-critic methods. It is also a method that learns
non-trivial continuous-control tasks, as well as discrete control policies,
directly from raw pixel inputs. We tested our approach across discrete domains
in Atari games as well as continuous domains in the MuJoCo environment. With
the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold
improvement in sample efficiency on average, compared to previous
state-of-the-art on-policy actor-critic methods. Code is available at
https://github.com/openai/baselines
Comment: 14 pages, 9 figures; update github repo link
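As a rough sketch of the K-FAC-with-trust-region step for a single linear
layer: the Fisher is approximated by a Kronecker product of input and
pre-activation-gradient covariances, and the natural-gradient step is
rescaled so its approximate KL change stays within a radius delta. The
damping, delta, and one-shot covariance estimates are assumptions; ACKTR
maintains running Kronecker factors per layer.

```python
# Hedged sketch of a K-FAC-style natural-gradient step with trust region.
import numpy as np

def kfac_step(grad_W, acts, pre_grads, delta=0.01, damping=1e-2):
    """grad_W: (out, in) policy gradient; acts: (batch, in) layer inputs;
    pre_grads: (batch, out) gradients w.r.t. pre-activations."""
    A = acts.T @ acts / len(acts) + damping * np.eye(acts.shape[1])
    G = (pre_grads.T @ pre_grads / len(pre_grads)
         + damping * np.eye(pre_grads.shape[1]))
    # Fisher ~ A (x) G, so F^{-1} vec(grad) = vec(G^{-1} grad A^{-1}).
    nat = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
    # Trust region: scale the step so 0.5 * eta^2 * g^T F^{-1} g <= delta.
    quad = float(np.sum(grad_W * nat))          # g^T F^{-1} g
    eta = min(1.0, np.sqrt(2.0 * delta / (quad + 1e-12)))
    return eta * nat

# Toy shapes: a 3-in, 4-out layer with a batch of 64 samples.
step = kfac_step(np.random.randn(4, 3) * 0.1,
                 acts=np.random.randn(64, 3),
                 pre_grads=np.random.randn(64, 4))
```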
Open Problems in Universal Induction & Intelligence
Specialized intelligent systems can be found everywhere: fingerprint,
handwriting, speech, and face recognition, spam filtering, chess and other game
programs, robots, and so on. This decade the first presumably complete mathematical
theory of artificial intelligence based on universal
induction-prediction-decision-action has been proposed. This
information-theoretic approach solidifies the foundations of inductive
inference and artificial intelligence. Getting the foundations right usually
marks a significant progress and maturing of a field. The theory provides a
gold standard and guidance for researchers working on intelligent algorithms.
The roots of universal induction were laid exactly half a century ago, and
the roots of universal intelligence exactly one decade ago, so it is timely to
take stock of what has been achieved and what remains to be done. Since there
are already good recent surveys, I describe the state-of-the-art only in
passing and refer the reader to the literature. This article concentrates on
the open problems in universal induction and its extension to universal
intelligence.
Comment: 32 LaTeX pages
ASNets: Deep Learning for Generalised Planning
In this paper, we discuss the learning of generalised policies for
probabilistic and classical planning problems using Action Schema Networks
(ASNets). The ASNet is a neural network architecture that exploits the
relational structure of (P)PDDL planning problems to learn a common set of
weights that can be applied to any problem in a domain. By mimicking the
actions chosen by a traditional, non-learning planner on a handful of small
problems in a domain, ASNets are able to learn a generalised reactive policy
that can quickly solve much larger instances from the domain. This work extends
the ASNet architecture to make it more expressive, while still remaining
invariant to a range of symmetries that exist in PPDDL problems. We also
present a thorough experimental evaluation of ASNets, including a comparison
with heuristic search planners on seven probabilistic and deterministic
domains, an extended evaluation on over 18,000 Blocksworld instances, and an
ablation study. Finally, we show that sparsity-inducing regularisation can
produce ASNets that are compact enough for humans to understand, yielding
insights into how the structure of ASNets allows them to generalise across a
domain.
Comment: Journal extension of AAAI'18 paper (arXiv:1709.04271)
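As an illustration of sparsity-inducing regularisation in general, an L1
penalty on the weights is one standard way to push many parameters to zero
and obtain compact, more interpretable networks; the coefficient and the
exact regulariser ASNets use are assumptions here, not taken from the paper.

```python
# Hedged sketch: add an L1 weight penalty to the training loss so many
# weights are driven to exactly zero, yielding a compact network.
import torch

def loss_with_sparsity(policy_loss, model, l1_coef=1e-4):
    l1 = sum(p.abs().sum() for p in model.parameters())
    return policy_loss + l1_coef * l1
```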