Deep Policies for Width-Based Planning in Pixel Domains
Width-based planning has demonstrated great success in recent years due to
its ability to scale independently of the size of the state space. For example,
Bandres et al. (2018) introduced a rollout version of the Iterated Width
algorithm whose performance compares well with humans and learning methods in
the pixel setting of the Atari games suite. In this setting, planning is done
on-line using the "screen" states and selecting actions by looking ahead into
the future. However, this algorithm is purely exploratory and does not leverage
past reward information. Furthermore, it requires the state to be factored into
features that need to be pre-defined for the particular task, e.g., the B-PROST
pixel features. In this work, we extend width-based planning by incorporating
an explicit policy in the action selection mechanism. Our method, called
π-IW, interleaves width-based planning and policy learning using the
state-actions visited by the planner. The policy estimate takes the form of a
neural network and is in turn used to guide the planning step, thus reinforcing
promising paths. Surprisingly, we observe that the representation learned by
the neural network can be used as a feature space for the width-based planner
without degrading its performance, thus removing the requirement of pre-defined
features for the planner. We compare π-IW with previous width-based methods
and with AlphaZero, a method that also interleaves planning and learning, in
simple environments, and show that π-IW has superior performance. We also
show that the π-IW algorithm outperforms previous width-based methods in the
pixel setting of the Atari games suite.
Comment: In Proceedings of the 29th International Conference on Automated
Planning and Scheduling (ICAPS 2019). arXiv admin note: text overlap with
arXiv:1806.0589
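The abstract above builds on the novelty test at the core of width-based planning. As a rough illustration only, the sketch below shows plain breadth-first IW(1), not the rollout variant or π-IW itself: a state is kept only if it makes some (feature index, value) pair true for the first time. The function names and the toy grid domain are assumptions made for this example and do not come from the paper.

```python
from collections import deque


def iw1(initial_state, successors, features, is_goal, max_nodes=10_000):
    """Breadth-first IW(1): prune states that add no new (index, value) atom."""
    seen_atoms = set()

    def is_novel(state):
        # A state is novel if at least one of its atoms was never seen before.
        atoms = {(i, v) for i, v in enumerate(features(state))}
        new_atoms = atoms - seen_atoms
        seen_atoms.update(new_atoms)
        return bool(new_atoms)

    is_novel(initial_state)  # register the atoms of the initial state
    queue = deque([(initial_state, [])])
    expanded = 0
    while queue and expanded < max_nodes:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        expanded += 1
        for action, nxt in successors(state):
            if is_novel(nxt):
                queue.append((nxt, plan + [action]))
    return None  # no goal state survived the novelty pruning


# Hypothetical toy domain: a 4x4 grid where the agent moves right or up.
def grid_successors(state, size=4):
    x, y = state
    moves = []
    if x + 1 < size:
        moves.append(("right", (x + 1, y)))
    if y + 1 < size:
        moves.append(("up", (x, y + 1)))
    return moves


def grid_features(state):
    # The features are simply the two coordinates.
    return state
```

Calling `iw1((0, 0), grid_successors, grid_features, lambda s: s[0] == 3)` finds a plan, whereas the goal `s == (3, 3)` is pruned away because reaching it requires a new combination of two features. This is the kind of dependence on the chosen feature space that the abstract refers to.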
Learning Classical Planning Strategies with Policy Gradient
A common paradigm in classical planning is heuristic forward search. Forward
search planners often rely on simple best-first search which remains fixed
throughout the search process. In this paper, we introduce a novel search
framework capable of alternating between several forward search approaches
while solving a particular planning problem. Selection of the approach is
performed using a trainable stochastic policy, mapping the state of the search
to a probability distribution over the approaches. This enables using policy
gradient to learn search strategies tailored to a specific distribution of
planning problems and a selected performance metric, e.g. the IPC score. We
instantiate the framework by constructing a policy space consisting of five
search approaches and a two-dimensional representation of the planner's state.
Then, we train the system on randomly generated problems from five IPC domains
using three different performance metrics. Our experimental results show that
the learner is able to discover domain-specific search strategies, improving
the planner's performance relative to the baselines of plain best-first search
and a uniform policy.
Comment: Accepted for ICAPS 201
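The idea of a trainable stochastic policy over search approaches can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a linear softmax policy over a small feature vector describing the search state and a basic REINFORCE update with a single episode return. The class name, learning rate, and feature values are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ApproachPolicy:
    """Linear softmax policy: search-state features -> distribution over approaches."""

    def __init__(self, n_approaches, n_features):
        self.w = [[0.0] * n_features for _ in range(n_approaches)]

    def probs(self, x):
        logits = [sum(wj * xj for wj, xj in zip(row, x)) for row in self.w]
        return softmax(logits)

    def sample(self, x, rng=random):
        # Draw the index of the search approach to run next.
        return rng.choices(range(len(self.w)), weights=self.probs(x))[0]

    def reinforce_update(self, episode, episode_return, lr=0.1):
        """episode: list of (features, chosen_approach) pairs from one planner run."""
        grad = [[0.0] * len(row) for row in self.w]
        for x, a in episode:
            p = self.probs(x)
            for k in range(len(self.w)):
                # d log pi(a|x) / d w_k = (1[k == a] - p_k) * x
                coeff = (1.0 if k == a else 0.0) - p[k]
                for j, xj in enumerate(x):
                    grad[k][j] += coeff * xj
        # Apply the accumulated gradient, scaled by the return of the run.
        for k, row in enumerate(self.w):
            for j in range(len(row)):
                row[j] += lr * episode_return * grad[k][j]
```

With five approaches and a two-dimensional state, as in the abstract, a run that chose approach 2 and obtained a positive return (for example, a score on the chosen performance metric) raises the probability of approach 2 in similar search states.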