Planning with neural networks and reinforcement learning

Abstract

This thesis presents the design, implementation, and investigation of predictive-planning controllers built with neural networks and inspired by the Dyna-PI architectures (Sutton, 1990). Dyna-PI architectures are planning systems based on actor-critic reinforcement learning methods and a model of the environment. The controllers are tested with a simulated robot that solves a stochastic path-finding landmark navigation task. A critical review of ideas and models proposed in the literature on problem solving, planning, reinforcement learning, and neural networks precedes the presentation of the controllers. The review isolates ideas relevant to the design of planners based on neural networks. A "neural forward planner" is implemented that, unlike the Dyna-PI architectures, is taskable in a strong sense. This planner is capable of building a "partial policy" focussed around efficient start-goal paths, and of deciding to re-plan when "unexpected" states are encountered. Planning iteratively generates "chains of predictions" starting from the current state and using the model of the environment. This model consists of neural networks trained to predict the next input when an action is executed. A "neural bidirectional planner" that generates trajectories backward from the goal and forward from the current state is also implemented. This planner exploits the knowledge (image) of the goal, further focuses planning around efficient start-goal paths, and yields faster updating of evaluations. In several experiments the generalisation capacity of neural networks proves important for learning, but it also causes problems of interference. To deal with these problems a modular neural architecture is implemented that uses a mixture-of-experts network for the critic and a simple hierarchical modular network for the actor. The research also implements a simple form of neural abstract planning named "coarse planning", and investigates its strengths in terms of exploration and the updating of evaluations. Some experiments with coarse planning and with the other controllers suggest that discounted reinforcement learning may have problems dealing with long-lasting tasks.
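To make the Dyna-style mechanism described above concrete, the following is a minimal sketch (not the thesis code) of a forward planner: an actor-critic learner that, between real steps, unrolls "chains of predictions" with a learned forward model and updates its evaluations on the simulated transitions. All names (ForwardModel, ActorCritic, plan) and the use of simple linear/tabular approximators in place of the thesis's neural networks are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 16, 4          # toy discrete state/action spaces
GAMMA, ALPHA, PLAN_DEPTH = 0.95, 0.1, 5

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

class ForwardModel:
    """Predicts the next state from (state, action); here a linear map trained
    by the delta rule, standing in for the thesis's predictor networks."""
    def __init__(self):
        self.W = np.zeros((N_STATES, N_STATES * N_ACTIONS))

    def _phi(self, s, a):
        return np.kron(one_hot(a, N_ACTIONS), one_hot(s, N_STATES))

    def predict(self, s, a):
        return int(np.argmax(self.W @ self._phi(s, a)))

    def update(self, s, a, s_next):
        x = self._phi(s, a)
        err = one_hot(s_next, N_STATES) - self.W @ x
        self.W += ALPHA * np.outer(err, x)

class ActorCritic:
    """Actor-critic used for both real and simulated (planned) steps."""
    def __init__(self):
        self.V = np.zeros(N_STATES)                    # critic: state evaluations
        self.prefs = np.zeros((N_STATES, N_ACTIONS))   # actor: action preferences

    def act(self, s):
        p = np.exp(self.prefs[s] - self.prefs[s].max())
        return int(rng.choice(N_ACTIONS, p=p / p.sum()))

    def update(self, s, a, r, s_next):
        td = r + GAMMA * self.V[s_next] - self.V[s]    # TD error drives both parts
        self.V[s] += ALPHA * td
        self.prefs[s, a] += ALPHA * td

def plan(agent, model, start, goal, depth=PLAN_DEPTH):
    """Forward planning: simulate a chain of predictions from the current state,
    updating the actor-critic on predicted transitions instead of real ones."""
    s = start
    for _ in range(depth):
        a = agent.act(s)
        s_next = model.predict(s, a)
        r = 1.0 if s_next == goal else 0.0             # reward derived from the goal image
        agent.update(s, a, r, s_next)
        if s_next == goal:
            break
        s = s_next

In use, a planner of this kind would interleave real steps in the environment (which train both the forward model and the actor-critic) with these simulated planning chains, and could trigger re-planning whenever an observed state diverges from the predicted one.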
