    Optimal Rewards versus Leaf-Evaluation Heuristics in Planning Agents

    Planning agents often lack the computational resources needed to build full planning trees for their environments. Agent designers commonly compensate for the resulting finite-horizon approximation by applying an evaluation function at the leaf states of the planning tree. Recent work has proposed an alternative approach to overcoming computational constraints on agent design: modify the reward function. In this work, we compare this reward design approach to the common leaf-evaluation heuristic approach for improving planning agents. We show that in many agents the reward design approach strictly subsumes the leaf-evaluation approach, i.e., for every leaf-evaluation heuristic there exists a reward function that leads to equivalent behavior, but the converse is not true. We demonstrate that this generality leads to improved performance when an agent makes approximations in addition to the finite-horizon approximation. As part of our contribution, we extend PGRD, an online reward design algorithm, to develop reward design algorithms for Sparse Sampling and UCT, two algorithms capable of planning in large state spaces.
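
    A minimal sketch of the subsumption direction, in Python on a toy deterministic chain MDP (the environment, the heuristic h, and all names below are illustrative assumptions, not the paper's PGRD-based construction): any leaf-evaluation heuristic can be reproduced as a potential-based reward bonus, so planning with the designed reward and no leaf evaluation picks the same root action as planning with the original reward plus the heuristic at the leaves.

    def plan(state, depth, step, reward, gamma, actions, leaf_value=lambda s: 0.0):
        """Depth-limited lookahead; returns (value, best root action)."""
        if depth == 0:
            return leaf_value(state), None
        best_v, best_a = float("-inf"), None
        for a in actions:
            s2 = step(state, a)
            v, _ = plan(s2, depth - 1, step, reward, gamma, actions, leaf_value)
            q = reward(state, a, s2) + gamma * v
            if q > best_v:
                best_v, best_a = q, a
        return best_v, best_a

    # Toy deterministic chain: walk left/right on the integers, reward only at state 5.
    actions = [-1, +1]
    step = lambda s, a: s + a
    base_r = lambda s, a, s2: 1.0 if s2 == 5 else 0.0
    h = lambda s: -abs(5 - s)          # leaf heuristic: negative distance to the goal
    gamma = 0.9

    # (a) true reward, heuristic h applied at the leaves of a depth-3 tree
    _, a_leaf = plan(0, 3, step, base_r, gamma, actions, leaf_value=h)

    # (b) designed (potential-shaped) reward, no leaf evaluation at all
    shaped_r = lambda s, a, s2: base_r(s, a, s2) + gamma * h(s2) - h(s)
    _, a_designed = plan(0, 3, step, shaped_r, gamma, actions)

    assert a_leaf == a_designed        # same greedy choice at the root

    Over a depth-d trajectory the designed reward telescopes to gamma^d * h(s_d) - h(s_0), which differs from the leaf-evaluated value only by the constant h(s_0) and therefore cannot change the argmax at the root.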

    Deep Learning and Reward Design for Reinforcement Learning

    One of the fundamental problems in Artificial Intelligence is sequential decision making in a flexible environment. Reinforcement Learning (RL) gives a set of tools for solving sequential decision problems. Although the theory of RL addresses a general class of learning problems with a constructive mathematical formulation, the challenges posed by the interaction of rich perception and delayed rewards in many domains remain a significant barrier to the widespread applicability of RL methods. The rich perception problem itself has two components: 1) the sensors at any time step do not capture all the information in the history of observations, leading to partial observability, and 2) the sensors provide very high-dimensional observations, such as images and natural language, that introduce computational and sample-complexity challenges for the representation and generalization problems in policy selection. The delayed reward problem, namely that the effect of actions on future rewards is delayed in time, makes it hard to determine how to credit action sequences for reward outcomes. This dissertation offers a set of contributions that adapt the hierarchical representation learning power of deep learning to address rich perception in vision and text domains, and develop new reward design algorithms to address delayed rewards. The first contribution is a new learning method for deep neural networks in vision-based real-time control. The method distills slow Monte Carlo Tree Search (MCTS) policies into fast convolutional neural networks that outperform the conventional Deep Q-Network. The second contribution is a new end-to-end reward design algorithm that mitigates delayed rewards for the state-of-the-art MCTS method. The reward design algorithm converts visual perceptions into reward bonuses via deep neural networks, and optimizes the network weights to improve the performance of MCTS end-to-end via policy gradient. The third contribution extends the existing policy-gradient reward design method from a single task to multiple tasks: reward bonuses learned from old tasks are transferred to new tasks to facilitate learning. The final contribution is an application of deep reinforcement learning to another type of rich perception, ambiguous text. A synthetic data set is proposed to evaluate the querying, reasoning, and question-answering abilities of RL agents, and a deep memory network architecture is applied to solve these challenging problems to substantial degrees.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/136931/1/guoxiao_1.pd
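
    As a rough illustration of the first contribution (distilling a slow planner into a fast network), the Python sketch below trains a policy network with a cross-entropy loss on observations labelled by the actions an MCTS planner chose; the layer sizes, variable names, and the use of a small feed-forward network in place of the dissertation's convolutional architecture are assumptions made for brevity.

    import torch
    import torch.nn as nn

    N_ACTIONS, OBS_DIM = 6, 128             # hypothetical sizes

    # Stand-in for the fast policy network (a small MLP instead of a conv net).
    fast_policy = nn.Sequential(
        nn.Linear(OBS_DIM, 256), nn.ReLU(),
        nn.Linear(256, N_ACTIONS),
    )
    optimizer = torch.optim.Adam(fast_policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def distill_step(obs_batch, mcts_actions):
        """One supervised update: imitate the action the slow MCTS planner chose."""
        logits = fast_policy(obs_batch)        # shape (B, N_ACTIONS)
        loss = loss_fn(logits, mcts_actions)   # integer action labels, shape (B,)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Dummy batch of planner-labelled observations.
    obs = torch.randn(32, OBS_DIM)
    acts = torch.randint(0, N_ACTIONS, (32,))
    print(distill_step(obs, acts))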

    Reinforcement learning with supervision beyond environmental rewards

    Reinforcement Learning (RL) is an elegant approach to tackling sequential decision-making problems. In the standard setting, the task designer curates a reward function and the RL agent's objective is to take actions in the environment such that the long-term cumulative reward is maximized. Deep RL algorithms, which combine RL principles with deep neural networks, have been successfully used to learn behaviors in complex environments but are generally quite sensitive to the nature of the reward function. For a given RL problem, the environmental rewards could be sparse, delayed, misspecified, or unavailable (i.e., impossible to define mathematically for the required behavior). These scenarios exacerbate the challenge of training a stable deep-RL agent in a sample-efficient manner. In this thesis, we study methods that go beyond a direct reliance on the environmental rewards by generating additional information signals that the RL agent can incorporate to learn the desired skills. We start by investigating the performance bottlenecks in delayed-reward environments and propose to address them by learning surrogate rewards; we present two methods for computing these surrogate rewards from agent-environment interaction data. Then, we consider the imitation-learning (IL) setting, in which we do not have access to any rewards but are instead provided with a dataset of expert demonstrations that the RL agent must learn to reliably reproduce. We propose IL algorithms for partially observable environments and for situations with discrepancies between the transition dynamics of the expert and the imitator. Next, we consider the benefits of learning an ensemble of RL agents with an explicit diversity pressure, and show that diversity encourages exploration and facilitates the discovery of sparse environmental rewards. Finally, we analyze the concept of sharing knowledge between RL agents operating in different but related environments and show that this information transfer can accelerate learning.
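
    To make the surrogate-reward idea concrete, the Python sketch below shows one simple way (an assumption for illustration, not one of the thesis' two methods) to convert a delayed episodic return into per-step rewards: regress the return onto summed per-step features with least squares and treat each step's predicted contribution as its surrogate reward.

    import numpy as np

    rng = np.random.default_rng(0)

    # Fake interaction data: E episodes of T steps with D-dim step features,
    # and only a single delayed return observed at the end of each episode.
    E, T, D = 200, 50, 8
    phi = rng.normal(size=(E, T, D))              # per-step features phi(s_t, a_t)
    true_w = rng.normal(size=D)
    returns = (phi @ true_w).sum(axis=1)          # delayed episodic returns

    # Least squares: find w so that sum_t phi_t . w matches each episode's return.
    X = phi.sum(axis=1)                           # (E, D) summed features
    w, *_ = np.linalg.lstsq(X, returns, rcond=None)

    # Per-step surrogate rewards for a new episode: credit is now assigned per step.
    new_episode = rng.normal(size=(T, D))
    surrogate_r = new_episode @ w                 # shape (T,)
    print(surrogate_r[:5])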