A cultural algorithm for POMDPs from stochastic inventory control
Abstract. Reinforcement Learning algorithms such as SARSA with an eligibility trace, and Evolutionary Computation methods such as genetic algorithms, are competing approaches to solving Partially Observable Markov Decision Processes (POMDPs) which occur in many fields of Artificial Intelligence. A powerful form of evolutionary algorithm that has not previously been applied to POMDPs is the cultural algorithm, in which evolving agents share knowledge in a belief space that is used to guide their evolution. We describe a cultural algorithm for POMDPs that hybridises SARSA with a noisy genetic algorithm, and inherits the latter’s convergence properties. Its belief space is a common set of state-action values that are updated during genetic exploration, and conversely used to modify chromosomes. We use it to solve problems from stochastic inventory control by finding memoryless policies for nondeterministic POMDPs. Neither SARSA nor the genetic algorithm dominates the other on these problems, but the cultural algorithm outperforms the genetic algorithm, and on highly non-Markovian instances also outperforms SARSA.
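As a rough illustration of the hybrid described in this abstract (the data structures, update rules, and parameter values below are illustrative assumptions, not the authors' exact algorithm), the belief space can be pictured as a shared table of state-action values: SARSA-style updates refine it while each memoryless policy in the genetic population is evaluated, and a cultural "influence" step occasionally rewrites chromosome genes toward the greedy action under those shared values. The sketch assumes a toy environment with reset() returning an observation and step(action) returning (observation, reward, done).

```python
import random
from collections import defaultdict

def sarsa_eval(policy, env, episodes, q, alpha=0.1, gamma=0.95):
    """Evaluate a memoryless policy and refine the shared Q-table (the belief space)."""
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        act = policy[obs]
        while not done:
            nobs, reward, done = env.step(act)
            nact = policy[nobs]
            target = reward + (0.0 if done else gamma * q[(nobs, nact)])
            q[(obs, act)] += alpha * (target - q[(obs, act)])
            total += reward
            obs, act = nobs, nact
    return total / episodes

def influence(policy, q, observations, actions, rate=0.05):
    """Cultural step: nudge chromosome genes toward greedy actions under the shared beliefs."""
    for obs in observations:
        if random.random() < rate:
            policy[obs] = max(actions, key=lambda a: q[(obs, a)])

def cultural_ga(env, observations, actions, pop_size=30, generations=100):
    q = defaultdict(float)  # shared belief space of state-action values
    pop = [{o: random.choice(actions) for o in observations} for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda p: sarsa_eval(p, env, 5, q), reverse=True)
        parents, children = ranked[: pop_size // 2], []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = {o: random.choice([a[o], b[o]]) for o in observations}  # crossover
            if random.random() < 0.2:                                       # mutation
                child[random.choice(observations)] = random.choice(actions)
            influence(child, q, observations, actions)                      # belief-space guidance
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda p: sarsa_eval(p, env, 5, q))
```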
Selective Attention as an Example of Building Representations within Reinforcement Learning
Humans demonstrate an incredible capacity to learn novel tasks in complex dynamic environments. Reinforcement learning (RL) has shown promise as a computational framework for modeling the learning of dynamic tasks in a biologically plausible way. However, the learning performance of RL depends critically on the representation of the task. In the machine learning literature, representations are carefully crafted to capture the structure of the task, whereas humans autonomously construct representations during learning. In this work I present a framework integrating RL with psychological mechanisms of representation learning. One model presented here, Q-ALCOVE, explores how RL can adapt selective attention among stimulus dimensions to construct representations in two different tasks. The model proposes that selective attention can be learned indirectly via internal feedback signals central to RL. I present the results of a behavioral experiment supporting this prediction, as well as modeling work suggesting a broad psychological scope for RL.
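The following is a minimal sketch of the general idea rather than the Q-ALCOVE model itself (the update rule, learning rates, and normalisation are assumptions): the same TD error that drives value learning can also nudge attention weights over stimulus dimensions, so dimensions that help reduce prediction error capture more attention.

```python
import numpy as np

def td_attention_step(stimulus, value_weights, attention, reward, next_value,
                      alpha=0.1, alpha_attn=0.02, gamma=0.95):
    """One TD update where the representation is an attention-weighted stimulus vector."""
    features = attention * stimulus                 # attended representation
    value = float(value_weights @ features)
    td_error = reward + gamma * next_value - value
    value_weights += alpha * td_error * features    # value learning on the attended features
    # Representation learning: move attention along the gradient of the value,
    # scaled by the same internal feedback signal (the TD error).
    attention += alpha_attn * td_error * value_weights * stimulus
    attention = np.clip(attention, 0.0, None)
    attention /= attention.sum() + 1e-8             # keep attention weights normalised
    return value_weights, attention, td_error
```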
Incorporating and Learning Behavior Constraints for Sequential Decision Making
Writing a program that performs well in a complex environment is a challenging task. In such problems, combining deterministic programming with reinforcement learning (RL) can be helpful. However, current systems either force developers to encode knowledge in very specific forms (e.g., state-action features), or assume advanced RL knowledge (e.g., ALISP).
This thesis explores techniques that make it easier for developers, who may not be RL experts, to encode their knowledge in the form of behavior constraints. We begin with the framework of adaptation-based programming (ABP) for writing self-optimizing programs. Next, we show how a certain type of conditional independence, called "influence information", arises naturally in ABP programs. We propose two algorithms for learning reactive policies that are capable of leveraging this knowledge. Using influence information to simplify the credit assignment problem produces significant performance improvements.
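A small sketch of how such a constraint might be consumed (the interface below, in which each adaptive choice point declares the reward channels it can influence, is an assumption made for illustration): credit assignment then simply ignores rewards a choice cannot affect.

```python
import random
from collections import defaultdict

class ChoicePoint:
    """An adaptive choice point that is only credited with rewards it can influence."""

    def __init__(self, name, influenced_channels, epsilon=0.1):
        self.name = name
        self.influences = set(influenced_channels)  # declared influence information
        self.q = defaultdict(float)
        self.epsilon = epsilon
        self.pending = []                           # (context, action) pairs awaiting credit

    def choose(self, context, actions):
        if random.random() < self.epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: self.q[(context, a)])
        self.pending.append((context, action))
        return action

    def credit(self, channel, reward, alpha=0.1):
        """Update value estimates only for reward channels this choice point influences."""
        if channel in self.influences:
            for key in self.pending:
                self.q[key] += alpha * (reward - self.q[key])

    def end_episode(self):
        self.pending.clear()
```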
Next, we turn our attention to problems in which a simulator allows us to replace reactive decision-making with time-bounded search, which often outperforms purely reactive decision-making at significant computational cost. We propose a new type of behavior constraint in the form of partial policies, which restrict behavior to a subset of good actions. Using a partial policy to prune sub-optimal actions reduces the action branching factor, thereby speeding up search. We propose three algorithms for learning partial policies offline, based on reducing the learning problem to i.i.d. supervised learning, and we give a reduction-style analysis for each one. We give concrete implementations using the popular framework of Monte-Carlo tree search. Experiments on challenging problems demonstrate large performance improvements in search-based decision-making generated by the learned partial policies.
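As a rough sketch of the pruning step (the partial-policy interface and the UCB constant below are assumptions, not the thesis's exact implementation), a learned partial policy can filter the candidate actions at each search node before the usual bandit rule chooses among the survivors, shrinking the branching factor.

```python
import math
import random

def uct_select(node, state, partial_policy, c=1.4):
    """Pick an action at a search node, considering only actions the partial policy allows."""
    allowed = list(partial_policy(state))   # learned subset of promising actions
    if not allowed:                         # safety fallback if everything was pruned
        allowed = list(node.actions)
    unvisited = [a for a in allowed if node.counts[a] == 0]
    if unvisited:
        return random.choice(unvisited)
    total = sum(node.counts[a] for a in allowed)
    return max(allowed,
               key=lambda a: node.values[a] / node.counts[a]
               + c * math.sqrt(math.log(total) / node.counts[a]))
```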
Taken together, this thesis outlines a programming framework for injecting different forms of developer knowledge into reactive policy learning algorithms and search-based online planning algorithms. It represents a few small steps towards a programming paradigm that makes it easy to write programs that learn to perform well.
Generic Reinforcement Learning Beyond Small MDPs
Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary.
The current state-of-the-art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependency using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory.
We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN.
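One way to read the value-based cost (the precise objective below is an assumption made for illustration; only its model-free flavour is taken from the abstract) is as an empirical temporal-difference error measured under a candidate history-aggregation map, which requires only logged experience rather than a learned model. A stochastic search over candidate maps would then prefer the maps with the lowest cost.

```python
from collections import defaultdict

def value_based_cost(history, phi, actions, gamma=0.99, alpha=0.1, passes=3):
    """Model-free score for a candidate map phi from history prefixes to MDP states.

    `history` is a list of (observation, action, reward) triples; the cost is the
    mean squared TD error of Q-learning run over the aggregated states, so maps
    that make the process look more Markovian score lower.
    """
    q = defaultdict(float)
    sq_error, updates = 0.0, 0
    for _ in range(passes):
        for t in range(len(history) - 1):
            state = phi(history[: t + 1])           # aggregated state at time t
            _, action, reward = history[t]
            next_state = phi(history[: t + 2])
            best_next = max(q[(next_state, b)] for b in actions)
            td = reward + gamma * best_next - q[(state, action)]
            q[(state, action)] += alpha * td
            sq_error += td * td
            updates += 1
    return sq_error / max(updates, 1)
```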
When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation. However, it needs a good representation in order to perform this systematic exploration. We decouple both by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle in a training phase. The agent must learn an approximate representation of the optimal value function. We formulate a regression-based solution based on online learning methods to build such an agent. We test this agent on the Arcade Learning Environment using a simple class of linear function approximators.
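A minimal sketch of that training-phase setup (the feature map, oracle interface, and plain least-squares update are illustrative assumptions): given an oracle that returns the value of any queried state-action pair, the agent can fit a linear approximator by online regression and then act greedily with respect to it.

```python
import numpy as np

def fit_linear_q(oracle, sample_pair, features, dim, steps=100_000, lr=1e-3):
    """Online regression of a linear Q-function onto oracle-provided values.

    oracle(s, a) returns the true state-action value, features(s, a) a length-`dim`
    vector, and sample_pair() draws state-action pairs to train on.
    """
    w = np.zeros(dim)
    for _ in range(steps):
        s, a = sample_pair()
        x = features(s, a)
        error = oracle(s, a) - float(w @ x)   # regression target is the oracle value
        w += lr * error * x                   # stochastic gradient step on the squared error
    return w

def greedy_action(w, features, state, actions):
    """Act greedily with respect to the learned linear approximation."""
    return max(actions, key=lambda a: float(w @ features(state, a)))
```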
While we made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function, and the need to store an uncompressed history, both of which can be very computationally demanding.
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. …
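To make the "learning what to remember" point concrete, here is a generic sketch of a memory-augmented, REINFORCE-style gradient estimator (it is not one of the thesis's algorithms; the tabular softmax parameterisation, integer-indexed observations, and the Monte-Carlo return are assumptions): the agent keeps a small discrete internal state, and both the action distribution and the internal-state transition are adjusted along the gradient of the episode return.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def episode_gradient(env, theta_act, theta_mem, horizon=200, gamma=0.99):
    """One Monte-Carlo gradient estimate for a policy with discrete internal memory.

    theta_act[m, o] holds action logits given memory m and observation o;
    theta_mem[m, o] holds logits over the next memory state.
    """
    g_act, g_mem = np.zeros_like(theta_act), np.zeros_like(theta_mem)
    memory, obs = 0, env.reset()
    ret, discount, decisions = 0.0, 1.0, []
    for _ in range(horizon):
        p_a = softmax(theta_act[memory, obs])
        action = np.random.choice(len(p_a), p=p_a)
        p_m = softmax(theta_mem[memory, obs])
        next_memory = np.random.choice(len(p_m), p=p_m)
        grad_a = -p_a
        grad_a[action] += 1.0                  # d log pi(action) / d logits
        grad_m = -p_m
        grad_m[next_memory] += 1.0             # d log pi(next memory) / d logits
        decisions.append((memory, obs, grad_a, grad_m))
        obs, reward, done = env.step(action)
        ret += discount * reward
        discount *= gamma
        memory = next_memory
        if done:
            break
    for m, o, ga, gm in decisions:             # credit every decision with the episode return
        g_act[m, o] += ret * ga
        g_mem[m, o] += ret * gm
    return g_act, g_mem, ret
```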