Feature reinforcement learning: state of the art
Feature reinforcement learning was introduced five years ago as a principled and practical approach to history-based learning. This paper examines the progress since its inception. We now have both model-based and model-free cost functions, most recently extended to the function approximation setting. Our current work is geared towards playing ATARI games using imitation learning, where we use Feature RL as a feature selection method for high-dimensional domains.
Feature reinforcement learning using looping suffix trees
There has recently been much interest in history-based methods using suffix trees to
solve POMDPs. However, these suffix trees cannot efficiently represent environments that
have long-term dependencies. We extend the recently introduced CTΦMDP algorithm to
the space of looping suffix trees which have previously only been used in solving deterministic
POMDPs. The resulting algorithm replicates results from CTΦMDP for environments
with short-term dependencies, while it outperforms LSTM-based methods on TMaze, a
deep memory environment.
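As a hedged illustration of the representational idea (not the CTΦMDP algorithm itself), a looping suffix tree can be sketched as a tree walked backwards over the history, where a self-loop on a node absorbs repeated corridor observations. All class names and the toy TMaze encoding below are invented for the example.

```python
# Illustrative sketch of a looping suffix tree as a history-to-state map.
# This is not the paper's algorithm; names and the toy TMaze encoding
# are invented for the example.

class Node:
    def __init__(self, state=None, loop=None):
        self.children = {}   # observation symbol -> child Node
        self.state = state   # MDP state label if this node decides one
        self.loop = loop     # symbol absorbed by a self-loop, if any

def map_history(root, history):
    """Walk the tree over the history from the most recent symbol backwards."""
    node = root
    for sym in reversed(history):
        if node.loop == sym:             # self-loop: skip corridor symbols
            continue
        if sym not in node.children:
            break
        node = node.children[sym]
        if node.state is not None:       # reached a state-labelled leaf
            return node.state
    return node.state

# Toy TMaze: cue 'L' or 'R' at the start, a corridor of 'c' observations of
# arbitrary length, then the junction observation 'j'. The self-loop on 'c'
# is what lets the tree recall the cue regardless of corridor length --
# exactly the long-term dependency a plain suffix tree cannot represent
# compactly.
root = Node()
junction = Node(loop='c')
junction.children['L'] = Node(state='go-left')
junction.children['R'] = Node(state='go-right')
root.children['j'] = junction
```

For instance, `map_history(root, list('Lcccccj'))` returns `'go-left'` however long the corridor is, since every `'c'` is absorbed by the loop before the cue is read.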
Generic Reinforcement Learning Beyond Small MDPs
Feature reinforcement learning (FRL) is a framework within which
an agent can automatically
reduce a complex environment to a Markov Decision Process (MDP)
by finding a map which
aggregates similar histories into the states of an MDP. The
primary motivation behind this
thesis is to build FRL agents that work in practice, both for
larger environments and larger
classes of environments. We focus on empirical work targeted at
practitioners in the field of
general reinforcement learning, with theoretical results wherever
necessary.
The current state-of-the-art in FRL uses suffix trees which have
issues with large observation
spaces and long-term dependencies. We start by addressing the
issue of long-term dependency
using a class of maps known as looping suffix trees, which have
previously been used to
represent deterministic POMDPs. We show the best existing results
on the TMaze domain
and good results on larger domains that require long-term
memory.
We introduce a new value-based cost function that can be
evaluated model-free. The value-
based cost allows for smaller representations, and its model-free
nature allows for its extension
to the function approximation setting, which has computational
and representational advantages for large state spaces. We
evaluate the performance of this new cost in both the tabular and
function approximation settings on a variety of domains, and show
performance better than the state-of-the-art algorithm
MC-AIXI-CTW on the domain POCMAN.
When the environment is very large, an FRL agent needs to explore
systematically in order to
find a good representation. However, it needs a good
representation in order to perform this
systematic exploration. We decouple both by considering a
different setting, one where the
agent has access to the value of any state-action pair from an
oracle in a training phase. The
agent must learn an approximate representation of the optimal
value function. We formulate
a regression-based solution based on online learning methods to
build such an agent. We
test this agent on the Arcade Learning Environment using a simple
class of linear function
approximators.
While we made progress on the issue of scalability, two major
issues with the FRL framework
remain: the need for a stochastic search method to minimise the
objective function and the
need to store an uncompressed history, both of which can be very
computationally demanding.
Q-learning for history-based reinforcement learning
We extend the Q-learning algorithm from the Markov Decision Process
setting to problems where observations are non-Markov and do not
reveal the full state of the world, i.e. to POMDPs. We do this in a
natural manner by adding l0 regularisation to the pathwise squared
Q-learning objective function and then optimise this over both a
choice of map from history to states and the resulting MDP
parameters. The optimisation procedure involves a stochastic search
over the map class nested with classical Q-learning of the
parameters. This algorithm fits perfectly into the feature
reinforcement learning framework, which chooses maps based on a
cost criterion. The cost criterion used so far for feature
reinforcement learning has been model-based and aimed at predicting
future states and rewards. Instead we directly predict the return,
which is what is needed for choosing optimal actions. Our
Q-learning criterion also lends itself immediately to a function
approximation setting where features are chosen based on the
history. This algorithm is somewhat similar to the recent line of
work on lasso temporal difference learning which aims at finding a
small feature set with which one can perform policy evaluation. The
distinction is that we aim directly for learning the Q-function of
the optimal policy and we use l0 instead of l1 regularisation. We
perform an experimental evaluation on classical benchmark domains
and find improvement in convergence speed as well as in economy of
the state representation. We also compare against MC-AIXI on the
large Pocman domain and achieve competitive performance in average
reward. We use less than half the CPU time and 36 times less
memory. Overall, our algorithm hQL provides a better combination of
computational, memory and data efficiency than existing algorithms in
this setting.
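A minimal sketch of the value-based cost idea described above: score a candidate map by the pathwise squared Q-learning error of the MDP it induces, plus an l0-style penalty on the number of states the map actually uses. The function name, the exact update, and the form of the penalty are illustrative, not the paper's precise objective.

```python
def q_cost(phi, trajectory, actions, alpha=0.5, gamma=0.9, lam=1.0):
    """Illustrative value-based, model-free cost for a candidate map phi.

    phi        : function from a history (list of (action, obs, reward)
                 steps) to a state identifier
    trajectory : list of (action, obs, reward) steps
    Returns the accumulated squared TD error plus an l0 penalty
    (lam times the number of distinct states phi uses on this history).
    """
    Q = {}
    sq_err = 0.0
    for t in range(1, len(trajectory)):
        s = phi(trajectory[:t])                  # state before step t
        a, _, r = trajectory[t]                  # action taken, reward seen
        s2 = phi(trajectory[:t + 1])             # successor state
        target = r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
        td = target - Q.get((s, a), 0.0)
        sq_err += td * td
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td   # classical Q-learning
    used = {phi(trajectory[:t]) for t in range(1, len(trajectory) + 1)}
    return sq_err + lam * len(used)              # l0 term: states actually used
```

The nested optimisation the abstract describes would then run a stochastic search over the map class, comparing candidate maps by this cost while Q-learning estimates the inner parameters.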
Reinforcement learning with value advice
The problem we consider in this paper is reinforcement learning with value advice. In this setting, the agent is given limited access to an oracle that can tell it the expected return (value) of any state-action pair with respect to the optimal policy. The agent must use this value to learn an explicit policy that performs well in the environment. We provide an algorithm called RLAdvice, based on the imitation learning algorithm DAgger. We illustrate the effectiveness of this method in the Arcade Learning Environment on three different games, using value estimates from UCT as advice.
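The DAgger-style loop with value advice can be sketched as below. This is an illustrative skeleton, not the paper's RLAdvice: the tabular "fit" stands in for the regression/classification step, and every name and the toy environment in the usage are invented.

```python
import random

def rl_with_value_advice(env_reset, env_step, oracle_q, actions,
                         iters=3, horizon=20, seed=0):
    """DAgger-style training with value advice (illustrative sketch):
    roll out the current greedy policy, label every visited state with
    the oracle's Q-values, aggregate, and refit the predictor on the
    aggregated dataset."""
    rng = random.Random(seed)
    dataset = {}                  # state -> dict of oracle Q-values
    q_hat = {}                    # current learned predictor (tabular here)

    def policy(s):
        qs = q_hat.get(s)
        if qs is None:
            return rng.choice(actions)        # unseen state: act randomly
        return max(actions, key=lambda a: qs[a])

    for _ in range(iters):
        s = env_reset()
        for _ in range(horizon):
            dataset[s] = {a: oracle_q(s, a) for a in actions}  # query advice
            s, done = env_step(s, policy(s))
            if done:
                break
        q_hat = dict(dataset)     # "fit" = memorise (stands in for regression)
    return policy
```

Aggregating labels along the learner's own trajectories, rather than the oracle's, is the DAgger ingredient: it keeps the training distribution matched to the states the learned policy actually visits.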
Feature Reinforcement Learning: State of the Art
Abstract: Feature reinforcement learning was introduced five years ago as a principled and practical approach to history-based learning. This paper examines the progress since its inception. We now have both model-based and model-free cost functions, most recently extended to the function approximation setting. Our current work is geared towards playing ATARI games using imitation learning, where we use Feature RL as a feature selection method for high-dimensional domains.

This paper is a brief summary of the progress so far in the Feature Reinforcement Learning framework (FRL) (Hutter 2009a), along with a small section on current research. FRL focuses on the general reinforcement learning problem, where an agent interacts with an environment in cycles of action, observation, and reward. The goal of the agent is to maximise an aggregation of the reward. The most traditional form of this general problem constrains the observations (and rewards) to be states which satisfy the Markov property, i.e. P(o_t | o_{1:t-1}) = P(o_t | o_{t-1}), and is called a Markov Decision Process (MDP).

Feature Reinforcement Learning (Hutter 2009a) is one way of dealing with the general RL problem, by reducing it to an MDP. It aims to construct a map from the history of an agent, which is its action-observation-reward cycles so far, to an MDP state. Traditional RL methods can then be used on the derived MDP to form a policy (a mapping from these states to actions). FRL fits in the category of history-based approaches. U-tree (McCallum 1996) is a different example of the history-based approach, which uses a tree-based representation of the value function where nodes are split based on a local criterion. The cost in FRL is global: maps are accepted or rejected based on an evaluation of the whole map. While the idea behind FRL is simple, there are several choices to be made. What space do we draw the maps from, and how do we pick the one that fits our data so far?
In the best case, we would like to choose a map φ from the space of all possible (computable) functions on histories, but this is intractable in practice, and the choice of a smaller hypothesis class can encode useful knowledge and improve learning speed. We define a cost function that ideally measures how well φ maps the process to an MDP. The problem of searching through the map class for the best map φ* is addressed via a stochastic search method.

Taking a step back from the history-based learning problem, we can frame the general RL problem as trying to find a map from a very high-dimensional input space, namely that of all possible histories, to a policy representation that allows us to perform well in the given environment. This policy representation is often in the form of a value function, but it does not have to be.

(Figure: the model-based feature RL framework)

Note that this representation of a general RL problem as a problem in a very high-dimensional input space allows us to use feature RL in the traditional learning setting, for feature selection in function approximation problems. Instead of features of the history, our features are now those of the MDP state. The cost function now selects for the smallest subset of features that can represent our model or the value function. Our current work is on using the value-based cost, in both the off-policy and the on-policy setting, to deal with domains within the scope of the Arcade Learning Environment (Bellemare et al. 2013).

The outline of this paper is as follows. Section 1 outlines some notation and relevant background, Section 2 deals with some related work, and Section 3 looks at the cost functions that have been examined in the FRL setting so far and summarises the successes of the method. We conclude in Section 4.

Preliminaries. Agent-Environment Framework. The notation and framework is taken fro
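The framework leaves the stochastic search method open. One simple instantiation, sketched here over an explicit finite list of candidate maps (real map classes are searched by local edits to the current map, and none of the names below come from the paper), is a simulated-annealing-style loop that compares candidates by their cost:

```python
import math
import random

def search_maps(candidates, cost, steps=200, temp=1.0, decay=0.99, seed=0):
    """Toy stochastic search over a finite map class, comparing maps by
    cost. Simulated annealing is one common, illustrative choice; FRL
    itself does not prescribe the search method."""
    rng = random.Random(seed)
    current = rng.choice(candidates)
    c_cur = cost(current)
    best, c_best = current, c_cur
    for _ in range(steps):
        prop = rng.choice(candidates)               # propose another map
        c_prop = cost(prop)
        # accept downhill moves always, uphill moves with annealed probability
        if c_prop <= c_cur or rng.random() < math.exp((c_cur - c_prop) / temp):
            current, c_cur = prop, c_prop
            if c_cur < c_best:
                best, c_best = current, c_cur
        temp = max(temp * decay, 1e-6)              # cool the temperature
    return best
```

With a fixed seed the search is deterministic, which is convenient when comparing cost functions rather than search heuristics.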