Count-Based Exploration in Feature Space for Reinforcement Learning
We introduce a new count-based optimistic exploration algorithm for
Reinforcement Learning (RL) that is feasible in environments with
high-dimensional state-action spaces. The success of RL algorithms in these
domains depends crucially on generalisation from limited training experience.
Function approximation techniques enable RL agents to generalise in order to
estimate the value of unvisited states, but at present few methods enable
generalisation regarding uncertainty. This has prevented the combination of
scalable RL algorithms with efficient exploration strategies that drive the
agent to reduce its uncertainty. We present a new method for computing a
generalised state visit-count, which allows the agent to estimate the
uncertainty associated with any state. Our \phi-pseudocount achieves
generalisation by exploiting the same feature representation of the state space
that is used for value function approximation. States that have less frequently
observed features are deemed more uncertain. The \phi-Exploration-Bonus
algorithm rewards the agent for exploring in feature space rather than in the
untransformed state space. The method is simpler and less computationally
expensive than some previous proposals, and achieves near state-of-the-art
results on high-dimensional RL benchmarks.
Comment: Conference: Twenty-sixth International Joint Conference on Artificial
Intelligence (IJCAI-17), 8 pages, 1 figure
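The idea of counting visits in feature space can be illustrated with a minimal
sketch. The abstract does not give the exact formulas, so everything below is
an assumption made for illustration: binary feature vectors, a density model
that factorises over features with a Krichevsky-Trofimov estimate per feature,
the pseudo-count formula rho(1 - rho') / (rho' - rho) from the density-model
exploration literature, and a bonus of beta / sqrt(pseudo-count). The class
name `FeatureCountBonus` and the default `beta` are hypothetical.

```python
import math


class FeatureCountBonus:
    """Illustrative exploration bonus from counts over binary features."""

    def __init__(self, n_features, beta=0.05):
        self.n_features = n_features
        self.beta = beta              # bonus scale (hypothetical default)
        self.t = 0                    # number of feature vectors observed
        self.ones = [0] * n_features  # how often each feature was active

    def _density(self, phi, ones, t):
        # Assumed model: product of per-feature Krichevsky-Trofimov estimates.
        p = 1.0
        for i, bit in enumerate(phi):
            count = ones[i] if bit else t - ones[i]
            p *= (count + 0.5) / (t + 1.0)
        return p

    def pseudocount(self, phi):
        # Density before, and density after one hypothetical extra visit.
        rho = self._density(phi, self.ones, self.t)
        ones_after = [c + b for c, b in zip(self.ones, phi)]
        rho_next = self._density(phi, ones_after, self.t + 1)
        return rho * (1.0 - rho_next) / (rho_next - rho)

    def bonus(self, phi):
        # Rarely seen feature combinations get a larger bonus.
        return self.beta / math.sqrt(self.pseudocount(phi))

    def update(self, phi):
        # Record one real visit to a state with feature vector phi.
        self.t += 1
        for i, bit in enumerate(phi):
            self.ones[i] += bit


model = FeatureCountBonus(n_features=2)
for _ in range(20):
    model.update([1, 0])
familiar = model.bonus([1, 0])  # features seen many times
novel = model.bonus([0, 1])     # features never seen in this configuration
```

In this sketch the bonus would be added to the environment reward before the
value update, so states whose features have been observed less often look more
attractive to the agent.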
Generic Reinforcement Learning Beyond Small MDPs
Feature reinforcement learning (FRL) is a framework within which an agent can
automatically reduce a complex environment to a Markov Decision Process (MDP)
by finding a map which aggregates similar histories into the states of an MDP.
The primary motivation behind this thesis is to build FRL agents that work in
practice, both for larger environments and larger classes of environments. We
focus on empirical work targeted at practitioners in the field of general
reinforcement learning, with theoretical results wherever necessary.
The current state of the art in FRL uses suffix trees, which have issues with
large observation spaces and long-term dependencies. We start by addressing the
issue of long-term dependency using a class of maps known as looping suffix
trees, which have previously been used to represent deterministic POMDPs. We
show the best existing results on the TMaze domain and good results on larger
domains that require long-term memory.
We introduce a new value-based cost function that can be evaluated model-free.
The value-based cost allows for smaller representations, and its model-free
nature allows for its extension to the function approximation setting, which
has computational and representational advantages for large state spaces. We
evaluate the performance of this new cost in both the tabular and function
approximation settings on a variety of domains, and show performance better
than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN.
When the environment is very large, an FRL agent needs to explore
systematically in order to find a good representation. However, it needs a good
representation in order to perform this systematic exploration. We decouple the
two by considering a different setting, one where the agent has access to the
value of any state-action pair from an oracle in a training phase. The agent
must learn an approximate representation of the optimal value function. We
formulate a regression-based solution using online learning methods to build
such an agent. We test this agent on the Arcade Learning Environment using a
simple class of linear function approximators.
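The oracle setting described above can be sketched as a small online
regression problem. This is only an illustration under stated assumptions, not
the thesis's method: the learner here is plain stochastic gradient descent on
squared error, and `train_linear_q`, `toy_features`, `toy_oracle` and
`toy_samples` are hypothetical names for a toy problem rather than anything
drawn from the Arcade Learning Environment.

```python
import random


def train_linear_q(features, oracle_q, samples, n_features,
                   lr=0.1, epochs=200, seed=0):
    """Fit weights w so that w . features(s, a) approximates oracle_q(s, a)."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for s, a in data:
            x = features(s, a)
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = oracle_q(s, a) - pred      # oracle value is the target
            for i, xi in enumerate(x):
                w[i] += lr * err * xi        # one online gradient step
    return w


# Toy problem (assumed): the oracle is exactly linear in the features.
def toy_features(s, a):
    return [s, float(a), 1.0]


def toy_oracle(s, a):
    return 2.0 * s - 1.0 * a + 0.5


toy_samples = [(s, a) for s in (0.0, 0.5, 1.0) for a in (0, 1)]
weights = train_linear_q(toy_features, toy_oracle, toy_samples, n_features=3)
```

Because the toy oracle is representable by the linear features, the learned
weights approach the oracle's coefficients; with a real environment the linear
class would only approximate the optimal value function.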
While we made progress on the issue of scalability, two major issues with the
FRL framework remain: the need for a stochastic search method to minimise the
objective function and the need to store an uncompressed history, both of which
can be very computationally demanding.