Feature reinforcement learning using looping suffix trees
There has recently been much interest in history-based methods that use suffix trees to solve POMDPs. However, these suffix trees cannot efficiently represent environments with long-term dependencies. We extend the recently introduced CTΦMDP algorithm to the space of looping suffix trees, which have previously been used only for solving deterministic POMDPs. The resulting algorithm replicates the results of CTΦMDP on environments with short-term dependencies, while outperforming LSTM-based methods on TMaze, a deep-memory environment.
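As a rough illustration only (not the paper's implementation), the following Python sketch shows one way a looping suffix tree can act as a history-to-state map: the tree branches on recent observations, and a leaf may "loop" back to an ancestor, so a finite tree can summarise arbitrarily long histories. All class and function names here are hypothetical.

```python
# Hypothetical sketch of a looping suffix tree used as a history -> state map.
# Internal nodes branch on the most recent unconsumed observation; a node may
# either name a state directly or loop back to an ancestor, letting a finite
# tree summarise arbitrarily long histories.

class LSTNode:
    def __init__(self, state_id=None):
        self.children = {}        # observation symbol -> LSTNode
        self.loop_target = None   # ancestor to jump to instead of terminating
        self.state_id = state_id  # MDP state assigned to this node (if any)

def map_history_to_state(root, history):
    """Walk the history from most recent to oldest symbol and return a state id."""
    node, i = root, len(history) - 1
    while i >= 0:
        sym = history[i]
        if sym in node.children:
            node = node.children[sym]
            i -= 1
            if node.loop_target is not None:   # follow the loop edge upward
                node = node.loop_target
        else:
            break
    return node.state_id

# Tiny usage example: a tree that reports whether the most recent non-'b'
# observation was 'a' (purely illustrative).
root = LSTNode()
root.children['a'] = LSTNode(state_id=0)
root.children['c'] = LSTNode(state_id=1)
root.children['b'] = LSTNode()            # on 'b', loop back and keep scanning
root.children['b'].loop_target = root
print(map_history_to_state(root, list("caabb")))  # -> 0
```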
Value Driven Representation for Human-in-the-Loop Reinforcement Learning
Interactive adaptive systems powered by Reinforcement Learning (RL) have many potential applications, such as intelligent tutoring systems. In such systems there is typically an external human system designer who creates, monitors and modifies the interactive adaptive system, trying to improve its performance on the target outcomes. In this paper we focus on the algorithmic foundations of helping the system designer choose the set of sensors or features that define the observation space used by the reinforcement learning agent. We present an algorithm, value driven representation (VDR), that can iteratively and adaptively augment the observation space of a reinforcement learning agent so that it is sufficient to capture a (near-)optimal policy. To do so we introduce a new method to optimistically estimate the value of a policy using offline simulated Monte Carlo rollouts. We evaluate the performance of our approach on standard RL benchmarks with simulated humans and demonstrate significant improvement over prior baselines.
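The abstract suggests an iterative grow-the-representation loop of roughly the following shape. This is a schematic sketch only, assuming a pool of candidate features, a trainable policy, and an optimistic rollout-based value estimator are available; the helper names are placeholders, not the paper's code.

```python
# Schematic sketch of a value-driven-representation style loop (placeholder
# names, not the paper's code): greedily grow the observation space until an
# optimistic, rollout-based value estimate of the learned policy stops improving.
#   learn_policy(features, data)              -> policy trained on `features`
#   optimistic_value(policy, features, data)  -> optimistic Monte Carlo estimate
# Both are supplied by the caller and treated as black boxes here.

def vdr_loop(candidate_features, data, learn_policy, optimistic_value, tol=1e-3):
    features = []
    policy = learn_policy(features, data)
    best_value = optimistic_value(policy, features, data)
    remaining = list(candidate_features)

    improved = True
    while improved and remaining:
        improved = False
        # Score each unused feature by the optimistic value of the policy
        # learned after adding it to the observation space.
        scored = [(optimistic_value(learn_policy(features + [f], data),
                                    features + [f], data), f)
                  for f in remaining]
        value, f = max(scored, key=lambda x: x[0])
        if value > best_value + tol:
            features.append(f)
            remaining.remove(f)
            policy = learn_policy(features, data)
            best_value, improved = value, True
    return features, policy
```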
Q-learning for history-based reinforcement learning
We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e. to POMDPs. We do this in a natural manner by adding l0 regularisation to the pathwise squared Q-learning objective function and then optimising this over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class, nested with classical Q-learning of the parameters. This algorithm fits naturally into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criterion used so far for feature reinforcement learning has been model-based, aimed at predicting future states and rewards. Instead, we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting in which features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly at learning the Q-function of the optimal policy, and we use l0 instead of l1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvements in convergence speed as well as in the economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward, using less than half the CPU time and 36 times less memory. Overall, our algorithm, hQL, provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
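Although the abstract does not state the objective explicitly, one plausible form of an l0-regularised pathwise squared Q-learning cost over a history-to-state map φ and its Q-function is shown below. This is an illustrative reconstruction under stated assumptions, not the paper's exact formula.

```latex
% Illustrative reconstruction (not the paper's exact formula): squared Bellman
% residuals summed along the observed trajectory, plus an l0 penalty.
\mathrm{Cost}(\phi, Q) \;=\;
  \sum_{t=1}^{T}
    \Bigl( Q\bigl(\phi(h_t), a_t\bigr)
         - \bigl( r_{t+1} + \gamma \max_{a'} Q\bigl(\phi(h_{t+1}), a'\bigr) \bigr)
    \Bigr)^{2}
  \;+\; \lambda \,\lVert Q \rVert_{0}
```

Here h_t is the history up to time t, and the l0 term penalises the number of distinct nonzero Q-values, so minimising jointly over φ and Q trades data fit against the size of the induced state representation.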
Reinforcement Learning in Robotic Task Domains with Deictic Descriptor Representation
In the field of reinforcement learning, robot task learning in a specific environment with a Markov decision process backdrop has seen much success. But extending these results to learning a task for an environment domain has not been as fruitful, even for advanced methodologies such as relational reinforcement learning. In our research into robot learning in environment domains, we utilize a form of deictic representation for the robot’s description of the task environment. However, the non-Markovian nature of the deictic representation leads to perceptual aliasing and conflicting actions, invalidating standard reinforcement learning algorithms. To circumvent this difficulty, several past research studies have modified and extended the Q-learning algorithm to the deictic representation case, with mixed results. Taking a different tack, we introduce a learning algorithm which searches deictic policy space directly, abandoning the indirect value-based methods. We apply the policy learning algorithm to several different tasks in environment domains. The results compare favorably with those of value-based learners and with existing literature results.
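As a generic illustration of direct policy search (not the authors' algorithm), a simple hill-climbing search over tabular policies defined on deictic percepts could look like the sketch below; all names are placeholders, and policy evaluation is assumed to come from Monte Carlo rollouts supplied by the caller.

```python
import random

# Generic hill-climbing search over a tabular policy on deictic percepts
# (placeholder sketch). `percepts` is the finite set of deictic descriptions,
# `actions` the action set, and `evaluate(policy)` returns an estimated return
# from Monte Carlo rollouts in the task environment.

def policy_search(percepts, actions, evaluate, iters=1000):
    policy = {p: random.choice(actions) for p in percepts}
    best = evaluate(policy)
    for _ in range(iters):
        # Propose a local change: remap one percept to a different action.
        p = random.choice(list(percepts))
        old = policy[p]
        policy[p] = random.choice([a for a in actions if a != old])
        score = evaluate(policy)
        if score >= best:
            best = score           # keep the improvement (or sideways move)
        else:
            policy[p] = old        # revert the change
    return policy, best
```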
Foundations of Trusted Autonomy
Trusted Autonomy; Automation Technology; Autonomous Systems; Self-Governance; Trusted Autonomous Systems; Design of Algorithms and Methodologies
Generic Reinforcement Learning Beyond Small MDPs
Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary.
The current state of the art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependencies using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory.
We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN.
When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation. However, it needs a good representation in order to perform this systematic exploration. We decouple the two by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle in a training phase. The agent must learn an approximate representation of the optimal value function. We formulate a regression-based solution based on online learning methods to build such an agent. We test this agent on the Arcade Learning Environment using a simple class of linear function approximators.
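A minimal sketch of the kind of regression-based agent described here, assuming a training-time oracle that returns optimal values Q*(s, a) and a fixed linear feature map, might look as follows; the names and the plain SGD update are illustrative, not the thesis implementation.

```python
import numpy as np

# Illustrative online regression of oracle values onto a linear function
# approximator: during training the agent queries an oracle for Q*(s, a),
# fits weights w so that phi(s, a) . w ~ Q*(s, a), and at test time acts
# greedily with respect to the learned approximation.

def train(oracle_q, phi, samples, lr=0.01):
    """samples: stream of visited (state, action) pairs; phi returns a feature vector."""
    w = None
    for s, a in samples:
        x = np.asarray(phi(s, a), dtype=float)
        if w is None:
            w = np.zeros_like(x)
        target = oracle_q(s, a)          # oracle value, available in training only
        error = target - x @ w
        w += lr * error * x              # stochastic gradient step on squared error
    return w

def greedy_action(w, phi, s, actions):
    """Act greedily with respect to the learned linear value approximation."""
    return max(actions, key=lambda a: np.asarray(phi(s, a), dtype=float) @ w)
```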
While we have made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function, and the need to store an uncompressed history, both of which can be very computationally demanding.
Feature reinforcement learning in practice
Following a recent surge in using history-based methods for resolving perceptual aliasing in reinforcement learning, we introduce an algorithm based on the feature reinforcement learning framework called ΦMDP [13]. To create a practical algorithm we dev