Feature reinforcement learning using looping suffix trees
There has recently been much interest in history-based methods that use suffix trees to solve POMDPs. However, these suffix trees cannot efficiently represent environments with long-term dependencies. We extend the recently introduced CTΦMDP algorithm to the space of looping suffix trees, which have previously been used only to solve deterministic POMDPs. The resulting algorithm replicates the results of CTΦMDP on environments with short-term dependencies, while outperforming LSTM-based methods on TMaze, a deep-memory environment.
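To make the map class concrete, here is a minimal sketch of a suffix-tree history-to-state map; the names Node and map_history_to_state are illustrative, not the paper's implementation. An ordinary suffix tree distinguishes histories only by a bounded recent suffix; a looping suffix tree lets a child edge point back to an ancestor, so a single leaf can summarise arbitrarily old events.

    # Hypothetical sketch: not the paper's data structure.
    class Node:
        def __init__(self, state_id=None):
            self.children = {}        # symbol -> Node; may point to an ancestor (a loop)
            self.state_id = state_id  # set on leaves: the aggregated MDP state

    def map_history_to_state(root, history):
        """Descend the tree on the most recent history symbols; the node
        where the descent stops identifies the aggregated MDP state."""
        node = root
        for symbol in reversed(history):
            child = node.children.get(symbol)
            if child is None:   # no deeper distinction needed: state found
                break
            node = child        # a loop edge here re-enters the tree higher up
        return node.state_id

On TMaze-like domains, such a loop edge is what allows the map to carry the observation seen at the start of the corridor across an arbitrarily long walk to the junction.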
Feature reinforcement learning: state of the art
Feature reinforcement learning was introduced five years ago as a principled and practical approach to history-based learning. This paper examines the progress since its inception. We now have both model-based and model-free cost functions, most recently extended to the function approximation setting. Our current work is geared towards playing Atari games using imitation learning, where we use Feature RL as a feature selection method for high-dimensional domains.
Q-learning for history-based reinforcement learning
We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e., to POMDPs. We do this in a natural manner by adding l0 regularisation to the pathwise squared Q-learning objective function and then optimising this over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class nested with classical Q-learning of the parameters. This algorithm fits perfectly into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criteria used so far for feature reinforcement learning have been model-based and aimed at predicting future states and rewards. Instead we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting where features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly at learning the Q-function of the optimal policy, and we use l0 instead of l1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvements in convergence speed as well as in economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward, while using less than half the CPU time and 36 times less memory. Overall, our algorithm hQL provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
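As an illustration of the nested optimisation, the following sketch assumes phi maps a history to a state, neighbour proposes a modified map (for instance by splitting or merging a suffix-tree node), and episodes is the recorded experience as (history, action, reward, next-history) tuples. The cost shown, the squared Q-learning error plus an l0-style penalty on the number of states, is a simplification of the paper's criterion, not its exact formula.

    import math, random

    ACTIONS = (0, 1)  # illustrative binary action space

    def q_learning_cost(phi, episodes, gamma=0.99, alpha=0.1, lam=1.0):
        """Squared Q-learning error of map phi along the recorded path,
        plus an l0-style penalty counting the states phi uses."""
        Q, sq_err, states = {}, 0.0, set()
        for h, a, r, h2 in episodes:
            s, s2 = phi(h), phi(h2)
            states.update((s, s2))
            target = r + gamma * max(Q.get((s2, b), 0.0) for b in ACTIONS)
            td = target - Q.get((s, a), 0.0)
            sq_err += td * td
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td   # inner Q-learning step
        return sq_err + lam * len(states)

    def anneal_over_maps(phi, neighbour, episodes, steps=500, temp=1.0):
        """Outer stochastic search over maps, nested with inner Q-learning."""
        cost = q_learning_cost(phi, episodes)
        for _ in range(steps):
            cand = neighbour(phi)          # e.g. split or merge a suffix-tree node
            c = q_learning_cost(cand, episodes)
            if c < cost or random.random() < math.exp((cost - c) / max(temp, 1e-9)):
                phi, cost = cand, c        # accept improving (or lucky) maps
            temp *= 0.99                   # cool the acceptance temperature
        return phi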
Cover Tree Bayesian Reinforcement Learning
This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high-dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.
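The exploration scheme can be pictured with the following sketch, in which posterior, plan and env are assumed interfaces; the paper's cover-tree/generalised-context-tree model is far richer than this outline suggests.

    def thompson_episode(posterior, plan, env, horizon=200):
        """One episode of Thompson-sampling exploration (hypothetical
        interfaces, not the paper's API)."""
        model = posterior.sample()         # draw one dynamics model from the posterior
        policy = plan(model)               # approximate dynamic programming on the draw
        s = env.reset()
        for _ in range(horizon):
            a = policy(s)
            s2, r, done = env.step(a)
            posterior.update(s, a, r, s2)  # closed-form conjugate update
            s = s2
            if done:
                break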
Approximate universal artificial intelligence and self-play learning for games
This thesis is split into two independent parts. The first is an investigation of some practical aspects of Marcus Hutter's Universal Artificial Intelligence theory. The main contributions are to show how a very general agent can be built and analysed using the mathematical tools of this theory. Before the work presented in this thesis, it was an open question whether this theory was of any relevance to reinforcement learning practitioners. This work suggests that it is indeed relevant and worthy of future investigation.

The second part of this thesis looks at self-play learning in two-player, deterministic, adversarial, turn-based games. The main contribution is the introduction of a new technique for training the weights of a heuristic evaluation function from data collected by classical game-tree search algorithms. This method is shown to outperform previous self-play training routines based on Temporal Difference learning when applied to the game of Chess. In particular, the main highlight was using this technique to construct a Chess program that learnt to play master-level Chess by tuning a set of initially random weights through self-play games.
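The training technique can be pictured as follows, with features and search_value as assumed interfaces: rather than bootstrapping from the successor position as Temporal Difference learning does, each visited position's linear evaluation is regressed towards the value returned by a game-tree search from that position.

    import numpy as np

    def update_weights(w, positions, features, search_value, lr=1e-3):
        """One pass of search-bootstrapped regression (a sketch under
        assumed interfaces, not the thesis's exact routine)."""
        for p in positions:                 # positions visited during self-play
            x = features(p)                 # feature vector of position p
            err = search_value(p) - w @ x   # search result as the training target
            w += lr * err * x               # one stochastic gradient step
        return w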
Artificial Intelligence Music Generators in Real Time Jazz Improvisation: a performer’s view
A highly controversial entrance of Artificial Intelligence (AI) music generators into the world of music composition and performance is currently advancing. Fruitful research from Music Information Retrieval, Neural Networks and Deep Learning, among other areas, is shaping this future. Embodied and non-embodied AI systems have stepped into the world of jazz in order to co-create idiomatic music improvisations. But how musical are these improvisations? This dissertation looks at the melodic improvisations produced by the OMax, ImproteK and Djazz (OID) AI generators through the lens of the elements of music, and it does so from a performer's point of view. The analysis is based mainly on the evaluation of already published results, as well as on a case study I carried out during the completion of this essay, which includes performance, listening and evaluation of generated improvisations of OMax. The essay also reflects upon philosophical issues and the cognitive foundations of emotion and meaning, and provides a comprehensive analysis of the functionality of OID.
Generic Reinforcement Learning Beyond Small MDPs
Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary.
The current state of the art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependency using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory.
We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN.
When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation; however, it needs a good representation in order to perform this systematic exploration. We decouple the two by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle in a training phase. The agent must learn an approximate representation of the optimal value function. We formulate a regression-based solution based on online learning methods to build such an agent. We test this agent on the Arcade Learning Environment using a simple class of linear function approximators.
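A minimal sketch of that training phase, with feats, oracle and visits as assumed interfaces: the agent fits a linear approximation of the optimal value function by online regression on the state-action pairs it visits, using the oracle's values as targets.

    import numpy as np

    def train_with_oracle(w, feats, oracle, visits, lr=1e-2):
        """Online regression against oracle values (hypothetical
        interfaces, not the thesis's exact formulation)."""
        for s, a in visits:              # state-action pairs seen in training
            x = feats(s, a)              # linear features of the pair
            err = oracle(s, a) - w @ x   # oracle value as the regression target
            w += lr * err * x            # one online least-squares step
        return w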
While we made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function and the need to store an uncompressed history, both of which can be very computationally demanding - …