205 research outputs found

    Feature reinforcement learning using looping suffix trees

    There has recently been much interest in history-based methods that use suffix trees to solve POMDPs. However, these suffix trees cannot efficiently represent environments with long-term dependencies. We extend the recently introduced CTΦMDP algorithm to the space of looping suffix trees, which have previously been used only for solving deterministic POMDPs. The resulting algorithm replicates the results of CTΦMDP on environments with short-term dependencies, while outperforming LSTM-based methods on TMaze, a deep-memory environment.
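
    The following is a toy sketch, not the paper's construction, of how a looping suffix tree can act as a history-to-state map: the tree is walked over the most recent symbols of the history, and a loop edge jumps back to an ancestor so that a finite tree can skip over arbitrarily long irrelevant stretches. The names (LSTNode, state_of) and the (symbol, ancestor) loop representation are illustrative assumptions.

        class LSTNode:
            def __init__(self, state_id=None):
                self.state_id = state_id   # leaves correspond to aggregated MDP states
                self.children = {}         # symbol -> child node
                self.loop = None           # optional (symbol, ancestor) back-edge

        def state_of(root, history):
            """Map a history to a state by reading it backwards (most recent
            symbol first) and descending the tree; loop edges give the map
            long-term memory while keeping the number of states finite."""
            node = root
            for symbol in reversed(history):
                if symbol in node.children:
                    node = node.children[symbol]
                elif node.loop is not None and node.loop[0] == symbol:
                    node = node.loop[1]    # follow the loop back up the tree
                else:
                    break                  # unmatched symbol: stop here
            return node.state_id

    On a TMaze-like history, for instance, a loop on the corridor observation lets the resulting state still depend on the signal observed at the start of the corridor, however long the corridor is.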

    Feature reinforcement learning: state of the art

    Feature reinforcement learning was introduced five years ago as a principled and practical approach to history-based learning. This paper examines the progress made since its inception. We now have both model-based and model-free cost functions, most recently extended to the function approximation setting. Our current work is geared towards playing Atari games using imitation learning, where we use Feature RL as a feature selection method for high-dimensional domains.

    Q-learning for history-based reinforcement learning

    We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e. to POMDPs. We do this in a natural manner by adding l0 regularisation to the pathwise squared Q-learning objective function and then optimising it over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class, nested with classical Q-learning of the parameters. This algorithm fits perfectly into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criterion used so far for feature reinforcement learning has been model-based and aimed at predicting future states and rewards. Instead, we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting where features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal-difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly at learning the Q-function of the optimal policy, and we use l0 instead of l1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvements in convergence speed as well as in economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward, while using less than half the CPU time and 36 times less memory. Overall, our algorithm hQL provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
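
    As an illustration only (not the authors' code), the nested optimisation described above can be sketched as an outer stochastic search over history-to-state maps, with tabular Q-learning run for each candidate map and the map scored by a pathwise squared Q-learning error plus an l0-style penalty. The penalty used here (the number of states the map actually visits), the interface phi(history, t), and the layout history[t] = (observation, action, reward) are simplifying assumptions.

        import random

        def q_learning(phi, history, n_states, n_actions, gamma=0.99, alpha=0.1, sweeps=20):
            """Tabular Q-learning over the state sequence induced by the map phi.
            phi(history, t) must return an integer in range(n_states)."""
            Q = [[0.0] * n_actions for _ in range(n_states)]
            for _ in range(sweeps):
                for t in range(len(history) - 1):
                    s, a, r = phi(history, t), history[t][1], history[t][2]
                    s_next = phi(history, t + 1)
                    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            return Q

        def cost(phi, Q, history, gamma=0.99, penalty=1.0):
            """Pathwise squared Q-learning error plus a simple l0-style penalty
            (here: the number of distinct states the map uses on the history)."""
            err, used = 0.0, set()
            for t in range(len(history) - 1):
                s, a, r = phi(history, t), history[t][1], history[t][2]
                s_next = phi(history, t + 1)
                used.add(s)
                err += (r + gamma * max(Q[s_next]) - Q[s][a]) ** 2
            return err + penalty * len(used)

        def stochastic_search(candidate_maps, history, n_states, n_actions, iters=100):
            """Outer search over the map class: sample a candidate map, learn its
            Q-values, and keep it if it lowers the cost.  A real implementation
            would propose local modifications of the current map instead."""
            best = random.choice(candidate_maps)
            best_cost = cost(best, q_learning(best, history, n_states, n_actions), history)
            for _ in range(iters):
                cand = random.choice(candidate_maps)
                c = cost(cand, q_learning(cand, history, n_states, n_actions), history)
                if c < best_cost:
                    best, best_cost = cand, c
            return best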

    Cover Tree Bayesian Reinforcement Learning

    This paper proposes an online tree-based Bayesian approach to reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high-dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model make it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least-squares policy iteration.
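
    A minimal sketch of the control loop described above (sample a model from the posterior, plan on it with approximate dynamic programming, act, and update the belief), assuming hypothetical posterior, plan and env interfaces; the cover-tree / generalised context tree model itself is not reproduced here.

        def thompson_sampling_control(posterior, plan, env, n_episodes=100, horizon=200):
            """posterior: has .sample_model() and .update(s, a, r, s_next) (closed-form update)
               plan:      approximate dynamic programming routine, model -> policy (state -> action)
               env:       has .reset() -> s and .step(a) -> (s_next, r, done)"""
            for _ in range(n_episodes):
                model = posterior.sample_model()     # one draw from the current belief
                policy = plan(model)                 # plan against the sampled model
                s = env.reset()
                for _ in range(horizon):
                    a = policy(s)
                    s_next, r, done = env.step(a)
                    posterior.update(s, a, r, s_next)
                    s = s_next
                    if done:
                        break

    In posterior-sampling approaches of this kind, drawing a single model per episode rather than per step is what lets the agent commit to temporally extended exploration.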

    Approximate universal artificial intelligence and self-play learning for games

    This thesis is split into two independent parts. The first is an investigation of some practical aspects of Marcus Hutter's Universal Artificial Intelligence theory. The main contributions are to show how a very general agent can be built and analysed using the mathematical tools of this theory. Before the work presented in this thesis, it was an open question whether this theory was of any relevance to reinforcement learning practitioners. This work suggests that it is indeed relevant and worthy of future investigation. The second part of this thesis looks at self-play learning in two-player, deterministic, adversarial, turn-based games. The main contribution is the introduction of a new technique for training the weights of a heuristic evaluation function from data collected by classical game-tree search algorithms. This method is shown to outperform previous self-play training routines based on temporal-difference learning when applied to the game of Chess. In particular, the main highlight was using this technique to construct a Chess program that learnt to play master-level Chess by tuning a set of initially random weights from self-play games.
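
    A toy sketch of the idea in the second part, under assumptions and not the thesis's exact procedure: after running a game-tree search from a position with the current evaluation function, the weights of a linear evaluation are nudged toward the value the search returned, so the static evaluation gradually absorbs the knowledge produced by deep search. The interfaces features(p) and search_value(p, weights) are hypothetical.

        import numpy as np

        def train_eval_from_search(positions, features, search_value, weights, lr=1e-3, epochs=5):
            """features(p) -> feature vector (np.ndarray, same length as weights);
            search_value(p, weights) -> value backed up by a game-tree search that
            uses the current linear evaluation np.dot(weights, features(.)) at its leaves."""
            for _ in range(epochs):
                for p in positions:
                    x = features(p)
                    target = search_value(p, weights)   # value produced by deep search
                    pred = float(np.dot(weights, x))    # static evaluation of position p
                    weights = weights + lr * (target - pred) * x
            return weights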

    Artificial Intelligence Music Generators in Real Time Jazz Improvisation: a performer’s view

    A highly controversial entrance of Artificial Intelligence (AI) music generators into the world of music composition and performance is currently advancing. Fruitful research from Music Information Retrieval, Neural Networks and Deep Learning, among other areas, is shaping this future. Embodied and non-embodied AI systems have stepped into the world of jazz in order to co-create idiomatic music improvisations. But how musical are these improvisations? This dissertation looks at the melodic improvisations produced by the OMax, ImproteK and Djazz (OID) AI generators through the lens of the elements of music, and it does so from a performer's point of view. The analysis is based mainly on the evaluation of already published results, as well as on a case study I carried out during the completion of this essay, which includes performing, listening to and evaluating improvisations generated by OMax. The essay also reflects upon philosophical issues and the cognitive foundations of emotion and meaning, and provides a comprehensive analysis of the functionality of OID.

    Generic Reinforcement Learning Beyond Small MDPs

    Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and for larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary. The current state of the art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependencies using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory. We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the POCMAN domain. When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation; however, it needs a good representation in order to perform this systematic exploration. We decouple the two by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle during a training phase. The agent must learn an approximate representation of the optimal value function. We formulate a regression-based solution based on online learning methods to build such an agent. We test this agent on the Arcade Learning Environment using a simple class of linear function approximators. While we have made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function, and the need to store an uncompressed history, both of which can be very computationally demanding.
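
    As a rough illustration of the oracle-supervised training phase described above (an assumption-laden sketch, not the thesis implementation): the agent repeatedly queries a hypothetical oracle_q(s, a) for the value of a state-action pair and fits a linear approximation of the optimal value function by online regression over hypothetical features(s, a).

        import numpy as np

        def fit_value_from_oracle(sample_state_action, features, oracle_q, dim, lr=1e-2, steps=10000):
            """sample_state_action() -> (s, a); features(s, a) -> length-dim np.ndarray;
            oracle_q(s, a) -> scalar value of the optimal Q-function at (s, a)."""
            w = np.zeros(dim)
            for _ in range(steps):
                s, a = sample_state_action()
                x = features(s, a)
                w += lr * (oracle_q(s, a) - np.dot(w, x)) * x   # online least-squares step
            return w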