3 research outputs found

    Q-learning for history-based reinforcement learning

    No full text
    We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e. to POMDPs. We do this in a natural manner by adding l0 regularisation to the pathwise squared Q-learning objective function and then optimising this over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class nested with classical Q-learning of the parameters. This algorithm fits perfectly into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criterion used so far for feature reinforcement learning has been model-based and aimed at predicting future states and rewards. Instead we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting where features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly for learning the Q-function of the optimal policy and we use l0 instead of l1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvement in convergence speed as well as in economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward. We use less than half the CPU time and 36 times less memory. Overall, our algorithm hQL provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
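
The idea in this abstract, scoring a history-to-state map by a squared Q-learning error plus a penalty on representation size, can be illustrated with a small sketch. It is not the authors' hQL implementation. The assumptions are ours: the map class is "keep the last k observations", the penalty counts distinct states, the stochastic search is replaced by enumerating a few values of k, Q-learning replays a fixed batch of episodes, and the toy cue task and all names (`phi`, `make_episodes`, `penalised_cost`) are hypothetical.

```python
from collections import defaultdict
from itertools import product

ACTIONS = (0, 1)

def phi(history, k):
    """Hypothetical history-to-state map: keep only the last k observations."""
    return tuple(history[-k:]) if k > 0 else ()

def make_episodes():
    """Toy cue task (an assumption, not a benchmark from the paper):
    observe a cue (0 or 1), then a corridor symbol 2; the final action
    earns reward 1 if it equals the cue."""
    episodes = []
    for cue, a0, a1 in product((0, 1), ACTIONS, ACTIONS):
        h0, h1 = [cue], [cue, 2]
        episodes.append([
            (h0, a0, 0.0, h1),                          # first step, no reward
            (h1, a1, 1.0 if a1 == cue else 0.0, None),  # terminal step
        ])
    return episodes

def td_error(q, k, transition, gamma):
    """One-step Q-learning error for a transition under the map phi(., k)."""
    hist, action, reward, next_hist = transition
    target = reward
    if next_hist is not None:
        s_next = phi(next_hist, k)
        target += gamma * max(q[(s_next, b)] for b in ACTIONS)
    return target - q[(phi(hist, k), action)]

def penalised_cost(episodes, k, gamma=0.9, alpha=0.1, sweeps=200, penalty=0.1):
    """Inner loop: tabular Q-learning on the states induced by phi(., k).
    Returns the summed squared TD error plus penalty * number of states."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for episode in episodes:
            for tr in episode:
                hist, action, _, _ = tr
                q[(phi(hist, k), action)] += alpha * td_error(q, k, tr, gamma)
    squared = sum(td_error(q, k, tr, gamma) ** 2 for ep in episodes for tr in ep)
    n_states = len({phi(tr[0], k) for ep in episodes for tr in ep})
    return squared + penalty * n_states

# Outer loop: enumerate candidate maps (standing in for a stochastic search).
episodes = make_episodes()
costs = {k: penalised_cost(episodes, k) for k in (0, 1, 2)}
best_k = min(costs, key=costs.get)
print(costs, best_k)
```

In this toy task the maps with k = 0 and k = 1 merge histories that need different final actions, so their squared error stays large; k = 2 separates them at the price of one more state, and the penalised cost selects it.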

    Generic Reinforcement Learning Beyond Small MDPs

    No full text
    Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary. The current state-of-the-art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependency using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory. We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN. When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation. However, it needs a good representation in order to perform this systematic exploration. We decouple the two by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle in a training phase. The agent must learn an approximate representation of the optimal value function. We formulate a regression-based solution based on online learning methods to build such an agent. We test this agent on the Arcade Learning Environment using a simple class of linear function approximators. While we made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function and the need to store an uncompressed history, both of which can be very computationally demanding.

    Safe Q-Learning on Complete History Spaces

    No full text
    In this article, we present an idea for solving deterministic partially observable Markov decision processes (POMDPs) based on a history space containing sequences of past observations and actions. A novel and sound technique for learning a Q-function on history spaces is developed and discussed. We analyze certain conditions under which a history-based approach is able to learn policies comparable to the optimal solution on belief states. The algorithm presented is model-free and can be combined with any method for learning history spaces. We also present a procedure able to learn history spaces especially suited for our Q-learning algorithm.