13 research outputs found

    Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

    Full text link
    We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
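
    For context, the Poisson equation referred to in this assumption is the standard optimality equation of undiscounted (average-reward) reinforcement learning. In the usual notation (a sketch of the standard form, not a formula quoted from the paper), it reads

        \rho^* + h(s) = \max_{a \in \mathcal{A}} \left[ r(s,a) + \int_{\mathcal{S}} h(s') \, p(\mathrm{d}s' \mid s, a) \right]

    where \rho^* is the optimal average reward and h is the bias (relative value) function. Regret after T steps compares the accumulated reward against T\rho^*, and a sublinear bound means this gap grows more slowly than T.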

    Rlpy: A Value-Function-Based Reinforcement Learning Framework for Education and Research

    Get PDF
    RLPy is an object-oriented reinforcement learning software package with a focus on value-function-based methods using linear function approximation and discrete actions. The framework was designed for both educational and research purposes. It provides a rich library of fine-grained, easily exchangeable components for learning agents (e.g., policies or representations of value functions), facilitating recently increased specialization in reinforcement learning. RLPy is written in Python to allow fast prototyping, but is also suitable for large-scale experiments through its built-in support for optimized numerical libraries and parallelization. Code profiling, domain visualizations, and data analysis are integrated in a self-contained package available under the Modified BSD License at http://github.com/rlpy/rlpy. All of these properties allow users to compare various reinforcement learning algorithms with little effort.
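
    To make the component decomposition described above concrete, the following is a minimal generic sketch of a value-function-based setup with linear function approximation and discrete actions. It is not RLPy's actual API; all class and method names here are illustrative assumptions.

        import numpy as np

        class LinearQ:
            """Representation component: Q(s, a) = w[a] . phi(s)."""
            def __init__(self, n_features, n_actions, lr=0.1):
                self.w = np.zeros((n_actions, n_features))
                self.lr = lr

            def value(self, phi, a):
                return self.w[a] @ phi

            def update(self, phi, a, target):
                # Semi-gradient TD update toward a bootstrapped target,
                # e.g. the Q-learning target r + gamma * max_a' Q(phi', a').
                self.w[a] += self.lr * (target - self.value(phi, a)) * phi

        class EpsilonGreedy:
            """Policy component over discrete actions, exchangeable independently of the representation."""
            def __init__(self, q, n_actions, epsilon=0.1):
                self.q, self.n_actions, self.epsilon = q, n_actions, epsilon

            def act(self, phi, rng):
                if rng.random() < self.epsilon:
                    return int(rng.integers(self.n_actions))
                return int(np.argmax([self.q.value(phi, a) for a in range((self.n_actions))]))

    An agent would combine such a representation and policy with an environment loop; keeping the pieces separate in this way is the kind of fine-grained, exchangeable design the abstract describes.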

    Output-feedback online optimal control for a class of nonlinear systems

    Full text link
    In this paper, an output-feedback model-based reinforcement learning (MBRL) method for a class of second-order nonlinear systems is developed. The control technique uses exact model knowledge and integrates a dynamic state estimator within the model-based reinforcement learning framework to achieve output-feedback MBRL. Simulation results demonstrate the efficacy of the developed method.
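
    The general output-feedback idea is to reconstruct the unmeasured part of the state from the output and then control with the estimate. The toy sketch below illustrates only that idea for a generic second-order system; it is not the paper's MBRL controller, and the dynamics, gains, and function names are illustrative assumptions.

        import numpy as np

        # Generic second-order system: x1' = x2, x2' = f(x1, x2) + u, with only y = x1 measured.
        f = lambda x1, x2: -np.sin(x1) - 0.5 * x2          # hypothetical known dynamics

        def observer_step(xhat, y, u, dt, l1=20.0, l2=100.0):
            # Luenberger-style dynamic state estimator driven by the output error y - x1_hat.
            e = y - xhat[0]
            dx1 = xhat[1] + l1 * e
            dx2 = f(xhat[0], xhat[1]) + u + l2 * e
            return xhat + dt * np.array([dx1, dx2])

        def controller(xhat, k1=4.0, k2=4.0):
            # Exact-model feedback computed from the *estimated* state (output feedback).
            x1, x2 = xhat
            return -f(x1, x2) - k1 * x1 - k2 * x2

        # Closed-loop simulation (forward Euler).
        dt, x, xhat = 0.001, np.array([1.0, 0.0]), np.zeros(2)
        for _ in range(10000):
            y = x[0]                                       # measured output
            u = controller(xhat)
            xhat = observer_step(xhat, y, u, dt)
            x = x + dt * np.array([x[1], f(x[0], x[1]) + u])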

    Automated Mixed Resolution Acyclic Tiling in Reinforcement Learning

    Get PDF
    This thesis presents novel work on how to automatically alter a Tile Coding whilst simultaneously learning, in order to improve both the quality of an agent’s policy and its speed of learning. It also identifies the detrimental effects that transition cycles in an MDP have on Reinforcement Learning and Tile Coding.

    Reinforcement Learning (RL) (Sutton and Barto 1998) is a popular and widely studied machine learning technique in which an agent learns a policy through continual interaction with an environment, performing actions and observing their rewards. In the basic RL formulation, in order to guarantee learning an optimal policy, an agent needs to visit each state in the environment at least once (and often repeatedly). For this reason the speed of learning does not scale well to complex environments with large state spaces. Tile Coding (TC) (Albus 1981) is a popular value function approximation method that reduces the size of a state space through approximation. In this approach, values from one or more state features are grouped into exhaustive partitions called tiles. However, the coarser this partition becomes, the more the precision and quality of the policy the agent is learning can be reduced. As a rule of thumb, the larger the tiles in a tiling, the faster the agent arrives at its final policy but the lower its quality; the smaller the tiles, the slower the agent arrives at its final policy but the higher its quality. Furthermore, using multiple, offset tilings can improve performance without the need for smaller tiles.

    The guarantees surrounding common RL algorithms rely on being able to visit every state in the environment at least once. However, many implementations of these algorithms use episode roll-outs and can find themselves looping through a cycle of state-action pairs. This thesis shows, theoretically and empirically, that if the reward of each state-action pair in such a transition cycle is identical, the agent can temporarily diverge from learning the optimal policy. These detrimental effects of transition cycles can occur at any point of learning, so RL algorithms must heed them or risk sudden, temporary lacklustre performance. Furthermore, we consider the use of TC in conjunction with RL and find that it aggravates the detrimental effects of transition cycles on learning, because tiles can themselves induce transition cycles.

    Tile Coding remains an effective and efficient method of approximation when the detrimental impacts of transition cycles are avoided. This motivates a novel strategy for manual tile placement called Mixed Resolution Acyclic Tiling (MRAT), based on heuristics derived from the theoretical work and empirical studies conducted in this thesis. MRAT is empirically demonstrated to be a very effective way of improving the speed and quality of learning by using non-uniform tile placement. MRAT is then automated and empirically shown to outperform state-of-the-art competitors and fixed TC. Automated MRAT (AMRAT) does not require parameter tuning and therefore, unlike its competitors, has no hidden costs for its use.
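
    To make the terms above concrete, the following is a minimal, hand-written sketch of plain uniform tile coding with multiple offset tilings. It is not the thesis's MRAT/AMRAT method, and all names and parameters are illustrative assumptions.

        import numpy as np

        def tile_features(state, low, high, n_tilings=4, tiles_per_dim=8):
            """Binary feature vector built from several offset uniform tilings.

            state, low, high: 1-D arrays of equal length (one entry per state feature).
            Each tiling partitions the normalised state space into tiles_per_dim tiles
            per dimension, shifted by a different fraction of one tile width.
            """
            state = (np.asarray(state, dtype=float) - low) / (np.asarray(high, dtype=float) - low)
            dims = len(state)
            tiles_per_tiling = tiles_per_dim ** dims
            features = np.zeros(n_tilings * tiles_per_tiling)
            for t in range(n_tilings):
                offset = t / (n_tilings * tiles_per_dim)                   # per-tiling shift
                idx = np.floor((state + offset) * tiles_per_dim).astype(int)
                idx = np.clip(idx, 0, tiles_per_dim - 1)                   # one tile index per dimension
                flat = np.ravel_multi_index(idx, (tiles_per_dim,) * dims)
                features[t * tiles_per_tiling + flat] = 1.0                # exactly one active tile per tiling
            return features

        # Example: a 2-D state activates n_tilings tiles (one per tiling).
        phi = tile_features([0.3, -1.2], low=np.array([0.0, -2.0]), high=np.array([1.0, 2.0]))

    With a larger tiles_per_dim the tiles shrink (slower but more precise learning), which is the trade-off described above; the thesis's contribution is to place non-uniform, acyclic tiles automatically rather than fixing a uniform grid like this one.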