14 research outputs found

    Finite-Sample Analysis of Bellman Residual Minimization

    We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual defined on a set of n states drawn i.i.d. from a distribution, together with the immediate rewards and next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual at a rate of order O(1/sqrt(n)). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual, which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions.
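
    As a rough illustration of the estimator analyzed above, the sketch below minimizes an empirical quadratic Bellman residual over a linear feature space for a fixed policy. The feature map phi, the generative model sample_model and the discount gamma are assumed placeholders, and the single next-state sample per state is only a simplification, not the paper's exact construction.

```python
import numpy as np

def brm_linear(states, phi, sample_model, gamma=0.95):
    """Minimize an empirical quadratic Bellman residual over a linear space.

    states       : (n, d) array of i.i.d. sampled states
    phi          : feature map, state -> (k,) feature vector
    sample_model : generative model, state -> (reward, next_state)
    Returns the weight vector w of the value estimate V(s) = phi(s) @ w.
    """
    rewards, next_states = zip(*(sample_model(s) for s in states))
    Phi = np.array([phi(s) for s in states])            # (n, k)
    Phi_next = np.array([phi(s) for s in next_states])  # (n, k)
    r = np.array(rewards)                               # (n,)
    # The empirical residual ||(Phi - gamma * Phi_next) w - r||^2 is a least-squares problem.
    # With a generative model, several next states per state could be drawn to reduce
    # the noise of this single-sample residual estimate.
    A = Phi - gamma * Phi_next
    w, *_ = np.linalg.lstsq(A, r, rcond=None)
    return w
```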

    Batch Policy Learning under Constraints

    When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
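
    One common way to instantiate such a meta-algorithm is a Lagrangian game between a batch RL "best-response" player and an online learner over constraint multipliers. The sketch below assumes hypothetical subroutines batch_rl and evaluate (the latter standing in for the OPE step) and is not the paper's exact procedure.

```python
import numpy as np

def constrained_batch_policy_learning(batch_rl, evaluate, constraint_limits,
                                      iterations=50, lr=0.1):
    """Lagrangian-style meta-algorithm for batch policy learning under constraints
    (an illustrative sketch, not the paper's exact algorithm).

    batch_rl(lambdas) : batch RL subroutine returning a policy that is approximately
                        optimal for the scalarized objective
                        main_reward - lambdas . constraint_costs
    evaluate(policy)  : off-policy evaluation returning (main_value, constraint_costs)
    constraint_limits : allowed budgets for the constraint costs
    """
    limits = np.asarray(constraint_limits, dtype=float)
    lambdas = np.zeros(len(limits))  # Lagrange multipliers
    policies, values = [], []
    for _ in range(iterations):
        policy = batch_rl(lambdas)            # best response to current multipliers
        main_value, costs = evaluate(policy)  # certified via OPE in the paper
        policies.append(policy)
        values.append(main_value)
        # Online (projected gradient) update: raise multipliers on violated constraints.
        lambdas = np.maximum(0.0, lambdas + lr * (np.asarray(costs) - limits))
    # The iterates form a randomized (mixture) policy.
    return policies, values, lambdas
```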

    Boosted Fitted Q-Iteration

    This paper studies B-FQI, an Approximated Value Iteration (AVI) algorithm that exploits a boosting procedure to estimate the action-value function in reinforcement learning problems. B-FQI is an iterative off-line algorithm that, given a dataset of transitions, builds an approximation of the optimal action-value function by summing the approximations of the Bellman residuals across all iterations. The advantage of this approach with respect to other AVI methods is twofold: (1) while keeping the same function space at each iteration, B-FQI can represent more complex functions by considering an additive model; (2) since the Bellman residual decreases as the optimal value function is approached, the regression problems become easier as iterations proceed. We study B-FQI both theoretically, also providing a finite-sample error upper bound for it, and empirically, by comparing its performance to that of FQI in different domains and using different regression techniques.
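
    The additive construction described above can be sketched as follows: at each iteration a regressor is fitted to the Bellman residual of the current sum of regressors and appended to the ensemble. The tree regressor, the state-action feature encoding and the hyperparameters are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fqi(transitions, n_actions, iterations=20, gamma=0.99):
    """Sketch of a boosted fitted Q-iteration in the spirit of B-FQI.

    transitions: list of (state, action, reward, next_state) with array-like states.
    The action-value estimate is the sum of per-iteration residual regressors.
    """
    ensemble = []  # each regressor is fitted to the Bellman residual of the current sum

    def q_values(state):
        xa = [np.append(state, a) for a in range(n_actions)]
        return sum((reg.predict(xa) for reg in ensemble), np.zeros(n_actions))

    X = np.array([np.append(s, a) for s, a, _, _ in transitions])
    for _ in range(iterations):
        # Bellman residual of the current additive estimate on the dataset
        # (at the first iteration the estimate is zero, so the targets are the rewards).
        targets = np.array([
            r + gamma * q_values(s_next).max() - q_values(s)[a]
            for s, a, r, s_next in transitions
        ])
        reg = DecisionTreeRegressor(max_depth=5).fit(X, targets)
        ensemble.append(reg)  # Q_{k+1} = Q_k + fitted residual
    return q_values
```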

    Reinforcement learning in continuous state and action spaces

    Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more involved. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, in this chapter we mainly focus on methods that explicitly update a representation of a value function, a policy or both. We discuss considerations in choosing an appropriate representation for these functions and discuss gradient-based and gradient-free ways to update the parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms. Amongst others, we cover gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms and actor-critic methods. We discuss the advantages of the different approaches and empirically compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy.
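
    As a concrete example of the actor-critic family discussed in the chapter, the sketch below performs one online update with linear features for continuous states and a Gaussian policy over a scalar continuous action. The step sizes, the feature map phi and the fixed exploration noise sigma are illustrative assumptions, not the chapter's exact algorithm.

```python
import numpy as np

def gaussian_actor_critic_step(theta, w, phi, s, a, r, s_next,
                               gamma=0.99, alpha_actor=1e-3, alpha_critic=1e-2, sigma=0.5):
    """One actor-critic update with linear features and a Gaussian policy.

    theta : actor parameters, mean action = theta . phi(s)
    w     : critic parameters, state value  = w . phi(s)
    phi   : feature map for continuous states
    """
    x, x_next = phi(s), phi(s_next)
    # TD error from the linear critic.
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)
    # Critic: semi-gradient TD(0) update.
    w = w + alpha_critic * delta * x
    # Actor: policy-gradient update;
    # grad log N(a | theta.x, sigma^2) = ((a - theta.x) / sigma^2) * x
    theta = theta + alpha_actor * delta * ((a - np.dot(theta, x)) / sigma**2) * x
    return theta, w
```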

    Linear Reinforcement Learning with Options

    The thesis deals with linear approaches to Markov Decision Processes (MDPs). In particular, we describe Policy Evaluation (PE) methods and Value Iteration (VI) methods that work with representations of MDPs that are compressed using a linear operator. We then use these methods in the context of the options framework, which is a way of employing temporal abstraction to speed up MDP solving. The main novel contributions are: the analysis of convergence of the linear compression framework, a condition for when a linear compression framework is optimal, an in-depth analysis of the LSTD algorithm, the formulation of value iteration with options in the linear framework, and the combination of linear state aggregation and options.
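
    For reference, a generic (uncompressed) LSTD policy-evaluation routine, one of the algorithms the thesis analyzes, can be sketched as follows; the ridge term and the feature map are illustrative assumptions.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.95, reg=1e-6):
    """Standard LSTD for policy evaluation with linear features (a generic sketch).

    transitions: list of (state, reward, next_state) generated by the evaluated policy.
    phi        : feature map, state -> (k,) vector.
    Returns weights w such that V(s) ~= phi(s) . w.
    """
    k = len(phi(transitions[0][0]))
    A = reg * np.eye(k)   # small ridge term for numerical stability
    b = np.zeros(k)
    for s, r, s_next in transitions:
        x, x_next = phi(s), phi(s_next)
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```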

    Regularized approximate policy iteration using kernel for on-line reinforcement learning

    By using Reinforcement Learning (RL), an autonomous agent interacting with the environment can learn how to take adequate actions in every situation in order to optimally achieve its own goal. RL provides a general methodology able to solve uncertain and complex decision problems that arise in many real-world applications. RL problems are usually modeled as Markov Decision Processes (MDPs), which have been deeply studied in the literature. The main peculiarity of an RL algorithm is that the agent is assumed to learn optimal policies from its experience without knowing the parameters of the MDP. The key element in solving the MDP is learning a value function, which gives the total reward an agent can expect from its current state when taking a given action. This value function allows one to obtain the optimal policy.

    In this thesis we study the capacity of Support Vector Regression (SVR) with kernel methods to adapt to and solve complex RL problems in large or continuous state spaces. SVR can be studied through a geometrical interpretation in terms of an optimal margin, or it can be seen as a regularization problem posed in a Reproducing Kernel Hilbert Space (RKHS). SVR has good generalization properties and, being based on a convex optimization problem, does not suffer from sub-optimality. SVR is non-parametric and automatically adapts to the complexity of the problem. Accordingly, applying SVR to approximate value functions appears to be a good approach. SVR can be solved either in batch mode, when the whole set of training samples is available to the learning agent, or incrementally, which enables the addition or removal of training samples very effectively. Incremental SVR finds the appropriate KKT conditions for new or updated data by modifying their influence on the regression function while maintaining consistency of the KKT conditions for the rest of the data used for learning. In RL problems an incremental SVR should be able to approximate the action-value function leading to the optimal policy. Accordingly, the computational load should be lower, learning faster and generalization more effective than with other existing methods.

    The overall contribution of our work is to develop, formalize, implement and study a new RL technique for generalization in discrete and continuous state spaces with finite actions. Our method uses the Approximate Policy Iteration (API) framework with the Bellman Residual Minimization (BRM) criterion, which allows the action-value function to be represented using SVR. This is, to our knowledge, the first RL approach using SVR that is compatible with the agent-environment interaction framework of RL, and it shows its power by solving a large number of benchmark problems, including very difficult ones such as the bicycle driving and riding control problem. In addition, unlike most RL approaches to generalization, we provide a proof establishing theoretical bounds for the convergence of the method to the optimal solution under given conditions.
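
    A greatly simplified way to combine API with an SVR action-value approximator is sketched below; it uses a plain fitted evaluation step rather than the thesis's BRM-based criterion and incremental SVR, and the kernel and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def api_with_svr(transitions, n_actions, iterations=10, gamma=0.99):
    """Generic approximate policy iteration with an SVR action-value approximator
    (an illustrative sketch only).

    transitions: list of (state, action, reward, next_state).
    Returns a greedy policy: state -> action index.
    """
    X = np.array([np.append(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    model = None

    def q(state):
        xa = np.array([np.append(state, a) for a in range(n_actions)])
        return np.zeros(n_actions) if model is None else model.predict(xa)

    for _ in range(iterations):
        # Greedy policy improvement followed by one regression-based evaluation step.
        targets = rewards + gamma * np.array([q(s_next).max()
                                              for _, _, _, s_next in transitions])
        model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, targets)
    return lambda state: int(np.argmax(q(state)))
```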