14 research outputs found

    Vanishing Bias Heuristic-guided Reinforcement Learning Algorithm

    Full text link
    Reinforcement Learning has achieved tremendous success in many Atari games. In this paper we explored the Lunar Lander environment and implemented classical methods including Q-Learning, SARSA and Monte Carlo, as well as tile coding. We also implemented neural-network-based methods including DQN, Double DQN and Clipped DQN. On top of these, we proposed a new algorithm called Heuristic RL, which uses a heuristic to guide early-stage training while alleviating the human bias it introduces. Our experiments showed promising results for our proposed methods in the Lunar Lander environment. Comment: Robotics; Reinforcement Learning
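    The abstract does not spell out how the heuristic guidance fades; the minimal Python sketch below shows one common way such a scheme can look, assuming a hypothetical hand-crafted heuristic_action function and a learned q_values lookup, with the heuristic's influence decaying over episodes so that the introduced bias vanishes.

```python
import random

def select_action(state, q_values, heuristic_action, episode,
                  beta0=1.0, decay=0.995, epsilon=0.05):
    """Heuristic-guided action selection with a vanishing bias (illustrative).

    With probability beta, which shrinks as training progresses, the agent
    follows the hand-crafted heuristic; otherwise it acts epsilon-greedily
    on its learned Q-values.
    """
    beta = beta0 * (decay ** episode)        # heuristic influence decays over time
    if random.random() < beta:
        return heuristic_action(state)       # early-stage guidance
    qs = q_values(state)                     # list of Q-values for this state
    if random.random() < epsilon:
        return random.randrange(len(qs))     # occasional exploration
    return max(range(len(qs)), key=qs.__getitem__)  # greedy on learned values
```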

    Automated Mixed Resolution Acyclic Tiling in Reinforcement Learning

    Get PDF
    This thesis presents novel work on how to automatically alter a Tile Coding while simultaneously learning, in order to improve both the quality of an agent's policy and its speed of learning. It also identifies the detrimental effects that transition cycles in an MDP have on Reinforcement Learning and Tile Coding.

    Reinforcement Learning (RL) (Sutton and Barto 1998) is a popular and widely studied machine learning technique in which an agent learns a policy through continual interaction with an environment, by performing actions and observing their rewards. In the basic RL formulation, in order to guarantee learning an optimal policy, an agent needs to visit each state in the environment at least once (and often repeatedly). For this reason the speed of learning does not scale well to complex environments with large state spaces. Tile Coding (TC) (Albus 1981) is a popular value function approximation method that reduces the size of a state space through approximation. In this approach, values from one or more state features are grouped into exhaustive partitions called tiles. However, the coarser this approximation makes the state space, the more the precision and quality of the learned policy may be reduced. As a rule of thumb, the larger the tiles in a tiling, the faster the agent arrives at its final policy but the lower its quality; the smaller the tiles, the slower the agent arrives at its final policy but the higher its quality. Furthermore, using multiple, offset tilings can improve performance without the need for smaller tiles.

    The guarantees surrounding common RL algorithms rely on being able to visit every state in the environment at least once. However, many implementations of these algorithms use episode roll-outs and can find themselves looping through a cycle of state-action pairs. This thesis shows, theoretically and empirically, that if the reward of each state-action pair in such a transition cycle is identical, the agent can temporarily diverge from learning the optimal policy. These detrimental effects of transition cycles can occur at any point of learning, and RL algorithms must therefore account for them or risk sudden, temporary lacklustre performance. Furthermore, we consider the use of TC in conjunction with RL and find that it aggravates the detrimental effects of transition cycles, because tiles can themselves induce transition cycles. Tile Coding remains an effective and efficient approximation method when the detrimental impacts of transition cycles are avoided.

    This motivates a novel strategy for manual tile placement called Mixed Resolution Acyclic Tiling (MRAT), based on heuristics derived from the theoretical work and empirical studies conducted in this thesis. MRAT is empirically demonstrated to be a very effective way of improving the speed and quality of learning by using non-uniform tile placement. MRAT is then automated and empirically shown to outperform state-of-the-art competitors and fixed TC. Automated MRAT (AMRAT) requires no parameter tuning and therefore, unlike its competitors, has no hidden costs for its use.
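    For readers unfamiliar with Tile Coding, the following Python sketch illustrates the idea of multiple offset tilings over a single continuous feature; the tiling sizes and variable names are illustrative and not taken from the thesis.

```python
import numpy as np

def tile_indices(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Return one active tile index per tiling for a scalar feature x.

    Each tiling partitions [lo, hi] into n_tiles intervals; successive
    tilings are offset by a fraction of a tile width, so together they
    give a finer effective resolution than any single tiling.
    """
    width = (hi - lo) / n_tiles
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings           # evenly spaced offsets
        idx = int((x - lo + offset) / width)
        idx = min(max(idx, 0), n_tiles)           # clamp at the upper edge
        active.append(t * (n_tiles + 1) + idx)    # unique index per tiling
    return active

# The approximate value of x is the sum of the weights of its active tiles.
weights = np.zeros(4 * 9)
value = sum(weights[i] for i in tile_indices(0.37))
```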

    Fair and Scalable Orchestration of Network and Compute Resources for Virtual Edge Services

    Get PDF
    The combination of service virtualization and edge computing allows for low-latency services while keeping data storage and processing local. However, given the limited resources available at the edge, a conflict in resource usage arises when both virtualized user applications and network functions need to be supported. Further, the concurrent resource requests of user applications and network functions are often entangled, since the data generated by the former have to be transferred by the latter, and vice versa. In this paper, we first show through experimental tests the correlation between a video-based application and a vRAN. Then, owing to the complex dynamics involved, we develop a scalable reinforcement learning framework for resource orchestration at the edge, which leverages a Pareto analysis for provably fair and efficient decisions. We validate our framework, named VERA, through a real-time proof-of-concept implementation, which we also use to obtain datasets reporting real-world operational conditions and performance. Using such experimental datasets, we demonstrate that VERA meets the KPI targets for over 96% of the observation period and performs similarly when executed in our real-time implementation, with KPI differences below 12.4%. Further, its scaling cost is 54% lower than that of a centralized framework based on deep Q-networks.
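    The Pareto analysis used by VERA is not detailed in the abstract; as a generic illustration of what selecting Pareto-efficient candidates over two resource metrics can look like (the metric names latency and cpu_share are hypothetical), consider:

```python
def pareto_front(candidates):
    """Keep (latency, cpu_share) candidates that are not dominated.

    A candidate is dominated if some other candidate is no worse in both
    metrics and strictly better in at least one of them.
    """
    front = []
    for i, (lat_i, cpu_i) in enumerate(candidates):
        dominated = any(
            lat_j <= lat_i and cpu_j <= cpu_i and (lat_j < lat_i or cpu_j < cpu_i)
            for j, (lat_j, cpu_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((lat_i, cpu_i))
    return front

# (15, 0.6) is dominated by (10, 0.5) and is filtered out.
print(pareto_front([(10, 0.5), (12, 0.4), (15, 0.6)]))
```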

    Reinforcement learning in a multi-agent framework for pedestrian simulation

    Get PDF
    The aim of this thesis is to use Reinforcement Learning to generate plausible pedestrian simulations in different environments. Methodology: a multi-agent framework has been developed in which each virtual agent learns a navigation behaviour through interaction with the virtual world it inhabits, alongside the remaining agents. The virtual world is simulated with a physics engine (ODE) calibrated with human-pedestrian parameters taken from the literature on the subject. The framework is flexible and supports different learning algorithms (specifically Q-Learning and Sarsa(lambda)) in combination with different state-space generalization techniques (specifically vector quantization and tile coding). The learned behaviours are analysed using fundamental diagrams (speed/density relation), density maps, chronograms and performance measures (the percentage of agents that reach the goal). Conclusions: after a battery of experiments in different scenarios (six distinct scenarios in total) and the corresponding analysis of results, the conclusions are the following: plausible pedestrian behaviours have been obtained; the behaviours are robust to scaling and exhibit abstraction capabilities (behaviours at the tactical and planning levels); the learned behaviours can generate emergent collective behaviours; and the comparison with a standard pedestrian model (Helbing's model), together with the analyses of fundamental diagrams, indicates that the learned dynamics are coherent and similar to real pedestrian dynamics.
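    The thesis gives the full algorithmic details; purely as an illustrative sketch of the tabular Sarsa(lambda) update with accumulating eligibility traces that such a framework builds on (the parameter values are placeholders, not the thesis's calibration):

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One tabular Sarsa(lambda) update with accumulating eligibility traces."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    E[s, a] += 1.0                                   # accumulate trace for (s, a)
    Q += alpha * delta * E                           # credit all recently visited pairs
    E *= gamma * lam                                 # decay all traces
    return Q, E
```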

    Reinforcement Learning approaches for Artificial Pancreas Control

    Get PDF
    People with type 1 diabetes are affected by a chronic deficiency of insulin secretion; as a consequence, insulin has to be continually self-administered to keep their blood glucose levels in check. In recent years, rapid technological advances in continuous glucose monitoring and insulin administration systems have allowed researchers to work on automated control methods for diabetes management, commonly referred to as the Artificial Pancreas. The development of control algorithms in this context is a very active research area. While traditional control approaches have been the main focus so far, Reinforcement Learning (RL) offers a compelling alternative framework that has not yet been thoroughly explored. This thesis investigates several RL approaches, based on the Sarsa(lambda) algorithm, on in silico patients, using the FDA-accepted UVa-Padova Type 1 Diabetes simulator. It discusses how the overall representation of the problem affects the performance of the system, underlining how each component fits into the general framework proposed and evaluating the pros and cons of each method. Particular emphasis is also placed on the interpretability of both the training process and the final policies obtained. Experimental results demonstrate that classic RL methods have the potential to be a viable future approach for achieving proper control and a good degree of personalization in glycemic regulation for diabetes management.

    Automated Reinforcement Learning: An Overview

    Get PDF
    Reinforcement Learning (RL) and, more recently, Deep Reinforcement Learning are popular methods for solving sequential decision-making problems modeled as Markov Decision Processes (MDPs). RL modeling of a problem and the selection of algorithms and hyper-parameters require careful consideration, as different configurations may yield completely different performance. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields, where researchers and system designers are not RL experts. Besides, many modeling decisions, such as defining the state and action space, the size of batches and frequency of batch updates, and the number of timesteps, are typically made manually. For these reasons, automating the different components of the RL framework is of great importance, and it has attracted much attention in recent years. Automated RL provides a framework in which the different components of RL, including MDP modeling, algorithm selection and hyper-parameter optimization, are modeled and defined automatically. In this article, we explore the literature and present recent work that can be used in automated RL. Moreover, we discuss the challenges, open questions and research directions in AutoRL.
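    The overview covers several AutoRL components; as a minimal, generic illustration of one of them, hyper-parameter optimization by random search over an RL training routine is sketched below (the train_and_evaluate function and the search-space values are hypothetical):

```python
import random

SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size":    [32, 64, 128],
    "update_freq":   [1, 4, 8],   # how often a batch update is performed
}

def random_search(train_and_evaluate, n_trials=20, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_evaluate(**cfg)   # e.g. mean episodic return
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```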

    Value Function Estimation in Optimal Control via Takagi-Sugeno Models and Linear Programming

    Full text link
    The present thesis employs dynamic programming and reinforcement learning techniques in order to obtain optimal policies for controlling nonlinear systems with discrete and continuous states and actions. Initially, a review of the basic concepts of dynamic programming and reinforcement learning is carried out for systems with a finite number of states. After that, the extension of these techniques to systems with a large number of states, or with continuous states, is analysed using approximation functions. The contributions of the thesis are:
    - A combined identification/Q-function fitting methodology, which involves identification of a Takagi-Sugeno model, computation of (sub)optimal controllers from Linear Matrix Inequalities, and the subsequent data-based fitting of the Q-function via monotonic optimisation.
    - A methodology for learning controllers using approximate dynamic programming via linear programming (ADP-LP). The methodology allows the ADP-LP approach to work in practical control applications with continuous state and input spaces. It estimates a lower bound and an upper bound of the optimal value function through functional approximators, and guidelines are provided for data selection and regressor regularisation in order to obtain satisfactory results while avoiding unbounded or ill-conditioned solutions.
    - A methodology, under the linear-programming approach to approximate dynamic programming, for obtaining a better approximation of the optimal value function in a specific region of the state space. The methodology gradually learns a policy using data available only in the exploration region; exploration progressively enlarges the learning region until a converged policy is obtained.
    This work was supported by the National Department of Higher Education, Science, Technology and Innovation of Ecuador (SENESCYT), and by the Spanish Ministry of Economy and the European Union, grant DPI2016-81002-R (AEI/FEDER, UE). The author also received a grant for a predoctoral stay, Programa de Becas Iberoamérica-Santander Investigación 2018, of the Santander Bank.
    Díaz Iza, HP. (2020). Value Function Estimation in Optimal Control via Takagi-Sugeno Models and Linear Programming [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/139135
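    The thesis works with continuous states and functional approximators; purely as a toy illustration of the linear-programming view of dynamic programming it builds on, the sketch below solves the exact LP for a small discrete MDP (the two-state transition data are invented for the example):

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP: 2 states, 2 actions. P[a][s, s'] are transition probabilities,
# R[s, a] are immediate rewards; the numbers are illustrative only.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.0, 1.0]])]
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.95
n_states, n_actions = R.shape

# LP form of dynamic programming: minimise sum_s V(s) subject to
#   V(s) >= R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')   for every (s, a).
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a][s].copy()
        row[s] -= 1.0                  # rewritten as gamma*P V - V(s) <= -R(s, a)
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * n_states)
print("Optimal value function:", res.x)
```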