14 research outputs found

    Vanishing Bias Heuristic-guided Reinforcement Learning Algorithm

    Full text link
    Reinforcement Learning has achieved tremendous success in many Atari games. In this paper we explored the Lunar Lander environment and implemented classical methods including Q-Learning, SARSA and Monte Carlo, as well as tile coding. We also implemented neural-network-based methods including DQN, Double DQN and Clipped DQN. On top of these, we proposed a new algorithm called Heuristic RL, which uses a heuristic to guide early-stage training while alleviating the human bias it introduces. Our experiments showed promising results for our proposed methods in the Lunar Lander environment. Comment: Robotics; Reinforcement Learning
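    The abstract does not spell out how the heuristic guidance fades; the minimal Python sketch below shows one common way such a scheme can look, assuming a hypothetical hand-crafted heuristic_action function and a learned q_values lookup, with the heuristic's influence decaying over episodes so that the introduced bias vanishes.

```python
import random

def select_action(state, q_values, heuristic_action, episode,
                  beta0=1.0, decay=0.995, epsilon=0.05):
    """Heuristic-guided action selection with a vanishing bias (illustrative).

    With probability beta, which shrinks as training progresses, the agent
    follows the hand-crafted heuristic; otherwise it acts epsilon-greedily
    on its learned Q-values.
    """
    beta = beta0 * (decay ** episode)        # heuristic influence decays over time
    if random.random() < beta:
        return heuristic_action(state)       # early-stage guidance
    qs = q_values(state)                     # list of Q-values for this state
    if random.random() < epsilon:
        return random.randrange(len(qs))     # occasional exploration
    return max(range(len(qs)), key=qs.__getitem__)  # greedy on learned values
```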

    Automated Mixed Resolution Acyclic Tiling in Reinforcement Learning

    Get PDF
    This thesis presents novel work on how to automatically alter a Tile Coding while simultaneously learning, in order to improve both the quality of an agent's policy and its speed of learning. It also identifies the detrimental effects that transition cycles in an MDP have on Reinforcement Learning and Tile Coding.

    Reinforcement Learning (RL) (Sutton and Barto 1998) is a popular and widely studied machine learning technique in which an agent learns a policy through continual interaction with an environment, by performing actions and observing their rewards. In the basic RL formulation, in order to guarantee learning an optimal policy, an agent needs to visit each state in the environment at least once (and often repeatedly). For this reason the speed of learning does not scale well to complex environments with large state spaces. Tile Coding (TC) (Albus 1981) is a popular value function approximation method that reduces the size of a state space through approximation. In this approach, values from one or more state features are grouped into exhaustive partitions called tiles. However, the coarser this approximation makes the state space, the more the precision and quality of the learned policy may be reduced. As a rule of thumb, the larger the tiles in a tiling, the faster the agent arrives at its final policy but the lower its quality; the smaller the tiles, the slower the agent arrives at its final policy but the higher its quality. Furthermore, using multiple, offset tilings can improve performance without the need for smaller tiles.

    The guarantees surrounding common RL algorithms rely on being able to visit every state in the environment at least once. However, many implementations of these algorithms use episode roll-outs and can find themselves looping through a cycle of state-action pairs. This thesis shows, theoretically and empirically, that if the reward of each state-action pair in such a transition cycle is identical, the agent can temporarily diverge from learning the optimal policy. These detrimental effects of transition cycles can occur at any point of learning, and RL algorithms must therefore account for them or risk sudden, temporary lacklustre performance. Furthermore, we consider the use of TC in conjunction with RL and find that it aggravates the detrimental effects of transition cycles, because tiles can themselves induce transition cycles. Tile Coding remains an effective and efficient approximation method when the detrimental impacts of transition cycles are avoided.

    This motivates a novel strategy for manual tile placement called Mixed Resolution Acyclic Tiling (MRAT), based on heuristics derived from the theoretical work and empirical studies conducted in this thesis. MRAT is empirically demonstrated to be a very effective way of improving the speed and quality of learning by using non-uniform tile placement. MRAT is then automated and empirically shown to outperform state-of-the-art competitors and fixed TC. Automated MRAT (AMRAT) requires no parameter tuning and therefore, unlike its competitors, has no hidden costs for its use.
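    For readers unfamiliar with Tile Coding, the following Python sketch illustrates the idea of multiple offset tilings over a single continuous feature; the tiling sizes and variable names are illustrative and not taken from the thesis.

```python
import numpy as np

def tile_indices(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Return one active tile index per tiling for a scalar feature x.

    Each tiling partitions [lo, hi] into n_tiles intervals; successive
    tilings are offset by a fraction of a tile width, so together they
    give a finer effective resolution than any single tiling.
    """
    width = (hi - lo) / n_tiles
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings           # evenly spaced offsets
        idx = int((x - lo + offset) / width)
        idx = min(max(idx, 0), n_tiles)           # clamp at the upper edge
        active.append(t * (n_tiles + 1) + idx)    # unique index per tiling
    return active

# The approximate value of x is the sum of the weights of its active tiles.
weights = np.zeros(4 * 9)
value = sum(weights[i] for i in tile_indices(0.37))
```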

    Fair and Scalable Orchestration of Network and Compute Resources for Virtual Edge Services

    Get PDF
    The combination of service virtualization and edge computing allows for low-latency services while keeping data storage and processing local. However, given the limited resources available at the edge, a conflict in resource usage arises when both virtualized user applications and network functions need to be supported. Further, the concurrent resource requests of user applications and network functions are often entangled, since the data generated by the former have to be transferred by the latter, and vice versa. In this paper, we first show through experimental tests the correlation between a video-based application and a vRAN. Then, owing to the complex dynamics involved, we develop a scalable reinforcement learning framework for resource orchestration at the edge, which leverages a Pareto analysis for provably fair and efficient decisions. We validate our framework, named VERA, through a real-time proof-of-concept implementation, which we also use to obtain datasets reporting real-world operational conditions and performance. Using such experimental datasets, we demonstrate that VERA meets the KPI targets for over 96% of the observation period and performs similarly when executed in our real-time implementation, with KPI differences below 12.4%. Further, its scaling cost is 54% lower than that of a centralized framework based on deep Q-networks.
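    The Pareto analysis used by VERA is not detailed in the abstract; as a generic illustration of what selecting Pareto-efficient candidates over two resource metrics can look like (the metric names latency and cpu_share are hypothetical), consider:

```python
def pareto_front(candidates):
    """Keep (latency, cpu_share) candidates that are not dominated.

    A candidate is dominated if some other candidate is no worse in both
    metrics and strictly better in at least one of them.
    """
    front = []
    for i, (lat_i, cpu_i) in enumerate(candidates):
        dominated = any(
            lat_j <= lat_i and cpu_j <= cpu_i and (lat_j < lat_i or cpu_j < cpu_i)
            for j, (lat_j, cpu_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((lat_i, cpu_i))
    return front

# (15, 0.6) is dominated by (10, 0.5) and is filtered out.
print(pareto_front([(10, 0.5), (12, 0.4), (15, 0.6)]))
```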

    Reinforcement learning in a multi-agent framework for pedestrian simulation

    Get PDF
    The aim of this thesis is to use Reinforcement Learning to generate plausible pedestrian simulations in different environments. Methodology: a multi-agent framework has been developed in which each virtual agent learns a navigation behaviour through interaction with the virtual world it inhabits, alongside the remaining agents. The virtual world is simulated with a physics engine (ODE) calibrated with human-pedestrian parameters taken from the literature on the subject. The framework is flexible and supports different learning algorithms (specifically Q-Learning and Sarsa(lambda)) in combination with different state-space generalization techniques (specifically vector quantization and tile coding). The learned behaviours are analysed using fundamental diagrams (speed/density relation), density maps, chronograms and performance measures (the percentage of agents that reach the goal). Conclusions: after a battery of experiments in different scenarios (six distinct scenarios in total) and the corresponding analysis of results, the conclusions are the following: plausible pedestrian behaviours have been obtained; the behaviours are robust to scaling and exhibit abstraction capabilities (behaviours at the tactical and planning levels); the learned behaviours can generate emergent collective behaviours; and the comparison with a standard pedestrian model (Helbing's model), together with the analyses of fundamental diagrams, indicates that the learned dynamics are coherent and similar to real pedestrian dynamics.
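    The thesis gives the full algorithmic details; purely as an illustrative sketch of the tabular Sarsa(lambda) update with accumulating eligibility traces that such a framework builds on (the parameter values are placeholders, not the thesis's calibration):

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One tabular Sarsa(lambda) update with accumulating eligibility traces."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    E[s, a] += 1.0                                   # accumulate trace for (s, a)
    Q += alpha * delta * E                           # credit all recently visited pairs
    E *= gamma * lam                                 # decay all traces
    return Q, E
```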

    Reinforcement Learning approaches for Artificial Pancreas Control

    Get PDF
    People with type 1 diabetes are affected by a chronic deficiency of insulin secretion; as a consequence, insulin has to be continually self-administered to keep their blood glucose levels in check. In recent years, rapid technological advances in continuous glucose monitoring and insulin administration systems have allowed researchers to work on automated control methods for diabetes management, commonly referred to as the Artificial Pancreas. The development of control algorithms in this context is a very active research area. While traditional control approaches have been the main focus so far, Reinforcement Learning (RL) offers a compelling alternative framework that has not yet been thoroughly explored. This thesis investigates several RL approaches, based on the Sarsa(lambda) algorithm, on in silico patients, using the FDA-accepted UVa-Padova Type 1 Diabetes simulator. It discusses how the overall representation of the problem affects the performance of the system, underlining how each component fits into the general framework proposed and evaluating the pros and cons of each method. Particular emphasis is also placed on the interpretability of both the training process and the final policies obtained. Experimental results demonstrate that classic RL methods have the potential to be a viable future approach for achieving proper control and a good degree of personalization in glycemic regulation for diabetes management.

    Automated Reinforcement Learning: An Overview

    Get PDF
    Reinforcement Learning (RL) and, more recently, Deep Reinforcement Learning are popular methods for solving sequential decision-making problems modeled as Markov Decision Processes (MDPs). RL modeling of a problem and the selection of algorithms and hyper-parameters require careful consideration, as different configurations may yield completely different performance. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields, where researchers and system designers are not RL experts. Besides, many modeling decisions, such as defining the state and action space, the size of batches and frequency of batch updates, and the number of timesteps, are typically made manually. For these reasons, automating the different components of the RL framework is of great importance, and it has attracted much attention in recent years. Automated RL provides a framework in which the different components of RL, including MDP modeling, algorithm selection and hyper-parameter optimization, are modeled and defined automatically. In this article, we explore the literature and present recent work that can be used in automated RL. Moreover, we discuss the challenges, open questions and research directions in AutoRL.
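    The overview covers several AutoRL components; as a minimal, generic illustration of one of them, hyper-parameter optimization by random search over an RL training routine is sketched below (the train_and_evaluate function and the search-space values are hypothetical):

```python
import random

SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size":    [32, 64, 128],
    "update_freq":   [1, 4, 8],   # how often a batch update is performed
}

def random_search(train_and_evaluate, n_trials=20, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        score = train_and_evaluate(**cfg)   # e.g. mean episodic return
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```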

    Value Function Estimation in Optimal Control via Takagi-Sugeno Models and Linear Programming

    Full text link
    The present thesis employs dynamic programming and reinforcement learning techniques in order to obtain optimal policies for controlling nonlinear systems with discrete and continuous states and actions. Initially, a review of the basic concepts of dynamic programming and reinforcement learning is carried out for systems with a finite number of states. After that, the extension of these techniques to systems with a large number of states, or with continuous states, is analysed using approximation functions. The contributions of the thesis are:
    - A combined identification/Q-function fitting methodology, which involves identification of a Takagi-Sugeno model, computation of (sub)optimal controllers from Linear Matrix Inequalities, and the subsequent data-based fitting of the Q-function via monotonic optimisation.
    - A methodology for learning controllers using approximate dynamic programming via linear programming (ADP-LP). The methodology allows the ADP-LP approach to work in practical control applications with continuous state and input spaces. It estimates a lower bound and an upper bound of the optimal value function through functional approximators, and guidelines are provided for data selection and regressor regularisation in order to obtain satisfactory results while avoiding unbounded or ill-conditioned solutions.
    - A methodology, under the linear-programming approach to approximate dynamic programming, for obtaining a better approximation of the optimal value function in a specific region of the state space. The methodology gradually learns a policy using data available only in the exploration region; exploration progressively enlarges the learning region until a converged policy is obtained.
    This work was supported by the National Department of Higher Education, Science, Technology and Innovation of Ecuador (SENESCYT), and by the Spanish Ministry of Economy and the European Union, grant DPI2016-81002-R (AEI/FEDER, UE). The author also received a grant for a predoctoral stay, Programa de Becas Iberoamérica-Santander Investigación 2018, of the Santander Bank.
    Díaz Iza, HP. (2020). Value Function Estimation in Optimal Control via Takagi-Sugeno Models and Linear Programming [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/139135
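    The thesis works with continuous states and functional approximators; purely as a toy illustration of the linear-programming view of dynamic programming it builds on, the sketch below solves the exact LP for a small discrete MDP (the two-state transition data are invented for the example):

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP: 2 states, 2 actions. P[a][s, s'] are transition probabilities,
# R[s, a] are immediate rewards; the numbers are illustrative only.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.0, 1.0]])]
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.95
n_states, n_actions = R.shape

# LP form of dynamic programming: minimise sum_s V(s) subject to
#   V(s) >= R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')   for every (s, a).
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[a][s].copy()
        row[s] -= 1.0                  # rewritten as gamma*P V - V(s) <= -R(s, a)
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * n_states)
print("Optimal value function:", res.x)
```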