5,328 research outputs found

    Enhancing Exploration and Safety in Deep Reinforcement Learning

    A Deep Reinforcement Learning (DRL) agent tries to learn a policy maximizing a long-term objective by trial and error in large state spaces. However, this learning paradigm requires a non-trivial number of interactions with the environment to achieve good performance. Moreover, critical applications, such as robotics, typically involve safety criteria to consider while designing novel DRL solutions. Hence, devising safe learning approaches with efficient exploration is crucial to avoid getting stuck in local optima, failing to learn properly, or causing damage to the surrounding environment. This thesis focuses on developing Deep Reinforcement Learning algorithms to foster efficient exploration and safer behaviors in simulation and real domains of interest, ranging from robotics to multi-agent systems. To this end, we rely both on standard benchmarks, such as SafetyGym, and on robotic tasks widely adopted in the literature (e.g., manipulation, navigation). This variety of problems is crucial to assess the statistical significance of our empirical studies and the generalization skills of our approaches. We initially benchmark the sample efficiency versus performance trade-off between value-based and policy-gradient algorithms. This part highlights the benefits of using non-standard simulation environments (i.e., Unity), which also facilitate the development of further optimizations for DRL. We also discuss the limitations of standard evaluation metrics (e.g., return) in characterizing the actual behaviors of a policy, proposing the use of Formal Verification (FV) as a practical methodology to evaluate behaviors over desired specifications. The second part introduces Evolutionary Algorithms (EAs) as a gradient-free complementary optimization strategy. In detail, we combine population-based and gradient-based DRL to diversify exploration and improve performance in both single- and multi-agent applications. For the latter, we discuss how prior Multi-Agent (Deep) Reinforcement Learning (MARL) approaches hinder exploration, proposing an architecture that favors cooperation without affecting exploration.

    Safe Deep Reinforcement Learning: Enhancing the Reliability of Intelligent Systems

    In the last few years, the impressive success of deep reinforcement learning (DRL) agents in a wide variety of applications has led to the adoption of these systems in safety-critical contexts (e.g., autonomous driving, robotics, and medical applications), where expensive hardware and human safety can be involved. In such contexts, an intelligent learning agent must adhere to certain requirements that go beyond the simple accomplishment of the task and typically include constraints on the agent's behavior. Against this background, this thesis proposes a set of training and validation methodologies that constitute a unified pipeline to generate safe and reliable DRL agents. In the first part of this dissertation, we focus on the problem of constrained DRL, leaving the challenging problem of the formal verification of deep neural networks for the second part of this work. For humans, the help of a mentor while growing up is crucial to learn effective strategies to solve a problem, whereas a learning process driven only by trial and error usually leads to unsafe and inefficient solutions. Similarly, a pure end-to-end deep reinforcement learning approach often results in suboptimal policies, which typically translates into unpredictable, and thus unreliable, behaviors. Following this intuition, we propose to impose a set of constraints on the DRL loop to guide the training process. These requirements, which typically encode domain expert knowledge, can be seen as suggestions that the agent should follow but is allowed to ignore occasionally if doing so helps maximize the reward signal. A foundational requirement for our work is finding a proper strategy to define and formally encode these constraints (which we refer to as rules). In this thesis, we propose to exploit a formal language inherited from the software engineering community: scenario-based programming (SBP). For the actual training, we rely on the constrained reinforcement learning paradigm, proposing an extended version of the Lagrangian PPO algorithm. Returning to the parallel with human beings: before being authorized to perform safety-critical operations, we must obtain a certification (e.g., a license to drive a car or a degree to perform medical operations). In the second part of this dissertation, we apply this concept in a deep reinforcement learning context, where the intelligent agents are controlled by artificial neural networks. In particular, we propose to perform a model selection phase after training to find models that formally respect some given safety requirements before deployment. However, DNNs have long been considered unpredictable black boxes and thus unsuitable for safety-critical contexts. Against this background, we build upon the emerging field of formal verification for neural networks to extend state-of-the-art approaches to robotic decision-making contexts. We propose "ProVe", a verification tool for decision-making DNNs that quantifies the probability of violating the specified requirements. In the last chapter of this thesis, we provide a complete case study on a popular robotic problem: "mapless navigation". Here, we show a concrete example of the application of our pipeline, from the definition of the requirements through training to the final formal verification phase, to obtain a provably safe and effective agent.

    Charging Games in Networks of Electrical Vehicles

    In this paper, a static non-cooperative game formulation of the problem of distributed charging in electrical vehicle (EV) networks is proposed. This formulation allows one to model the interaction between several EVs which are connected to a common residential distribution transformer. Each EV aims at choosing the time at which it starts charging its battery in order to minimize an individual cost which is mainly related to the total power delivered by the transformer, the location of the time interval over which the charging operation is performed, and the charging duration needed for the considered EV to have its battery fully recharged. As individual cost functions are assumed to be memoryless, it is possible to show that the game of interest is always an ordinal potential game. More precisely, both atomic and nonatomic versions of the charging game are considered. In both cases, equilibrium analysis is conducted. In particular, important issues such as equilibrium uniqueness and efficiency are tackled. Interestingly, both analytical and numerical results show that the efficiency loss due to decentralization induced by charging (e.g., for cost functions such as distribution network Joule losses or the life of residential distribution transformers when no thermal inertia is assumed) is small, and the corresponding "efficiency", a notion close to the Price of Anarchy, tends to one when the number of EVs increases. Comment: 8 pages, 4 figures, keywords: Charging games - electrical vehicle - distribution networks - potential games - Nash equilibrium - price of anarch
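
As a rough illustration of the start-time game described in this abstract, the sketch below runs best-response dynamics on a toy atomic version. It is not the paper's model: the cost used here (the sum of the transformer load over an EV's own charging slots), the unit power per EV, the common charging duration, and all function names are assumptions made for the example. With this simplified cost the game is a congestion game, so each strict improvement lowers a potential function and the dynamics stop at a pure Nash equilibrium.

```python
from typing import List


def loads(starts: List[int], duration: int, horizon: int) -> List[int]:
    """Number of EVs charging in each time slot (unit power per EV, assumed)."""
    load = [0] * horizon
    for s in starts:
        for t in range(s, s + duration):
            load[t] += 1
    return load


def cost(i: int, starts: List[int], duration: int, horizon: int) -> int:
    """Assumed individual cost: total load summed over EV i's charging slots."""
    load = loads(starts, duration, horizon)
    return sum(load[t] for t in range(starts[i], starts[i] + duration))


def is_equilibrium(starts: List[int], duration: int, horizon: int) -> bool:
    """True if no EV can strictly lower its cost by changing its start time."""
    for i in range(len(starts)):
        current = cost(i, starts, duration, horizon)
        for s in range(horizon - duration + 1):
            trial = starts[:i] + [s] + starts[i + 1:]
            if cost(i, trial, duration, horizon) < current:
                return False
    return True


def best_response_dynamics(n_evs: int, duration: int, horizon: int,
                           max_rounds: int = 100) -> List[int]:
    """Let each EV in turn switch to a strictly better start time until none can."""
    starts = [0] * n_evs  # everyone initially plugs in at slot 0
    for _ in range(max_rounds):
        changed = False
        for i in range(n_evs):
            best_s = starts[i]
            best_c = cost(i, starts, duration, horizon)
            for s in range(horizon - duration + 1):
                trial = starts[:i] + [s] + starts[i + 1:]
                c = cost(i, trial, duration, horizon)
                if c < best_c:  # only strict improvements count
                    best_s, best_c = s, c
            if best_s != starts[i]:
                starts[i] = best_s
                changed = True
        if not changed:
            break
    return starts


if __name__ == "__main__":
    profile = best_response_dynamics(n_evs=4, duration=2, horizon=8)
    print("start times:", profile)
    print("load per slot:", loads(profile, 2, 8))
```

The stopping rule (no EV changes its start time in a full round) is exactly the Nash condition checked by `is_equilibrium`, which is why the potential-game property matters: it rules out cycling under strict improvements.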

    Stability Analysis for Autonomous Vehicle Navigation Trained over Deep Deterministic Policy Gradient

    The Deep Deterministic Policy Gradient (DDPG) algorithm is a reinforcement learning algorithm that combines Q-learning with a policy. Nevertheless, this algorithm generates failures that are not well understood. Rather than looking for those errors, this study presents a way to evaluate the suitability of the results obtained. Taking autonomous vehicle navigation as the application, the DDPG algorithm is applied, obtaining an agent capable of generating trajectories. This agent is evaluated in terms of stability through the Lyapunov function, verifying whether the proposed navigation objectives are achieved. The reward function of the DDPG is used because it is unknown whether the neural networks of the actor and the critic are correctly trained. Two agents are obtained, and a comparison is performed between them in terms of stability, demonstrating that the Lyapunov function can be used as an evaluation method for agents obtained by the DDPG algorithm. By verifying the stability at a fixed future horizon, it is possible to determine whether the obtained agent is valid and can be used as a vehicle controller, so a task-satisfaction assessment can be performed. Furthermore, the proposed analysis indicates which parts of the navigation area are insufficiently covered by training. The current study has been sponsored by the Government of the Basque Country-ELKARTEK21/10 KK-2021/00014 research program “Estudio de nuevas técnicas de inteligencia artificial basadas en Deep Learning dirigidas a la optimización de procesos industrials”

    On iterated learning for task-oriented dialogue

    In task-oriented dialogue, pretraining on a human corpus followed by finetuning in a simulator using self-play suffers from a phenomenon called language drift. The syntactic and semantic properties of the learned language deteriorate as the agents focus only on solving the task. Inspired by the iterated learning framework in cognitive science (Kirby and Griffiths, 2014), we propose a generic approach to counter language drift called Seeded iterated learning (SIL). This work was published as (Lu et al., 2020b) and is presented in Chapter 2. In an attempt to emulate the transmission of language between generations, a pretrained student agent is iteratively refined by imitating data sampled from a newly trained teacher agent. At each generation, the teacher is created by copying the student agent, before being finetuned to maximize task completion. We further introduce Supervised Seeded iterated learning (SSIL) in Chapter 3, work which was published as (Lu et al., 2020a). SSIL builds upon SIL by combining it with another popular method called Supervised SelfPlay (S2P) (Gupta et al., 2019). SSIL is able to mitigate the problems of both S2P and SIL, namely late-stage training collapse and low language diversity. We evaluate our methods first as a proof of concept in the toy setting of the Lewis Game with a synthetic language, and then scale them up to a translation game with natural language. In both settings, we highlight the efficacy of our methods compared to the baselines. In Chapter 1, we discuss the core concepts required for understanding the papers presented in Chapters 2 and 3. We describe the specific problem of task-oriented dialogue, including current approaches and the challenges they face, particularly language drift. We also give an overview of the iterated learning framework. Some sections in Chapter 1 are borrowed from the papers for coherence and ease of understanding. Chapter 2 comprises the work published as (Lu et al., 2020b) and Chapter 3 comprises the work published as (Lu et al., 2020a). Chapter 4 concludes the work.
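
The generational loop described in this abstract (copy the student to form a teacher, finetune the teacher on the task, sample data from it, have the student imitate that data) can be sketched as follows. This is only a structural outline under stated assumptions: the function names, the callable-based interface, and the scalar "agents" used in the toy stand-ins are all hypothetical and do not come from the thesis, where the agents are language models.

```python
import copy
from typing import Callable, List, TypeVar

Agent = TypeVar("Agent")


def seeded_iterated_learning(
    student: Agent,
    finetune_on_task: Callable[[Agent], Agent],
    sample_data: Callable[[Agent], List],
    imitate: Callable[[Agent, List], Agent],
    generations: int,
) -> Agent:
    """Run the teacher/student loop for a fixed number of generations."""
    for _ in range(generations):
        # The teacher starts as a copy of the current student ...
        teacher = copy.deepcopy(student)
        # ... and is finetuned to maximize task completion.
        teacher = finetune_on_task(teacher)
        # The student is then refined by imitating data sampled from the teacher.
        data = sample_data(teacher)
        student = imitate(student, data)
    return student


# Toy stand-ins (hypothetical): an "agent" is a single float.
def toy_finetune(agent: float) -> float:
    """Pretend task finetuning shifts the agent by a fixed amount."""
    return agent + 1.0


def toy_sample(agent: float) -> List[float]:
    """Pretend sampling returns a few copies of the teacher's value."""
    return [agent] * 4


def toy_imitate(agent: float, data: List[float]) -> float:
    """Move the student halfway toward the mean of the teacher's samples."""
    target = sum(data) / len(data)
    return agent + 0.5 * (target - agent)


if __name__ == "__main__":
    final = seeded_iterated_learning(0.0, toy_finetune, toy_sample, toy_imitate, 3)
    print("student after 3 generations:", final)
```

In the toy run the student only partially follows each teacher, which mirrors the idea that imitation passes on the teacher's task gains without the student inheriting the teacher's full drift.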

    Distributed Multi-Robot Learning using Particle Swarm Optimization

    This thesis studies the automatic design and optimization of high-performing robust controllers for mobile robots using exclusively on-board resources. Due to the often large parameter space and noisy performance metrics, this constitutes an expensive optimization problem. Population-based learning techniques have been proven to be effective in dealing with noise and are thus promising tools to approach this problem. We focus this research on the Particle Swarm Optimization (PSO) algorithm, which, in addition to dealing with noise, allows a distributed implementation, speeding up the optimization process and adding robustness to failure of individual agents. In this thesis, we systematically analyze the different variables that affect the learning process for a multi-robot obstacle avoidance benchmark. These variables include algorithmic parameters, controller architecture, and learning and testing environments. The analysis is performed on experimental setups of increasing evaluation time and complexity: numerical benchmark functions, high-fidelity simulations, and experiments with real robots. Based on this analysis, we apply the distributed PSO framework to learn a more complex, collaborative task: flocking. This attempt to learn a collaborative task in a distributed manner on a large parameter space is, to our knowledge, the first of its kind. In addition, we address the problem of noisy performance evaluations encountered in these robotic tasks and present a distributed PSO algorithm for dealing with noise suitable for resource-constrained mobile robots due to its low requirements in terms of memory and limited local communication

    In Pursuit of Desirable Equilibria in Large Scale Networked Systems

    This thesis addresses an interdisciplinary problem in the context of engineering, computer science and economics: In a large scale networked system, how can we achieve a desirable equilibrium that benefits the system as a whole? We approach this question from two perspectives. On the one hand, given a system architecture that imposes certain constraints, a system designer must propose efficient algorithms to optimally allocate resources to the agents that desire them. On the other hand, given algorithms that are used in practice, a performance analyst must come up with tools that can characterize these algorithms and determine when they can be optimally applied. Ideally, the two viewpoints must be integrated to obtain a simple system design with efficient algorithms that apply to it. We study the design of incentives and algorithms in such large scale networked systems under three application settings, referred to herein via the subheadings: Incentivizing Sharing in Realtime D2D Networks: A Mean Field Games Perspective, Energy Coupon: A Mean Field Game Perspective on Demand Response in Smart Grids, Dynamic Adaptability Properties of Caching Algorithms, and Accuracy vs. Learning Rate of Multi-level Caching Algorithms. Our application scenarios all entail an asymptotic system scaling, and an equilibrium is defined in terms of a probability distribution over system states. The question in each case is to determine how to attain a probability distribution that possesses certain desirable properties. For the first two applications, we consider the design of specific mechanisms to steer the system toward a desirable equilibrium under self interested decision making. The environments in these problems are such that there is a set of shared resources, and a mechanism is used during each time step to allocate resources to agents that are selfish and interact via a repeated game. 
These models are motivated by resource sharing systems in the context of data communication, transportation, and power transmission networks. The objective is to ensure that the achieved equilibria are socially desirable. Formally, we show that a Mean Field Game can be used to accurately approximate these repeated game frameworks, and we describe mechanisms under which socially desirable Mean Field Equilibria exist. For the third application, we focus on performance analysis via new metrics to determine the value of the attained equilibrium distribution of cache contents when using different replacement algorithms in cache networks. The work is motivated by the fact that typical performance analysis of caching algorithms consists of determining hit probability under a fixed arrival process of requests, which does not account for dynamic variability of request arrivals. Our main contribution is to define a function which accounts for both the error due to time lag of learning the items' popularity, as well as error due to the inaccuracy of learning, and to characterize the tradeoff between the two that conventional algorithms achieve. We then use the insights gained in this exercise to design new algorithms that are demonstrably superior