    CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

    Off-policy temporal difference (TD) methods are a powerful class of reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD algorithms are not commonly used in combination with feature normalization techniques, despite the positive effects of normalization in other domains. We show that naive application of existing normalization techniques is indeed not effective, but that well-designed normalization improves optimization stability and removes the need for target networks. In particular, we introduce a normalization based on a mixture of on- and off-policy transitions, which we call cross-normalization. It can be regarded as an extension of batch normalization that re-centers data for the two different distributions present in off-policy learning. Applied to DDPG and TD3, cross-normalization improves over the state of the art across a range of MuJoCo benchmark tasks.
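The idea of sharing normalization statistics across two distributions can be illustrated with a minimal sketch. It assumes that the two feature batches are simply concatenated with equal weight and that both are re-centered and re-scaled with the joint mean and variance. The function name `cross_normalize`, the equal weighting, and the variance scaling are assumptions made for illustration and may differ from the paper's exact formulation.

```python
import math

def cross_normalize(off_policy, on_policy, eps=1e-5):
    """Normalize two feature batches with statistics shared across both.

    off_policy, on_policy: lists of equal-length feature vectors.
    Assumption: both batches are mixed with equal weight.
    """
    mixed = off_policy + on_policy
    n, dim = len(mixed), len(mixed[0])
    # Per-feature mean and variance over the mixture of both batches.
    mean = [sum(row[j] for row in mixed) / n for j in range(dim)]
    var = [sum((row[j] - mean[j]) ** 2 for row in mixed) / n for j in range(dim)]
    scale = [1.0 / math.sqrt(v + eps) for v in var]

    def apply(batch):
        return [[(row[j] - mean[j]) * scale[j] for j in range(dim)] for row in batch]

    # Both batches are transformed with the same shared statistics.
    return apply(off_policy), apply(on_policy)

# Hypothetical features for replay-buffer and current-policy transitions.
off_batch = [[0.0, 1.0], [2.0, 3.0]]
on_batch = [[4.0, 5.0], [6.0, 7.0]]
norm_off, norm_on = cross_normalize(off_batch, on_batch)
```

Because the statistics come from the mixture, neither batch is zero-mean on its own; only their union is.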

    Stabilizing Q Learning Via Soft Mellowmax Operator

    Learning complicated value functions in high-dimensional state spaces by function approximation is a challenging task, partly because the max operator used in temporal difference updates can theoretically cause instability for most linear or non-linear approximation schemes. Mellowmax is a recently proposed differentiable, non-expansion softmax operator that allows convergent behavior in learning and planning. Unfortunately, the performance bound for the fixed point it converges to remains unclear, and in practice its parameter is sensitive to the domain and has to be tuned case by case. Finally, the Mellowmax operator may suffer from oversmoothing, as it ignores the probability of each action being taken when aggregating them. In this paper, we address all of the above issues with an enhanced Mellowmax operator, named SM2 (Soft Mellowmax). In particular, the proposed operator is reliable, easy to implement, and has a provable performance guarantee, while preserving all the advantages of Mellowmax. Furthermore, we show that our SM2 operator can be applied to challenging multi-agent reinforcement learning scenarios, leading to stable value function approximation and state-of-the-art performance.
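For readers unfamiliar with the baseline operator, the following minimal sketch shows the original Mellowmax, mm_ω(x) = log(mean(exp(ω·x))) / ω, and how its parameter interpolates between the mean and the max. It does not implement SM2; the function name and the example values are illustrative assumptions.

```python
import math

def mellowmax(values, omega):
    """Mellowmax: log(mean(exp(omega * x_i))) / omega, for omega > 0.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(values)
    total = sum(math.exp(omega * (v - m)) for v in values)
    return m + math.log(total / len(values)) / omega

q = [1.0, 2.0, 3.0]          # hypothetical action values
small = mellowmax(q, 0.01)   # small omega: close to the mean of q
large = mellowmax(q, 100.0)  # large omega: close to the max of q
```

The gap between the two results illustrates why the parameter typically needs tuning: a small ω averages over all actions regardless of their quality, while a large ω approaches the hard max.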

    Sample efficiency, transfer learning and interpretability for deep reinforcement learning

    Deep learning has revolutionised artificial intelligence, where the application of increased compute to train neural networks on large datasets has resulted in improvements in real-world applications such as object detection, text-to-speech synthesis and machine translation. Deep reinforcement learning (DRL) has similarly shown impressive results in board and video games, but less so in real-world applications such as robotic control. To address this, I have investigated three factors prohibiting further deployment of DRL: sample efficiency, transfer learning, and interpretability. To decrease the amount of data needed to train DRL systems, I have explored various storage strategies and exploration policies for episodic control (EC) algorithms, resulting in the application of online clustering to improve the memory efficiency of EC algorithms, and the maximum entropy mellowmax policy for improving the sample efficiency and final performance of the same EC algorithms. To improve performance during transfer learning, I have shown that a multi-headed neural network architecture trained using hierarchical reinforcement learning can retain the benefits of positive transfer between tasks while mitigating the interference effects of negative transfer. I additionally investigated the use of multi-headed architectures to reduce catastrophic forgetting under the continual learning setting. While the use of multiple heads worked well within a simple environment, it was of limited use within a more complex domain, indicating that this strategy does not scale well. Finally, I applied a wide range of quantitative and qualitative techniques to better interpret trained DRL agents. In particular, I compared the effects of training DRL agents both with and without visual domain randomisation (DR), a popular technique to achieve simulation-to-real transfer, providing a series of tests that can be applied before real-world deployment. 
One of the major findings is that DR produces more entangled representations within trained DRL agents, indicating quantitatively that they are invariant to nuisance factors associated with the DR process. Additionally, while my environment allowed agents trained without DR to succeed without requiring complex recurrent processing, all agents trained with DR appear to integrate information over time, as evidenced through ablations on the recurrent state.

    On choice models in the context of MDPs

    This thesis delves into choice models, which are distributions on sets of alternatives. Choice models on Markov decision processes (MDPs) can break down very large alternative spaces into step-by-step procedures designed not only to tackle the curse of dimensionality but also to better reflect the underlying dynamics. The first part is devoted to travel time estimation as part of path choice modeling. Path choice models are choice models on the set of paths, used to model traffic flow. Intuitively, travel time is one of the more important features when choosing paths, yet travel times are not always known. In contrast, the classical setting assumes that these two steps are sequential, as arc travel times are part of the input of the path choice estimation process. Yet the intricate interdependences mean that this path choice model can complement any observation when estimating travel times. We build a statistical model for travel time estimation and propose marginalizing the unobserved features. Using these ideas, we show that we are able to learn path choice models without observing actual paths and at different granularities. The second part focuses on the failings of regularized MDPs and how regularization may have unexpected side effects, such as divergence in stochastic shortest paths or unreasonably large value functions. Regularized MDPs are nothing but an application of choice models to MDPs. 
They are used in reinforcement learning (RL) to obtain, among other things, a choice model on possible trajectories for inverse reinforcement learning, to transfer prior knowledge to the model, or to obtain policies that exploit all goals in the environment. These side effects are exacerbated in state-dependent action spaces. As a mitigation, we introduce two potential transformations, and we benchmark their performance on a drug design problem.
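The kind of side effect described above can be illustrated with a toy example that is not taken from the thesis. It assumes an entropy-regularized (log-sum-exp) Bellman backup on a single non-terminal state with one exit action and several identical self-loop actions. All names, costs, and the backup form are assumptions chosen for illustration.

```python
import math

def soft_value_iteration(num_loops, loop_cost, exit_cost=1.0, iters=200):
    """Iterate V <- log(exp(-exit_cost) + num_loops * exp(-loop_cost + V)).

    One state, one action leading to a terminal state (value 0), and
    num_loops self-loop actions. The iteration has a finite fixed point
    only if num_loops * exp(-loop_cost) < 1.
    """
    v = 0.0
    for _ in range(iters):
        v = math.log(math.exp(-exit_cost) + num_loops * math.exp(-loop_cost + v))
    return v

few = soft_value_iteration(num_loops=1, loop_cost=1.0)   # converges
many = soft_value_iteration(num_loops=5, loop_cost=1.0)  # keeps growing
```

In this sketch every reward is negative, yet adding more self-loop actions makes the regularized value grow without bound. This is one way a larger action set in a given state can worsen the behavior of the regularized problem.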

    Regularized Softmax Deep Multi-Agent Q-Learning

    Tackling overestimation in Q-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from more severe overestimation in practice than previously acknowledged, and that this overestimation is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline, and we demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent Q-Learning, is general and can be applied to any Q-learning-based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
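To make the role of the softmax operator concrete, here is a minimal single-agent sketch. It assumes the standard Boltzmann softmax operator, which weights each action value by its softmax probability, so the resulting target never exceeds the hard max. The paper's multi-agent approximation and its regularizer are not reproduced here; `softmax_operator` and the example values are illustrative assumptions.

```python
import math

def softmax_operator(q_values, beta):
    """Boltzmann softmax operator: sum_a softmax(beta * Q)_a * Q_a.

    beta = 0 gives the mean of the values; large beta approaches the max.
    """
    m = max(q_values)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (q - m)) for q in q_values]
    z = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / z

q = [1.0, 2.0, 5.0]  # hypothetical action values with one outlier
soft_target = softmax_operator(q, beta=1.0)
hard_target = max(q)
```

Because the soft target is a weighted average, an overestimated outlier contributes less to the bootstrap target than it would under the max.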

    Learning and planning with noise in optimization and reinforcement learning

    Most modern machine learning algorithms incorporate a degree of randomness in their processes, which we will refer to as noise, and which can ultimately impact the model's predictions. In this thesis, we take a closer look at learning and planning in the presence of noise for reinforcement learning and optimization algorithms. The first two articles presented in this document focus on reinforcement learning in an unknown environment, specifically how we can design algorithms that use the stochasticity of their policy and of the environment to their advantage. Our first contribution focuses on the unsupervised reinforcement learning setting. We show how an agent left alone in an unknown world without any specified goal can learn which aspects of the environment it can control independently from each other, and jointly learn a disentangled latent representation of these aspects, or factors of variation. The second contribution focuses on planning in continuous control tasks. By framing reinforcement learning as an inference problem, we borrow tools from the sequential Monte Carlo literature to design a theoretically grounded and efficient algorithm for probabilistic planning using a learned model of the world. 
We show how the agent can leverage the uncertainty of the model to imagine a diverse set of solutions. The following two contributions analyze the impact of gradient noise due to sampling in optimization algorithms. The third contribution examines the role of gradient noise in maximum likelihood estimation with stochastic gradient descent, exploring how the structure of the gradient noise and the local curvature relate to the generalization and convergence speed of the model. Our fourth contribution returns to the topic of reinforcement learning to analyze the impact of sampling noise on the policy gradient algorithm. We find that sampling noise can significantly impact the optimization dynamics and the policies discovered in on-policy reinforcement learning.

    Modelling variations in human learning in probabilistic decision-making tasks

    This thesis focused on evaluating the capacity of models of human learning to encapsulate the action choices of a range of individuals performing probabilistic decision-making tasks. To do so, an extensible evaluation framework, Tinker Taylor py (TTpy), was developed in Python, allowing models to be compared like-for-like across a range of tasks. TTpy allows models, tasks and fitting methods to be added or replaced without affecting the other parts of the simulation and fitting process. Models were drawn from the reinforcement learning literature along with a few similarly structured Bayesian learning models. The fitting assumed that the same model was used throughout a task to make all the choices. Using TTpy, significant uncertainty was found in parameter recovery for short, simple tasks across a range of models. This was traced back to significant overlap in the action sequences plausibly produced by different combinations of parameters. Replacing softmax with epsilon-greedy as the way of calculating the action choice probabilities was found to improve parameter recovery in simulated data. Datasets from three existing unpublished probabilistic decision-making tasks were examined. These datasets were chosen because they contained information on extraversion for all their participants, their tasks were well established, and the tasks had a gains-only promotion focus. Only one of the three tasks provided models where most of the model participant fits had strong evidence that they were better fits than uniform random action choices. In light of the difficulties in parameter recovery for individual participants, the unusual step was taken of averaging the recovered parameters across a subset of the best performing and most consistently recovered models within the same family. A significant correlation was found between this learning rate parameter and the participant extraversion measure when the softmax parameter variance was taken into account.
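The two ways of turning learned action values into choice probabilities that are compared above can be sketched as follows. This is a minimal illustration that does not use the TTpy API; the function names, parameter names, and example values are assumptions.

```python
import math

def softmax_probs(values, beta):
    """Softmax choice rule: probability proportional to exp(beta * value)."""
    m = max(values)
    weights = [math.exp(beta * (v - m)) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

def epsilon_greedy_probs(values, epsilon):
    """Epsilon-greedy choice rule: explore uniformly with probability epsilon,
    otherwise pick a best action (ties split equally)."""
    n = len(values)
    best = max(values)
    n_best = sum(1 for v in values if v == best)
    return [
        epsilon / n + ((1.0 - epsilon) / n_best if v == best else 0.0)
        for v in values
    ]

values = [0.2, 0.8]  # hypothetical expected values for two actions
p_soft = softmax_probs(values, beta=5.0)
p_eps = epsilon_greedy_probs(values, epsilon=0.1)
```

Under softmax the probabilities depend on the product of the sensitivity parameter and the value differences, so different parameter combinations can produce similar choice sequences. Under epsilon-greedy they depend only on which action currently has the highest value.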