
    SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics

    Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. When deployed with maximum entropy RL, this formulation leads to a safe adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show that the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
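
    To make the repulsion term above concrete, here is a minimal sketch, not the paper's implementation: it assumes diagonal Gaussian policies, uses the KL divergence between the agent's and the adversary's policies as the repulsion bonus in a SAC-style actor objective, and all function names and coefficient values (alpha, beta) are hypothetical.

        import numpy as np

        def gaussian_kl(mu_p, std_p, mu_q, std_q):
            # KL divergence KL(p || q) between two diagonal Gaussian
            # policies, summed over action dimensions.
            return np.sum(
                np.log(std_q / std_p)
                + (std_p**2 + (mu_p - mu_q) ** 2) / (2.0 * std_q**2)
                - 0.5
            )

        def saac_actor_objective(q_value, entropy, kl_to_adversary,
                                 alpha=0.2, beta=1.0):
            # SAC-style actor objective with a repulsion bonus: maximize
            # Q plus an entropy term plus divergence from the adversary's
            # (unsafe) policy. alpha and beta are illustrative weights,
            # not values from the paper.
            return q_value + alpha * entropy + beta * kl_to_adversary

        # Toy usage: the further the agent's policy sits from the
        # adversary's action mode, the larger the objective.
        kl = gaussian_kl(np.array([0.1]), np.array([0.3]),
                         np.array([0.8]), np.array([0.3]))
        print(saac_actor_objective(q_value=1.5, entropy=0.9, kl_to_adversary=kl))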

    Robust and Efficient Planning using Adaptive Entropy Tree Search

    In this paper, we present the Adaptive Entropy Tree Search (ANTS) algorithm. ANTS builds on recent successes of maximum entropy planning while mitigating its arguably biggest drawback: sensitivity to the temperature setting. We endow ANTS with a mechanism that adapts the temperature to match a given range of action-selection entropy in the nodes of the planning tree. With this mechanism, the ANTS planner enjoys remarkable hyper-parameter robustness, achieves high scores on the Atari benchmark, and is a capable component of a planning-learning loop akin to AlphaZero. We believe that all these features make ANTS a compelling choice as a general planner for complex tasks.
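
    The temperature-adaptation idea can be illustrated with a small sketch, under the assumption that action selection at a node is a Boltzmann distribution over Q-values; the multiplicative update rule and the rate constant below are hypothetical stand-ins for the paper's mechanism.

        import numpy as np

        def softmax_entropy(q_values, temperature):
            # Entropy of the Boltzmann action-selection distribution
            # over the Q-values at a node.
            logits = np.asarray(q_values, dtype=float) / temperature
            logits -= logits.max()                 # numerical stability
            p = np.exp(logits) / np.exp(logits).sum()
            return -np.sum(p * np.log(p + 1e-12))

        def adapt_temperature(q_values, temperature, h_low, h_high, rate=1.05):
            # Hypothetical adaptation rule: nudge the temperature up when
            # the selection entropy falls below the target range (policy
            # too greedy) and down when it exceeds it (too uniform).
            h = softmax_entropy(q_values, temperature)
            if h < h_low:
                temperature *= rate
            elif h > h_high:
                temperature /= rate
            return temperature

        # Toy usage: repeated calls raise the temperature until the
        # selection entropy enters the target band [0.8, 1.0].
        tau = 0.3
        for _ in range(100):
            tau = adapt_temperature([1.0, 0.2, -0.5], tau, h_low=0.8, h_high=1.0)
        print(tau)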

    Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization

    Exploration remains a key challenge in deep reinforcement learning (RL). Optimism in the face of uncertainty is a well-known heuristic with theoretical guarantees in the tabular setting, but how best to translate the principle to deep reinforcement learning, which involves online stochastic gradients and deep network function approximators, is not fully understood. In this paper we propose a new, differentiable optimistic objective that, when optimized, yields a policy that provably explores efficiently, with guarantees even under function approximation. Our new objective is a zero-sum two-player game derived from endowing the agent with an epistemic-risk-seeking utility function, which converts uncertainty into value and encourages the agent to explore uncertain states. We show that the solution to this game minimizes an upper bound on the regret, with the 'players' each attempting to minimize one component of a particular regret decomposition. We derive a new model-free algorithm, 'epistemic-risk-seeking actor-critic' (ERSAC), which is simply an application of simultaneous stochastic gradient ascent-descent to the game. Finally, we discuss a recipe for incorporating off-policy data and show that combining the risk-seeking objective with replay data yields a double benefit in terms of statistical efficiency. We conclude with results showing good performance of a deep RL agent using the technique on the challenging 'DeepSea' environment, with significant performance improvements even over other efficient exploration techniques, as well as improved performance on the Atari benchmark.
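
    One standard way to realize a risk-seeking utility of the kind described above is the exponential (entropic) utility over an ensemble of Q-estimates, whose spread stands in for epistemic uncertainty. The sketch below illustrates that general principle, not the ERSAC objective itself; tau is an illustrative risk parameter.

        import numpy as np

        def risk_seeking_utility(q_samples, tau):
            # Exponential risk-seeking utility over an ensemble of
            # Q-estimates: u = (1/tau) * log mean(exp(tau * Q)). For
            # tau > 0 this upper-bounds the ensemble mean and grows with
            # epistemic spread, so uncertain actions look more valuable.
            # Computed via log-sum-exp for numerical stability.
            q = tau * np.asarray(q_samples, dtype=float)
            m = q.max()
            return (m + np.log(np.mean(np.exp(q - m)))) / tau

        # Two actions with equal mean value; the uncertain one scores
        # higher under the risk-seeking utility.
        certain = [1.0, 1.0, 1.0, 1.0]
        uncertain = [0.0, 2.0, 0.5, 1.5]
        print(risk_seeking_utility(certain, tau=1.0))    # = 1.0
        print(risk_seeking_utility(uncertain, tau=1.0))  # > 1.0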

    Reinforcement learning control of a biomechanical model of the upper extremity

    Among the infinite number of possible movements that can be produced, humans are commonly assumed to choose those that optimize criteria such as minimizing movement time, subject to certain movement constraints like signal-dependent and constant motor noise. While these assumptions have so far only been evaluated for simplified point-mass or planar models, we address the question of whether they can predict reaching movements in a full skeletal model of the human upper extremity. We learn a control policy using a motor babbling approach as implemented in reinforcement learning, using aimed movements of the tip of the right index finger towards randomly placed 3D targets of varying size. We use a state-of-the-art biomechanical model, which includes seven actuated degrees of freedom. To deal with the curse of dimensionality, we use a simplified second-order muscle model acting at each degree of freedom instead of individual muscles. The results confirm that the assumptions of signal-dependent and constant motor noise, together with the objective of movement time minimization, are sufficient for a state-of-the-art skeletal model of the human upper extremity to reproduce complex phenomena of human movement, in particular Fitts' Law and the 2/3 Power Law. This result supports the notion that control of the complex human biomechanical system can plausibly be determined by a set of simple assumptions and can easily be learned.
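
    The two noise assumptions and the Fitts' Law regularity lend themselves to a short sketch; the noise magnitudes and Fitts coefficients below are illustrative placeholders, not values from the paper, and the 2/3 Power Law is omitted for brevity.

        import numpy as np

        rng = np.random.default_rng(0)

        def noisy_control(u, sigma_sdn=0.1, sigma_const=0.01):
            # Motor noise model of the kind assumed in the abstract:
            # signal-dependent noise whose standard deviation scales with
            # the magnitude of the control signal, plus a small
            # constant-noise floor.
            u = np.asarray(u, dtype=float)
            signal_dependent = sigma_sdn * np.abs(u) * rng.standard_normal(u.shape)
            constant = sigma_const * rng.standard_normal(u.shape)
            return u + signal_dependent + constant

        def fitts_movement_time(distance, width, a=0.1, b=0.15):
            # Fitts' Law: movement time grows with the index of
            # difficulty ID = log2(2D / W). a and b are illustrative
            # regression coefficients.
            return a + b * np.log2(2.0 * distance / width)

        print(noisy_control([0.5, 2.0]))
        print(fitts_movement_time(distance=0.3, width=0.02))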

    Guide to the Maximum Entropy species distribution model: a case study of the yellow-naped parrot Amazona auropalliata in El Salvador

    The aim of this work is to offer a guide to the analysis and interpretation of the MaxEnt model, including the quality requirements for generating solid results, and to provide researchers with the key elements of this powerful ecological tool in order to improve the conservation and management of the biological diversity of El Salvador. To that end, a potential distribution model of Amazona auropalliata, a species cataloged as in danger of extinction in the country, was built. The model had an AUC (Area Under the Curve) value of 0.856, which is considered reliable. The variables contributing most to the model were the mean temperature of the wettest month, the precipitation of the warmest four-month period, and the precipitation of the driest period. According to the model, the potential distribution of the species occurs mainly in the departments of San Salvador, Santa Ana, Ahuachapán, Sonsonate, Usulután and La Libertad. Finally, based on statistical analysis, a bioclimatic profile of the species was constructed, which will facilitate future studies, including studies of the effects of climate change.
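
    The AUC statistic the model is judged on can be computed directly from presence and background suitability scores. The following rank-based sketch shows that computation; the toy scores are purely illustrative and not data from the study.

        import numpy as np

        def auc_from_scores(presence_scores, background_scores):
            # Rank-based AUC (Mann-Whitney U statistic): the probability
            # that a random presence point receives a higher suitability
            # score than a random background point. 0.5 is chance level,
            # values near 1 indicate strong discrimination, and a value
            # around 0.856 (as in the abstract) is usually read as a
            # reliable model.
            p = np.asarray(presence_scores, dtype=float)
            b = np.asarray(background_scores, dtype=float)
            greater = (p[:, None] > b[None, :]).sum()
            ties = (p[:, None] == b[None, :]).sum()
            return (greater + 0.5 * ties) / (p.size * b.size)

        # Toy scores: presence sites tend to score higher than background.
        print(auc_from_scores([0.9, 0.7, 0.8, 0.6], [0.4, 0.5, 0.65, 0.3]))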