
    TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning

    Combining deep model-free reinforcement learning with on-line planning is a promising approach to building on the successes of deep RL. On-line planning with look-ahead trees has proven successful in environments where transition models are known a priori. However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning. To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. We also propose ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimised for its actual use in the tree. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a box-pushing task, as well as n-step DQN and value prediction networks (Oh et al. 2017) on multiple Atari games. Furthermore, we present ablation studies that demonstrate the effect of different auxiliary losses on learning transition models.
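
    As a minimal sketch of the tree construction and backup described above, the following numpy snippet expands every action to a fixed depth with a per-action transition in an abstract state space, predicts rewards and leaf values, and backs the estimates up into Q-values. The encoder output, the transition/reward/value heads, the depth-2 tree, and the hard max backup are all illustrative assumptions, not the authors' architecture or training setup.

        # Sketch of a TreeQN-style recursive tree backup (illustrative only).
        import numpy as np

        N_ACTIONS, D_STATE, DEPTH, GAMMA = 4, 8, 2, 0.99
        rng = np.random.default_rng(0)

        # Stand-ins for learned modules: a per-action transition in the abstract
        # state space, a reward head, and a state-value head.
        W_trans = rng.standard_normal((N_ACTIONS, D_STATE, D_STATE)) * 0.1
        w_reward = rng.standard_normal((N_ACTIONS, D_STATE)) * 0.1
        w_value = rng.standard_normal(D_STATE) * 0.1

        def transition(z, a):
            # Residual update keeps the abstract state well-scaled.
            return z + np.tanh(W_trans[a] @ z)

        def reward(z, a):
            return float(w_reward[a] @ z)

        def value(z):
            return float(w_value @ z)

        def tree_q_values(z, depth):
            """Recursively expand all actions and back up rewards and leaf values."""
            q = np.empty(N_ACTIONS)
            for a in range(N_ACTIONS):
                z_next = transition(z, a)
                if depth == 1:
                    backup = value(z_next)                           # leaf: bootstrap on V
                else:
                    backup = tree_q_values(z_next, depth - 1).max()  # one choice of backup
                q[a] = reward(z, a) + GAMMA * backup
            return q

        z0 = rng.standard_normal(D_STATE)   # stand-in for the encoder's abstract state
        print(tree_q_values(z0, DEPTH))     # one Q-value estimate per action

    In an end-to-end setting these Q-values would feed the usual n-step Q-learning loss, so the transition and reward modules are trained for how they are actually used inside the tree.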

    Counterfactual Multi-Agent Policy Gradients

    Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best-performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
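
    As a minimal sketch of the counterfactual baseline described above: for each agent, the centralised critic's Q-values over that agent's candidate actions (with the other agents' actions held fixed) are averaged under the agent's own policy and subtracted from the Q-value of the action actually taken. The critic below is a random stand-in, and all shapes, seeds, and names are illustrative assumptions, not the paper's network.

        # Sketch of COMA's counterfactual advantage (illustrative only).
        import numpy as np

        N_AGENTS, N_ACTIONS, D_STATE = 3, 5, 16
        rng = np.random.default_rng(1)
        W_critic = rng.standard_normal((D_STATE, N_ACTIONS)) * 0.1

        def critic_all_actions(state, joint_action, agent):
            """Stand-in for the centralised critic: returns Q(state, (u^-a, u')) for
            every candidate action u' of `agent` in a single forward pass."""
            others = np.delete(joint_action, agent)
            return np.tanh(state @ W_critic + 0.1 * others.sum())

        state = rng.standard_normal(D_STATE)                         # central state
        joint_action = rng.integers(N_ACTIONS, size=N_AGENTS)        # actions actually taken
        policies = rng.dirichlet(np.ones(N_ACTIONS), size=N_AGENTS)  # pi^a(u' | tau^a)

        for a in range(N_AGENTS):
            q_all = critic_all_actions(state, joint_action, a)  # Q for each u' of agent a
            q_taken = q_all[joint_action[a]]
            baseline = policies[a] @ q_all            # E_{u'~pi^a}[Q(s, (u^-a, u'))]
            advantage = q_taken - baseline            # counterfactual advantage for agent a
            print(f"agent {a}: advantage = {advantage:+.3f}")

    Each agent's decentralised actor would then be updated with a policy gradient weighted by this advantage, which is why computing the baseline in one critic forward pass per agent matters.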

    An Investigation of the Bias-Variance Tradeoff in Meta-Gradients

    Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as tackling the problem of credit assignment to pre-adaptation behavior by making a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to meta-gradient estimation. Meanwhile, meta-gradient estimation has been studied less in the important long-horizon setting, where backpropagation through the full inner optimization trajectories is not feasible. We study the bias and variance tradeoff arising from truncated backpropagation and sampling correction, and additionally compare to evolution strategies, a recently popular alternative for long-horizon meta-learning. While prior work implicitly chooses points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study that relates existing estimators to each other.
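
    As a toy numerical illustration of the long-horizon truncation discussed above, the snippet below differentiates an outer loss through only the last K steps of an inner SGD loop on a quadratic. The quadratic inner loss, the learning-rate meta-parameter, and the horizon are illustrative assumptions, not the paper's experimental setup; the point is only that shortening the truncation window changes (and generally biases) the meta-gradient estimate.

        # Truncated backpropagation through an inner optimisation loop (toy example).
        import numpy as np

        c, c_star = 3.0, 2.0      # inner target and outer (meta) target
        theta0, eta = 0.0, 0.1    # inner initialisation and meta-parameter (learning rate)
        T = 50                    # inner horizon

        def meta_gradient(truncation):
            """d/d_eta of the outer loss 0.5*(theta_T - c_star)^2, backpropagating
            through only the last `truncation` inner SGD steps."""
            theta, dtheta_deta = theta0, 0.0
            for t in range(T):
                grad_inner = theta - c                 # d/d_theta of 0.5*(theta - c)^2
                if t >= T - truncation:
                    # Chain rule through the update theta <- theta - eta * grad_inner.
                    dtheta_deta = (1.0 - eta) * dtheta_deta - grad_inner
                else:
                    dtheta_deta = 0.0                  # theta treated as a constant here
                theta = theta - eta * grad_inner
            return (theta - c_star) * dtheta_deta

        for k in (1, 5, 20, T):
            print(f"truncation={k:3d}  meta-gradient estimate = {meta_gradient(k):+.4f}")

    Evolution strategies would instead estimate the same meta-gradient by perturbing eta and comparing outer losses, trading this truncation bias for sampling variance.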

    Growing Action Spaces

    In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.
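
    As a simplified tabular sketch of the curriculum idea above: behaviour is restricted to a small action subset that grows on a fixed schedule, while a single off-policy Q-learning update keeps training value estimates over the full action space, so data gathered under restricted action spaces still transfers to the full task. The toy chain environment, the growth schedule, and the single Q-table (which collapses the paper's per-action-space value functions into one) are all illustrative assumptions.

        # Growing-action-space curriculum with off-policy Q-learning (toy example).
        import numpy as np

        N_STATES, N_ACTIONS = 10, 6
        GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1
        SCHEDULE = {0: 2, 2000: 4, 4000: N_ACTIONS}   # step -> size of allowed action set
        rng = np.random.default_rng(2)

        def step_env(s, a):
            """Toy chain: action a > 0 moves a cells right, action 0 moves one cell left;
            reaching the right end pays reward 1 and ends the episode."""
            s_next = min(s + a, N_STATES - 1) if a > 0 else max(s - 1, 0)
            done = s_next == N_STATES - 1
            return s_next, (1.0 if done else 0.0), done

        Q = np.zeros((N_STATES, N_ACTIONS))  # value estimates over the FULL action space
        s, n_allowed = 0, SCHEDULE[0]

        for t in range(6000):
            n_allowed = SCHEDULE.get(t, n_allowed)        # grow the behaviour action set
            if rng.random() < EPS:
                a = int(rng.integers(n_allowed))          # explore within the allowed subset
            else:
                a = int(np.argmax(Q[s, :n_allowed]))      # act greedily within the subset
            s_next, r, done = step_env(s, a)
            # Off-policy target maxes over the full action space, so restricted-phase
            # data still improves estimates for actions not yet available to behaviour.
            target = r + (0.0 if done else GAMMA * Q[s_next].max())
            Q[s, a] += ALPHA * (target - Q[s, a])
            s = 0 if done else s_next

        print(np.round(Q, 2))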

    Caregivers' perceived adequacy of support in end-stage lung disease: results of a population survey.

    BACKGROUND: End-stage lung disease (ESLD) is a frequent cause of death. What are the differences in the supports needed by caregivers of individuals with ESLD at end of life versus other life-limiting diagnoses? METHODS: The South Australian Health Omnibus is an annual, random, face-to-face, cross-sectional survey. In 2002, 2003 and 2005-2007, respondents were asked a range of questions about end-of-life care; there were approximately 3000 survey participants annually (participation rate 77.9%). Responses were standardised for the whole population. The families and friends who cared for someone with ESLD were the focus of this analysis. In addition to describing caring, respondents reported additional support that would have been helpful. RESULTS: Of 1504 deaths reported, 145 (9.6%) were due to ESLD. The ESLD cohort were older than those with other 'expected' causes of death (> 65 years of age; 92.6% versus 70.6%; p < 0.0001) and were less likely to access specialised palliative care services (38.4% versus 61.9%; p < 0.0001). For those with ESLD, the mean caring period was significantly longer at 25 months (standard deviation (SD) 24) than for 'other diagnoses' (15 months; SD 18; p < 0.0001). Domains where additional support would have been useful included physical care, information provision, and emotional and spiritual support. CONCLUSIONS: Caregiver needs were similar regardless of the underlying diagnosis although access to palliative care specialist services occurred less often for ESLD patients. This was despite significantly longer periods of time for which care was provided.