TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning
Combining deep model-free reinforcement learning with on-line planning is a
promising approach to building on the successes of deep RL. On-line planning
with look-ahead trees has proven successful in environments where transition
models are known a priori. However, in complex environments where transition
models need to be learned from data, the deficiencies of learned models have
limited their utility for planning. To address these challenges, we propose
TreeQN, a differentiable, recursive, tree-structured model that serves as a
drop-in replacement for any value function network in deep RL with discrete
actions. TreeQN dynamically constructs a tree by recursively applying a
transition model in a learned abstract state space and then aggregating
predicted rewards and state-values using a tree backup to estimate Q-values. We
also propose ATreeC, an actor-critic variant that augments TreeQN with a
softmax layer to form a stochastic policy network. Both approaches are trained
end-to-end, such that the learned model is optimised for its actual use in the
tree. We show that TreeQN and ATreeC outperform n-step DQN and A2C on a
box-pushing task, as well as n-step DQN and value prediction networks (Oh et
al. 2017) on multiple Atari games. Furthermore, we present ablation studies
that demonstrate the effect of different auxiliary losses on learning
transition models.
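To make the recursive backup concrete, here is a minimal sketch (in PyTorch, not the authors' code) of a tree-structured Q-network in the spirit described above: an encoder maps the observation to an abstract state, per-action transition and reward heads expand the tree, and Q-values are aggregated bottom-up. The module names, the depth-2 default, and the hard-max backup are illustrative assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class TreeQNSketch(nn.Module):
    """Illustrative tree backup over a learned abstract-state model.
    Module names, the ReLU transition, and the hard-max backup are assumptions,
    not the paper's exact architecture."""

    def __init__(self, obs_dim, n_actions, hidden=64, depth=2, gamma=0.99):
        super().__init__()
        self.n_actions, self.depth, self.gamma = n_actions, depth, gamma
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # One transition and reward head per discrete action, applied in abstract state space.
        self.transition = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_actions)])
        self.reward_fn = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_actions)])
        self.value_fn = nn.Linear(hidden, 1)

    def _backup(self, z, depth):
        # Expand one tree level and return Q(z, a) for every action.
        q_per_action = []
        for a in range(self.n_actions):
            z_next = torch.relu(self.transition[a](z))
            r = self.reward_fn[a](z).squeeze(-1)
            if depth == 1:
                v = self.value_fn(z_next).squeeze(-1)
            else:
                # Recursive tree backup: a child's value is the best of its own Q-estimates.
                v = self._backup(z_next, depth - 1).max(dim=-1).values
            q_per_action.append(r + self.gamma * v)
        return torch.stack(q_per_action, dim=-1)

    def forward(self, obs):
        return self._backup(self.encoder(obs), self.depth)

# q = TreeQNSketch(obs_dim=16, n_actions=4)(torch.randn(8, 16))   # shape (8, 4)
# An ATreeC-style stochastic policy could be formed with torch.softmax(q, dim=-1).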
Counterfactual Multi-Agent Policy Gradients
Cooperative multi-agent systems can be naturally used to model many real
world problems, such as network packet routing and the coordination of
autonomous vehicles. There is a great need for new reinforcement learning
methods that can efficiently learn decentralised policies for such systems. To
this end, we propose a new multi-agent actor-critic method called
counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised
critic to estimate the Q-function and decentralised actors to optimise the
agents' policies. In addition, to address the challenges of multi-agent credit
assignment, it uses a counterfactual baseline that marginalises out a single
agent's action, while keeping the other agents' actions fixed. COMA also uses a
critic representation that allows the counterfactual baseline to be computed
efficiently in a single forward pass. We evaluate COMA in the testbed of
StarCraft unit micromanagement, using a decentralised variant with significant
partial observability. COMA significantly improves average performance over
other multi-agent actor-critic methods in this setting, and the best performing
agents are competitive with state-of-the-art centralised controllers that get
access to the full state.
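The counterfactual baseline can be stated compactly. Below is a hedged sketch (PyTorch, not the authors' code) assuming the centralised critic has already produced, in one forward pass, the Q-values for every action of one agent with the other agents' actions held fixed; the function name and tensor shapes are assumptions for illustration.

import torch

def coma_advantage(q_all_actions, pi_a, chosen_action):
    """Counterfactual advantage for a single agent (shapes are assumptions):
      q_all_actions: (batch, n_actions) centralised-critic Q-values for every action of
                     this agent, with the other agents' actions held fixed (one forward pass).
      pi_a:          (batch, n_actions) this agent's decentralised policy probabilities.
      chosen_action: (batch,) int64 action the agent actually took."""
    q_taken = q_all_actions.gather(1, chosen_action.unsqueeze(1)).squeeze(1)
    # Counterfactual baseline: marginalise out this agent's own action under its policy,
    # keeping the other agents' actions fixed inside q_all_actions.
    baseline = (pi_a * q_all_actions).sum(dim=1)
    return q_taken - baseline

# Decentralised-actor loss sketch:
# log_pi_taken = torch.log(pi_a.gather(1, chosen_action.unsqueeze(1)).squeeze(1) + 1e-8)
# actor_loss = -(coma_advantage(q_all_actions, pi_a, chosen_action).detach() * log_pi_taken).mean()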
An Investigation of the Bias-Variance Tradeoff in Meta-Gradients
Meta-gradients provide a general approach for optimizing the meta-parameters
of reinforcement learning (RL) algorithms. Estimation of meta-gradients is
central to the performance of these meta-algorithms, and has been studied in
the setting of MAML-style short-horizon meta-RL problems. In this context,
prior work has investigated the estimation of the Hessian of the RL objective,
as well as tackling the problem of credit assignment to pre-adaptation behavior
by making a sampling correction. However, we show that Hessian estimation,
implemented for example by DiCE and its variants, always adds bias and can also
add variance to meta-gradient estimation. Meanwhile, meta-gradient estimation
has been studied less in the important long-horizon setting, where
backpropagation through the full inner optimization trajectories is not
feasible. We study the bias and variance tradeoff arising from truncated
backpropagation and sampling correction, and additionally compare to evolution
strategies, a recently popular alternative for long-horizon meta-learning. While
prior work implicitly chooses points in this bias-variance space, we disentangle
the sources of bias and variance and present an empirical study that relates
existing estimators to each other.
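To make one point in this bias-variance space concrete, the following is an illustrative sketch (PyTorch, not from the paper) of a truncated-backpropagation meta-gradient: only K inner updates are kept in the graph, which is exactly the truncation that trades bias against compute and variance; evolution strategies would instead estimate a meta-gradient from random perturbations of the meta-parameters. The inner and meta objectives, and the choice of a learning rate as the meta-parameter, are placeholders.

import torch

def truncated_meta_gradient(meta_lr, theta, inner_loss_fn, meta_loss_fn, K=3):
    """Illustrative truncated-backprop meta-gradient (placeholder objectives, not the paper's code).
      meta_lr: scalar tensor with requires_grad=True; the meta-parameter (an inner-loop
               learning rate here, purely as an example).
      theta:   inner parameters (tensor with requires_grad=True) at the start of the window.
      K:       number of inner steps kept in the graph; smaller K is cheaper but more biased."""
    theta_k = theta
    for _ in range(K):
        inner_loss = inner_loss_fn(theta_k)
        # create_graph=True keeps the higher-order terms, so the meta-gradient can see how
        # each inner update depended on the meta-parameter.
        (g,) = torch.autograd.grad(inner_loss, theta_k, create_graph=True)
        theta_k = theta_k - torch.nn.functional.softplus(meta_lr) * g
    meta_loss = meta_loss_fn(theta_k)
    return torch.autograd.grad(meta_loss, meta_lr)[0]

# Example with quadratic placeholder objectives:
# theta = torch.randn(5, requires_grad=True)
# eta = torch.tensor(0.0, requires_grad=True)
# g_meta = truncated_meta_gradient(eta, theta,
#                                  inner_loss_fn=lambda t: (t ** 2).sum(),
#                                  meta_loss_fn=lambda t: ((t - 1.0) ** 2).sum())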
Growing Action Spaces
In complex tasks, such as those with large combinatorial action spaces,
random exploration may be too inefficient to achieve meaningful learning
progress. In this work, we use a curriculum of progressively growing action
spaces to accelerate learning. We assume the environment is out of our control,
but that the agent may set an internal curriculum by initially restricting its
action space. Our approach uses off-policy reinforcement learning to estimate
optimal value functions for multiple action spaces simultaneously and
efficiently transfers data, value estimates, and state representations from
restricted action spaces to the full task. We show the efficacy of our approach
in proof-of-concept control tasks and on challenging large-scale StarCraft
micromanagement tasks with large, multi-agent action spaces.
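As a loose illustration of the idea (not the paper's method), the sketch below keeps one tabular value function per nested action space, draws behaviour only from the currently unlocked subset, and updates every level off-policy from the same transitions, so data and value estimates carry over when the curriculum grows. The nesting by action-index prefix and the growth schedule are assumptions.

import random
import numpy as np

class GrowingActionSpaceQ:
    """Tabular Q-learning sketch with a curriculum of nested action spaces.
    The nesting, the schedule, and the per-level tables are illustrative assumptions."""

    def __init__(self, n_states, action_space_sizes=(2, 4, 8), lr=0.1, gamma=0.99, eps=0.1):
        self.sizes = action_space_sizes              # nested spaces A_0 ⊂ A_1 ⊂ ... ⊂ A_full
        self.q = [np.zeros((n_states, k)) for k in action_space_sizes]
        self.level = 0                               # current curriculum level
        self.lr, self.gamma, self.eps = lr, gamma, eps

    def act(self, s):
        # Behaviour is restricted to the currently unlocked action space.
        k = self.sizes[self.level]
        if random.random() < self.eps:
            return random.randrange(k)
        return int(np.argmax(self.q[self.level][s]))

    def update(self, s, a, r, s_next, done):
        # Off-policy updates: the same transition improves the value estimate of every
        # action space that contains the taken action, so data transfers across levels.
        for lvl, k in enumerate(self.sizes):
            if a >= k:
                continue
            target = r if done else r + self.gamma * np.max(self.q[lvl][s_next])
            self.q[lvl][s, a] += self.lr * (target - self.q[lvl][s, a])

    def grow(self):
        # Advance the curriculum; the larger space already holds value estimates learned
        # off-policy from the restricted behaviour, so nothing is thrown away.
        self.level = min(self.level + 1, len(self.sizes) - 1)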
Caregivers' perceived adequacy of support in end-stage lung disease: results of a population survey.
BACKGROUND: End-stage lung disease (ESLD) is a frequent cause of death. What are the differences in the supports needed by caregivers of individuals with ESLD at end of life versus other life-limiting diagnoses? METHODS: The South Australian Health Omnibus is an annual, random, face-to-face, cross-sectional survey. In 2002, 2003 and 2005-2007, respondents were asked a range of questions about end-of-life care; there were approximately 3000 survey participants annually (participation rate 77.9%). Responses were standardised for the whole population. The families and friends who cared for someone with ESLD were the focus of this analysis. In addition to describing caring, respondents reported additional support that would have been helpful. RESULTS: Of 1504 deaths reported, 145 (9.6%) were due to ESLD. The ESLD cohort were older than those with other 'expected' causes of death (> 65 years of age; 92.6% versus 70.6%; p < 0.0001) and were less likely to access specialised palliative care services (38.4% versus 61.9%; p < 0.0001). For those with ESLD, the mean caring period was significantly longer at 25 months (standard deviation (SD) 24) than for 'other diagnoses' (15 months; SD 18; p < 0.0001). Domains where additional support would have been useful included physical care, information provision, and emotional and spiritual support. CONCLUSIONS: Caregiver needs were similar regardless of the underlying diagnosis, although access to specialist palliative care services occurred less often for ESLD patients. This was despite significantly longer periods of time for which care was provided.
- …