4 research outputs found
A Notation for Markov Decision Processes
This paper specifies a notation for Markov decision processes.
Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines
We show how an action-dependent baseline can be used with the policy gradient
theorem under function approximation, which was originally presented with
action-independent baselines by Sutton et al. (2000).
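The baseline idea the abstract refers to can be sketched in a few lines: the policy gradient estimate multiplies the score function by the return, minus a baseline that reduces variance. The function names below are hypothetical, and this sketch simply subtracts a (possibly action-dependent) baseline b(s, a); the paper's contribution is showing how such a baseline can be used within the function-approximation setting.

```python
import numpy as np

def pg_estimate(states, actions, returns, grad_log_pi, baseline):
    """REINFORCE-style gradient estimate with a baseline subtracted.

    Illustrative sketch only: grad_log_pi(s, a) returns the gradient of
    log pi(a|s) w.r.t. the policy parameters, and baseline(s, a) is the
    (possibly action-dependent) baseline b(s, a).
    """
    terms = [grad_log_pi(s, a) * (G - baseline(s, a))
             for s, a, G in zip(states, actions, returns)]
    # Average the per-sample terms to form the Monte Carlo gradient estimate.
    return np.mean(terms, axis=0)
```

With an action-independent baseline the subtraction leaves the estimate unbiased while lowering its variance; extending this to action-dependent baselines is precisely what the paper addresses.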
TripleTree: A Versatile Interpretable Representation of Black Box Agents and their Environments
In explainable artificial intelligence, there is increasing interest in
understanding the behaviour of autonomous agents to build trust and validate
performance. Modern agent architectures, such as those trained by deep
reinforcement learning, are currently so lacking in interpretable structure as
to effectively be black boxes, but insights may still be gained from an
external, behaviourist perspective. Inspired by conceptual spaces theory, we
suggest that a versatile first step towards general understanding is to
discretise the state space into convex regions, jointly capturing similarities
over the agent's action, value function and temporal dynamics within a dataset
of observations. We create such a representation using a novel variant of the
CART decision tree algorithm, and demonstrate how it facilitates practical
understanding of black box agents through prediction, visualisation and
rule-based explanation.
Comment: 12 pages (incl. references and appendices), 15 figures. Pre-print,
under review.
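The core CART mechanic behind such a representation is a greedy axis-aligned split of the state space. The sketch below is a minimal single-split illustration scored only on action impurity (Gini); the paper's variant is multi-objective, also capturing value-function and temporal-dynamics similarity, and all names here are illustrative rather than the paper's API.

```python
import numpy as np

def best_split(states, actions):
    """Find the axis-aligned (feature, threshold) split of the state space
    that best separates the agent's actions, by weighted Gini impurity.

    states: (n, d) array of observed states; actions: (n,) discrete actions.
    """
    def gini(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    best = (None, None, np.inf)  # (feature, threshold, weighted impurity)
    n = len(actions)
    for f in range(states.shape[1]):
        for thr in np.unique(states[:, f]):
            left = states[:, f] <= thr
            if left.all() or not left.any():
                continue  # degenerate split; skip
            score = (left.sum() * gini(actions[left])
                     + (~left).sum() * gini(actions[~left])) / n
            if score < best[2]:
                best = (f, thr, score)
    return best
```

Applied recursively, such splits yield the convex (axis-aligned) regions the abstract describes, each summarising locally similar agent behaviour.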
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a
reinforcement learning policy given historical data that may have been
generated by a different policy. The ability to evaluate a policy from
historical data is important for applications where the deployment of a bad
policy can be dangerous or costly. We show empirically that our algorithm
produces estimates that often have orders of magnitude lower mean squared error
than existing methods; it makes more efficient use of the available data. Our
new estimator is based on two advances: an extension of the doubly robust
estimator (Jiang and Li, 2015), and a new way to mix between model-based
estimates and importance-sampling-based estimates.
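The doubly robust estimator the abstract builds on can be sketched recursively: at each step it takes the model-based value estimate and adds an importance-weighted correction using the observed reward. This is a minimal sketch under assumed interfaces (the argument names and shapes are illustrative, not the paper's code), following the recursive form of the sequential doubly robust estimator.

```python
def doubly_robust_ope(trajectory, rho, q_hat, v_hat, gamma=0.99):
    """Recursive doubly robust off-policy value estimate for one trajectory.

    trajectory: list of (state, action, reward) tuples.
    rho[t]: per-step importance ratio pi_e(a_t|s_t) / pi_b(a_t|s_t).
    q_hat(s, a), v_hat(s): model-based action-value and state-value estimates.
    """
    v_dr = 0.0
    # Work backwards so each step can use the estimate for its successor.
    for t in reversed(range(len(trajectory))):
        s, a, r = trajectory[t]
        # Model estimate v_hat(s), corrected by the importance-weighted
        # difference between the observed outcome and the model's q_hat.
        v_dr = v_hat(s) + rho[t] * (r + gamma * v_dr - q_hat(s, a))
    return v_dr
```

When the model estimates q_hat and v_hat are zero, the estimator reduces to ordinary per-step importance sampling; when the importance ratios are uninformative, it falls back on the model, which is the mixing behaviour the abstract highlights.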