Attention-Privileged Reinforcement Learning
Image-based Reinforcement Learning is known to suffer from poor sample
efficiency and poor generalisation to unseen visuals such as distractors
(task-independent aspects of the observation space). Visual domain
randomisation encourages transfer by training over visual factors of variation
that may be encountered in the target domain. However, this increases learning
complexity, can negatively impact learning rate and performance, and requires
knowledge of the potential variations before deployment. In this paper, we
introduce Attention-Privileged Reinforcement Learning (APRiL) which uses a
self-supervised attention mechanism to significantly alleviate these drawbacks:
by focusing on task-relevant aspects of the observations, attention provides
robustness to distractors as well as significantly increased learning
efficiency. APRiL trains two attention-augmented actor-critic agents: one
purely based on image observations, available across training and transfer
domains; and one with access to privileged information (such as environment
states) available only during training. Experience is shared between both
agents and their attention mechanisms are aligned. The image-based policy can
then be deployed without access to privileged information. We experimentally
demonstrate accelerated and more robust learning on a diverse set of domains,
leading to improved final performance for environments both within and outside
the training distribution.
Comment: Published at Conference on Robot Learning (CoRL) 2020
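A minimal sketch of this shared-experience, attention-alignment setup, assuming toy feature-vector observations and an MSE alignment loss (module names, dimensions, and the loss choice are illustrative, not the authors' implementation; in the paper attention is spatial over image observations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionActorCritic(nn.Module):
    """Actor-critic whose features are gated by a learned soft attention mask."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.attention = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
        self.actor = nn.Linear(hidden, act_dim)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = torch.relu(self.encoder(obs))
        attn = self.attention(h)   # self-supervised soft mask in [0, 1]
        return self.actor(h * attn), self.critic(h * attn), attn

# One agent sees privileged state (training only), the other image features
# (available at deployment); experience is shared between them.
state_agent = AttentionActorCritic(obs_dim=8, act_dim=4)
image_agent = AttentionActorCritic(obs_dim=8, act_dim=4)

def attention_alignment_loss(state_attn, image_attn):
    # The privileged agent's attention supervises *where* the image agent
    # looks; detaching stops gradients flowing back into the state agent.
    return F.mse_loss(image_attn, state_attn.detach())

state_obs = torch.randn(32, 8)   # privileged environment state (toy stand-in)
image_obs = torch.randn(32, 8)   # image features for the same batch
_, _, s_attn = state_agent(state_obs)
_, _, i_attn = image_agent(image_obs)
attention_alignment_loss(s_attn, i_attn).backward()
```

Only the image agent is kept at deployment, which is how the method avoids needing privileged information at test time.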
Ranking Policy Decisions
Policies trained via Reinforcement Learning (RL) are often needlessly
complex, making them difficult to analyse and interpret. In a run with N time
steps, a policy will make N decisions on actions to take; we conjecture that
only a small subset of these decisions delivers value over selecting a simple
default action. Given a trained policy, we propose a novel black-box method
based on statistical fault localisation that ranks the states of the
environment according to the importance of decisions made in those states. We
argue that among other things, the ranked list of states can help explain and
understand the policy. As the ranking method is statistical, a direct
evaluation of its quality is hard. As a proxy for quality, we use the ranking
to create new, simpler policies from the original ones by pruning decisions
identified as unimportant (that is, replacing them by default actions) and
measuring the impact on performance. Our experiments on a diverse set of
standard benchmarks demonstrate that pruned policies can perform on a level
comparable to the original policies. Conversely, we show that naive approaches
for ranking policy decisions, e.g., ranking based on the frequency of visiting
a state, do not result in high-performing pruned policies.
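A minimal sketch of the rank-then-prune loop, assuming hashable states, an Ochiai-style score (one of several suspiciousness measures used in statistical fault localisation), and runs collected from policies whose decisions were randomly replaced by the default action; the function names are illustrative:

```python
import math
from collections import Counter

def rank_states(runs):
    """Rank states by an Ochiai-style fault-localisation score.

    `runs` is a list of (visited_states, success) pairs; states that
    correlate with failing runs rank as most important.
    """
    fail, ok = Counter(), Counter()
    n_fail = sum(1 for _, success in runs if not success)
    for states, success in runs:
        counts = ok if success else fail
        for s in set(states):
            counts[s] += 1

    def score(s):
        denom = math.sqrt(n_fail * (fail[s] + ok[s])) or 1.0
        return fail[s] / denom

    return sorted(set(fail) | set(ok), key=score, reverse=True)

def make_pruned_policy(policy, important_states, default_action):
    """Keep the trained policy's decisions only in important states."""
    def act(state):
        return policy(state) if state in important_states else default_action
    return act

# Example: keep the top 10% of ranked states, default everywhere else.
# (`policy`, `runs`, and `DEFAULT` are placeholders for a real setup.)
# ranking = rank_states(runs)
# pruned = make_pruned_policy(policy, set(ranking[:len(ranking) // 10]), DEFAULT)
```

Sweeping how many top-ranked states retain the trained policy's decisions, and re-measuring episode return each time, yields the proxy evaluation of ranking quality described above.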
Specifying and Interpreting Reinforcement Learning Policies through Simulatable Machine Learning
Human-AI collaborative policy synthesis is a procedure in which (1) a human
initializes an autonomous agent's behavior, (2) Reinforcement Learning improves
the human-specified behavior, and (3) the agent can explain the final optimized
policy to the user. This paradigm leverages human expertise and facilitates a
greater insight into the learned behaviors of an agent. Existing approaches to
enabling collaborative policy specification involve black-box methods that are
unintelligible and not catered to non-expert end-users. In this paper,
we develop a novel collaborative framework to enable humans to initialize and
interpret an autonomous agent's behavior, rooted in principles of
human-centered design. Through our framework, we enable humans to specify an
initial behavior model in the form of unstructured natural language, which we
then convert to lexical decision trees. Next, we leverage these
human-specified policies to warm-start reinforcement learning, allowing the
agent to further optimize its behavior.
Finally, to close the loop on human specification, we produce explanations of
the final learned policy, in multiple modalities, to give the user a final
depiction of the agent's learned behavior. We validate our approach by
showing that our model achieves >80% accuracy, and that human-initialized
policies are able to successfully warm-start RL. We then conduct a novel
human-subjects study quantifying the relative subjective and objective benefits
of varying XAI modalities (e.g., Tree, Language, and Program) for explaining
learned policies to end-users, in terms of usability and interpretability, and
identify the circumstances that influence these measures. Our findings
emphasize the need for personalized explainable systems that can facilitate
user-centric policy explanations for a variety of end-users.
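To make "lexical decision tree" concrete, here is a hypothetical minimal form such a tree could take once a user's sentence has been parsed (the feature names, thresholds, and dict-based observations are invented for illustration; the paper's language-to-tree conversion is a learned model):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[str] = None   # predicate parsed from the user's language
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when feature < threshold
    right: Optional["Node"] = None
    action: Optional[int] = None    # set only at leaves

def tree_policy(node: Node, obs: dict) -> int:
    """Follow the user-specified tree until a leaf action is reached."""
    while node.action is None:
        node = node.left if obs[node.feature] < node.threshold else node.right
    return node.action

# "If the obstacle is closer than 2 m, turn; otherwise go straight."
tree = Node(feature="obstacle_dist", threshold=2.0,
            left=Node(action=1),    # turn
            right=Node(action=0))   # go straight

assert tree_policy(tree, {"obstacle_dist": 1.5}) == 1
```

Rolling such a tree out yields state-action pairs that can supervise an initial policy (e.g., via behavioural cloning) before RL fine-tuning, which is one plausible reading of the warm start described above.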