61 research outputs found
Experimental results : Reinforcement Learning of POMDPs using Spectral Methods
We propose a new reinforcement learning algorithm for partially observable
Markov decision processes (POMDP) based on spectral decomposition methods.
While spectral methods have been previously employed for consistent learning of
(passive) latent variable models such as hidden Markov models, POMDPs are more
challenging since the learner interacts with the environment and possibly
changes the future observations in the process. We devise a learning algorithm
running through epochs, in each epoch we employ spectral techniques to learn
the POMDP parameters from a trajectory generated by a fixed policy. At the end
of the epoch, an optimization oracle returns the optimal memoryless planning
policy which maximizes the expected reward based on the estimated POMDP model.
We prove an order-optimal regret bound with respect to the optimal memoryless
policy and efficient scaling with respect to the dimensionality of observation
and action spaces.Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016),
Barcelona, Spai
Monte Carlo Bayesian Reinforcement Learning
Bayesian reinforcement learning (BRL) encodes prior knowledge of the world in
a model and represents uncertainty in model parameters by maintaining a
probability distribution over them. This paper presents Monte Carlo BRL
(MC-BRL), a simple and general approach to BRL. MC-BRL samples a priori a
finite set of hypotheses for the model parameter values and forms a discrete
partially observable Markov decision process (POMDP) whose state space is a
cross product of the state space for the reinforcement learning task and the
sampled model parameter space. The POMDP does not require conjugate
distributions for belief representation, as earlier works do, and can be solved
relatively easily with point-based approximation algorithms. MC-BRL naturally
handles both fully and partially observable worlds. Theoretical and
experimental results show that the discrete POMDP approximates the underlying
BRL task well with guaranteed performance.Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012
Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs
Centaurs are half-human, half-AI decision-makers where the AI's goal is to complement the human. To do so, the AI must be able to recognize the goals and constraints of the human and have the means to help them. We present a novel formulation of the interaction between the human and the AI as a sequential game where the agents are modelled using Bayesian best-response models. We show that in this case the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP. In our simulated experiments, we consider an instantiation of our framework for humans who are subjectively optimistic about the AI's future behaviour. Our results show that when equipped with a model of the human, the AI can infer the human's bounds and nudge them towards better decisions. We discuss ways in which the machine can learn to improve upon its own limitations as well with the help of the human. We identify a novel trade-off for centaurs in partially observable tasks: for the AI's actions to be acceptable to the human, the machine must make sure their beliefs are sufficiently aligned, but aligning beliefs might be costly. We present a preliminary theoretical analysis of this trade-off and its dependence on task structure
- …