Memory Bounded Open-Loop Planning in Large POMDPs using Thompson Sampling
State-of-the-art approaches to partially observable planning like POMCP are
based on stochastic tree search. While these approaches are computationally
efficient, they may still construct search trees of considerable size, which
could limit the performance due to restricted memory resources. In this paper,
we propose Partially Observable Stacked Thompson Sampling (POSTS), a memory
bounded approach to open-loop planning in large POMDPs, which optimizes a fixed
size stack of Thompson Sampling bandits. We empirically evaluate POSTS in four
large benchmark problems and compare its performance with different tree-based
approaches. We show that POSTS achieves competitive performance compared to
tree-based open-loop planning and offers a performance-memory tradeoff, making
it suitable for partially observable planning with highly restricted
computational and memory resources.
Comment: Presented at AAAI 201
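The idea of a fixed-size stack of Thompson Sampling bandits can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian posterior with standard deviation 1/sqrt(n+1), the `step` generative model, the `sample_state` belief sampler, and all parameter names are assumptions made for the example. Memory use is bounded by `horizon * n_actions` statistics, regardless of how many simulations are run.

```python
import random


class GaussianBandit:
    """Thompson Sampling bandit with a simple Gaussian posterior per arm (assumed model)."""

    def __init__(self, n_actions):
        self.counts = [0] * n_actions
        self.means = [0.0] * n_actions

    def select(self, rng):
        # Draw one plausible mean per arm; rarely tried arms have a wider posterior.
        samples = [
            rng.gauss(mean, 1.0 / (count + 1) ** 0.5)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action, ret):
        # Incremental update of the running mean return for the chosen arm.
        self.counts[action] += 1
        self.means[action] += (ret - self.means[action]) / self.counts[action]


def plan(step, sample_state, n_actions, horizon, n_simulations, gamma=0.95, seed=0):
    """Open-loop planning with one bandit per time step.

    step(state, action, rng) -> (next_state, reward, done) and
    sample_state(rng) -> state are hypothetical callables supplied by the user.
    """
    rng = random.Random(seed)
    stack = [GaussianBandit(n_actions) for _ in range(horizon)]
    for _ in range(n_simulations):
        state = sample_state(rng)  # e.g. a state drawn from the current belief
        actions, rewards = [], []
        for bandit in stack:
            action = bandit.select(rng)
            state, reward, done = step(state, action, rng)
            actions.append(action)
            rewards.append(reward)
            if done:
                break
        # Propagate discounted returns backwards through the stack.
        ret = 0.0
        for t in reversed(range(len(actions))):
            ret = rewards[t] + gamma * ret
            stack[t].update(actions[t], ret)
    first = stack[0]
    return max(range(n_actions), key=first.means.__getitem__)
```

Because the plan is open-loop, the bandit at depth t ignores observations and conditions only on the time step, which is what keeps the memory footprint fixed.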
Near-Optimal BRL using Optimistic Local Transitions
Model-based Bayesian Reinforcement Learning (BRL) allows a sound
formalization of the problem of acting optimally while facing an unknown
environment, i.e., avoiding the exploration-exploitation dilemma. However,
algorithms explicitly addressing BRL suffer from such a combinatorial explosion
that a large body of work relies on heuristic algorithms. This paper introduces
BOLT, a simple and (almost) deterministic heuristic algorithm for BRL which is
optimistic about the transition function. We analyze BOLT's sample complexity,
and show that under certain parameters, the algorithm is near-optimal in the
Bayesian sense with high probability. Then, experimental results highlight the
key differences of this method compared to previous work.
Comment: ICML201
Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
We derive sublinear regret bounds for undiscounted reinforcement learning in
continuous state space. The proposed algorithm combines state aggregation with
the use of upper confidence bounds for implementing optimism in the face of
uncertainty. Besides the existence of an optimal policy which satisfies the
Poisson equation, the only assumptions made are Hölder continuity of rewards
and transition probabilities
- …
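The combination of state aggregation and upper confidence bounds mentioned in the last abstract can be illustrated with a small sketch. It is not the algorithm from the paper: it covers only reward estimates (not transition probabilities), and the state space [0, 1], rewards in [0, 1], the bonus term sqrt(2 ln t / n), and all names are assumptions chosen for the example.

```python
import math


def aggregate(state, n_intervals):
    """Map a state in [0, 1] to the index of the interval that contains it."""
    return min(int(state * n_intervals), n_intervals - 1)


class OptimisticRewardModel:
    """Per-interval reward estimates with an optimism bonus (assumed form)."""

    def __init__(self, n_intervals, n_actions):
        self.n_intervals = n_intervals
        self.counts = [[0] * n_actions for _ in range(n_intervals)]
        self.sums = [[0.0] * n_actions for _ in range(n_intervals)]

    def update(self, state, action, reward):
        # All states in the same interval share one set of statistics.
        i = aggregate(state, self.n_intervals)
        self.counts[i][action] += 1
        self.sums[i][action] += reward

    def upper_bound(self, state, action, t):
        """Optimistic reward estimate at time step t, capped at the maximal reward 1."""
        i = aggregate(state, self.n_intervals)
        n = self.counts[i][action]
        if n == 0:
            return 1.0  # unvisited pairs are assumed to be as good as possible
        bonus = math.sqrt(2.0 * math.log(max(t, 2)) / n)
        return min(1.0, self.sums[i][action] / n + bonus)
```

Aggregation is justified when nearby states behave similarly, which is the role of the continuity assumption on rewards and transition probabilities: the finer the intervals, the smaller the aggregation error, but the fewer samples each interval receives.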