A Sampling-Based Method for Gittins Index Approximation
A sampling-based method is introduced to approximate the Gittins index for a general family of alternative bandit processes. The approximation consists of a truncation of the optimization horizon and of the support of the immediate rewards, an optimal-stopping value approximation, and a stochastic approximation procedure. Finite-time error bounds are given for the three approximations, leading to a procedure for constructing a confidence interval for the Gittins index from a finite number of Monte Carlo samples, as well as an epsilon-optimal policy for the Bayesian multi-armed bandit. Proofs are given for almost sure convergence and convergence in distribution of the sampling-based Gittins index approximation. In a numerical study, the approximation quality of the proposed method is verified for the Bernoulli bandit and the Gaussian bandit with known variance, and the method is shown to significantly outperform Thompson sampling and the Bayesian Upper Confidence Bound algorithms on a novel random-effects multi-armed bandit.
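The horizon-truncation idea is concrete enough to sketch. Below is a minimal illustration for the Bernoulli bandit with a Beta posterior, using exact dynamic programming in Whittle's retirement formulation rather than the paper's Monte Carlo procedure; the function name, default parameters, and tolerance are illustrative assumptions.

```python
def gittins_index_bernoulli(alpha, beta, gamma=0.9, horizon=50, tol=1e-4):
    """Approximate the Gittins index of a Bernoulli arm with a
    Beta(alpha, beta) posterior by truncating the horizon and bisecting
    on the retirement reward lam (Whittle's retirement formulation).
    Illustrative sketch, not the paper's sampling-based procedure."""

    def value(a, b, lam, h, cache):
        # Truncated optimal stopping value: keep pulling the arm, or
        # retire and collect lam per step for the remaining h steps.
        if h == 0:
            return 0.0
        key = (a, b, h)
        if key not in cache:
            p = a / (a + b)  # posterior mean of the success probability
            cont = p * (1.0 + gamma * value(a + 1, b, lam, h - 1, cache)) \
                + (1.0 - p) * gamma * value(a, b + 1, lam, h - 1, cache)
            retire = lam * (1.0 - gamma ** h) / (1.0 - gamma)
            cache[key] = max(cont, retire)
        return cache[key]

    lo, hi = 0.0, 1.0  # rewards lie in [0, 1], so the index does too
    while hi - lo > tol:
        lam, cache = 0.5 * (lo + hi), {}
        p = alpha / (alpha + beta)
        cont = p * (1.0 + gamma * value(alpha + 1, beta, lam, horizon - 1, cache)) \
            + (1.0 - p) * gamma * value(alpha, beta + 1, lam, horizon - 1, cache)
        retire = lam * (1.0 - gamma ** horizon) / (1.0 - gamma)
        # If continuing beats retiring at the root, the index exceeds lam.
        lo, hi = (lam, hi) if cont > retire else (lo, lam)
    return 0.5 * (lo + hi)


# The index exceeds the posterior mean (0.5 here) because pulling the
# arm also carries information value.
print(gittins_index_bernoulli(1.0, 1.0))
```

Replacing the exact expectations above with averages over Monte Carlo samples of the reward sequence recovers the flavor of the paper's approximation and is what lets the approach extend beyond conjugate families.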
Bayesian Reinforcement Learning via Deep, Sparse Sampling
We address the problem of Bayesian reinforcement learning using efficient model-based online planning. We propose an optimism-free Bayes-adaptive algorithm that induces deeper and sparser exploration, with a theoretical bound on its performance relative to the Bayes-optimal policy and with lower computational complexity. The main novelty is the use of a candidate policy generator to produce long-term options in the planning tree (over beliefs), which allows us to build much sparser and deeper trees (see the sketch below). Experimental results on different environments show that, in comparison to the state of the art, our algorithm is both computationally more efficient and obtains significantly higher reward in discrete environments.

Comment: Published in AISTATS 202
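As a rough illustration of why options make the tree sparser and deeper, here is a minimal sparse-sampling-style sketch in which the tree branches on a few candidate policies, each followed for several steps, rather than on every primitive action. The `simulate` function, the policy objects, and all parameter names are assumptions for the sketch, not the paper's implementation; `simulate` is taken to sample dynamics from the belief, roll the policy out for `option_len` steps, and return the discounted reward along with the updated belief and state.

```python
def option_sparse_sampling(belief, state, depth, candidate_policies,
                           n_samples, option_len, gamma, simulate):
    """Estimate the value of a belief-state by sparse sampling over
    long-term options: each candidate policy is rolled out for
    option_len steps, so the branching factor per level is the number
    of candidates rather than the number of primitive actions."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for policy in candidate_policies:
        total = 0.0
        for _ in range(n_samples):
            reward, next_belief, next_state = simulate(
                belief, state, policy, option_len)
            total += reward + gamma ** option_len * option_sparse_sampling(
                next_belief, next_state, depth - 1, candidate_policies,
                n_samples, option_len, gamma, simulate)
        best = max(best, total / n_samples)
    return best
```

With a handful of candidates and an option length of, say, five steps, a tree of a given depth looks five times further ahead in real steps than one that branches on primitive actions at every level, at the same branching cost per level.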
Sample-based Search Methods for Bayes-Adaptive Planning
A fundamental issue for control is acting in the face of uncertainty about the environment. Amongst other things, this induces a trade-off between exploration and exploitation. A model-based Bayesian agent optimizes its return by maintaining a posterior distribution over possible environments, and considering all possible future paths. This optimization is equivalent to solving a Markov Decision Process (MDP) whose hyperstate comprises the agent's beliefs about the environment, as well as its current state in that environment. The resulting process is called a Bayes-Adaptive MDP (BAMDP). Even for MDPs with only a few states, it is generally intractable to solve the corresponding BAMDP exactly. Various heuristics have been devised, but those that are computationally tractable often perform indifferently, whereas those that perform well are typically so expensive as to be applicable only in small domains with limited structure. Here, we develop new tractable methods for planning in BAMDPs based on recent advances in the solution of large MDPs and general partially observable MDPs. Our algorithms are sample-based, plan online in a way that is focused on the current belief, and, critically, avoid expensive belief updates during simulations. In discrete domains, we use Monte-Carlo tree search to search forward in an aggressive manner. The derived algorithm can scale to large MDPs and provably converges to the Bayes-optimal solution asymptotically. We then consider a more general class of simulation-based methods in which approximation methods can be employed to allow value function estimates to generalize between hyperstates during search. This allows us to tackle continuous domains. We validate our approach empirically in standard domains by comparison with existing approximations. Finally, we explore Bayes-adaptive planning in environments that are modelled by rich, non-parametric probabilistic models. We demonstrate that a fully Bayesian agent can be advantageous in the exploration of complex and even infinite, structured domains.
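The trick that avoids belief updates during simulations is root sampling: draw one complete MDP from the posterior at the start of each simulation and run the entire simulation inside that fixed sample, so the belief is only ever updated at the root. A minimal sketch of this idea follows, assuming a user-supplied posterior sampler `sample_mdp()` and MDP objects exposing `step(state, action) -> (next_state, reward)`; the flat dictionary bookkeeping and all names are illustrative, not the thesis code.

```python
import math
from collections import defaultdict

def bamcp_search(root_state, sample_mdp, actions, n_sims=1000,
                 depth=15, gamma=0.95, c_ucb=1.0):
    """Bayes-adaptive Monte-Carlo planning with root sampling: UCT over
    histories, where each simulation runs in one MDP drawn from the
    posterior instead of updating the belief at every step."""
    N = defaultdict(int)    # visits to (history, action)
    Nh = defaultdict(int)   # visits to history
    Q = defaultdict(float)  # running mean of Monte Carlo returns

    def simulate(mdp, state, history, d):
        if d == 0:
            return 0.0
        def ucb(a):  # UCB1 score; unvisited actions are tried first
            if N[(history, a)] == 0:
                return float("inf")
            return Q[(history, a)] + c_ucb * math.sqrt(
                math.log(Nh[history]) / N[(history, a)])
        a = max(actions, key=ucb)
        next_state, reward = mdp.step(state, a)
        ret = reward + gamma * simulate(
            mdp, next_state, history + ((a, next_state),), d - 1)
        Nh[history] += 1
        N[(history, a)] += 1
        Q[(history, a)] += (ret - Q[(history, a)]) / N[(history, a)]
        return ret

    for _ in range(n_sims):
        mdp = sample_mdp()  # root sampling: one posterior draw per simulation
        simulate(mdp, root_state, (), depth)
    return max(actions, key=lambda a: Q[((), a)])
```

Because each simulation is consistent with a single posterior sample, the visit statistics accumulated at each history node implicitly average over the posterior, and no per-step belief update is needed inside the search.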
Reinforcement Learning, Bit by Bit
Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We develop concepts and establish a regret bound that together offer principled guidance. The bound sheds light on questions of what information to seek, how to seek that information, and what information to retain. To illustrate these concepts, we design simple agents that build on them and present computational results that demonstrate improvements in data efficiency.