Randomised Bayesian Least-Squares Policy Iteration
We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy,
model-free, policy iteration algorithm that uses the Bayesian least-squares
temporal-difference (BLSTD) learning algorithm to evaluate policies. An online variant of BLSPI, called randomised BLSPI (RBLSPI), has also been proposed; it improves its policy based on an incomplete policy evaluation step. In the online setting, the exploration-exploitation dilemma must be addressed, since we try to discover the optimal policy using samples we collect ourselves. RBLSPI exploits the ability of BLSTD to quantify the uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions, and then selects actions based on the sampled value function. The effectiveness and the exploration abilities of RBLSPI are demonstrated experimentally in several environments.

Comment: European Workshop on Reinforcement Learning 14, October 2018, Lille, France
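The action-selection step described in the abstract can be made concrete with a short sketch. The Python fragment below is a minimal illustration, not the paper's implementation: it assumes BLSTD maintains a Gaussian posterior N(mu, Sigma) over the weights of a linear state-action value function, and all names (phi, mu, Sigma, n_features) are hypothetical placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed to come from BLSTD: a Gaussian posterior N(mu, Sigma) over
    # the weights w of a linear value function Q(s, a) = phi(s, a) @ w.
    n_features, n_actions = 8, 4
    mu = np.zeros(n_features)      # posterior mean (placeholder values)
    Sigma = np.eye(n_features)     # posterior covariance (placeholder values)

    def phi(state, action):
        # Illustrative feature map: copy the 2-dim state into the block
        # of coordinates owned by the chosen action.
        x = np.zeros(n_features)
        x[2 * action:2 * action + 2] = state
        return x

    def thompson_action(state):
        # 1. Sample one value function from the posterior over weights.
        w = rng.multivariate_normal(mu, Sigma)
        # 2. Act greedily with respect to the sampled value function;
        #    the spread of the posterior is what drives exploration.
        return int(np.argmax([phi(state, a) @ w for a in range(n_actions)]))

    print(thompson_action(np.array([0.5, -0.2])))

Because a fresh weight vector is drawn before acting, actions whose values are uncertain under the posterior are occasionally preferred, which is the Thompson-sampling exploration mechanism the abstract describes.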
Worst-Case Regret Bounds for Exploration via Randomized Value Functions
This paper studies a recent proposal to use randomized value functions to
drive exploration in reinforcement learning. These randomized value functions
are generated by injecting random noise into the training data, making the
approach compatible with many popular methods for estimating parameterized
value functions. By providing a worst-case regret bound for tabular
finite-horizon Markov decision processes, we show that planning with respect to
these randomized value functions can induce provably efficient exploration.
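The noise-injection mechanism is easy to sketch. The fragment below is a hedged illustration of the randomized-value-function idea in the spirit of randomized least-squares value iteration, assuming a linear parameterization; the function name, the Gaussian noise model, and the regularization scheme are assumptions made for this sketch rather than details taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_randomized_values(X, rewards, next_values, gamma, noise_std, reg=1.0):
        # X           : (n, d) features of observed state-action pairs
        # rewards     : (n,) observed one-step rewards
        # next_values : (n,) bootstrapped values of successor states
        # Perturb the regression targets with Gaussian noise (and draw a
        # perturbed prior weight vector), then solve the regularized
        # least-squares problem.  Each call yields one randomized value
        # function; planning greedily against it drives exploration.
        y = rewards + gamma * next_values
        y_noisy = y + rng.normal(0.0, noise_std, size=y.shape)
        w_prior = rng.normal(0.0, noise_std, size=X.shape[1])
        A = X.T @ X + reg * np.eye(X.shape[1])
        b = X.T @ y_noisy + reg * w_prior
        return np.linalg.solve(A, b)

Re-fitting with fresh noise produces a new randomized value function each time; the paper's regret analysis concerns acting greedily with respect to such randomized value functions in tabular finite-horizon MDPs.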