Improved Exploration in Factored Average-Reward MDPs
We consider a regret minimization task under the average-reward criterion in
an unknown Factored Markov Decision Process (FMDP). More specifically, we
consider an FMDP whose state-action space and state space admit factored
forms as Cartesian products of component spaces, and whose transition and
reward functions factor over these components. Assuming a known
factorization structure, we introduce a novel
regret minimization strategy inspired by the popular UCRL2 strategy, called
DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual
elements of the transition function. We show that for a generic factorization
structure, DBN-UCRL achieves a regret bound whose leading term strictly
improves over existing regret bounds in its dependence on the sizes of the
state factors and on the involved diameter-related terms. We further show
that when the factorization structure corresponds to the Cartesian product of
some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of
the base MDPs. We demonstrate, through numerical experiments on standard
environments, that DBN-UCRL enjoys a substantially improved regret empirically
over existing algorithms
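To illustrate the kind of per-element confidence set the abstract refers to, the sketch below computes a standard empirical-Bernstein confidence radius for a single estimated transition probability. This is a generic textbook-style bound, not the exact constants or form used in DBN-UCRL; the function name and constants here are illustrative assumptions.

```python
import math

def bernstein_radius(p_hat: float, n: int, delta: float) -> float:
    """Empirical-Bernstein confidence radius for one estimated
    transition probability p_hat built from n observed samples.
    Generic form for illustration; DBN-UCRL's exact constants differ."""
    if n == 0:
        return 1.0  # no data: the radius covers the whole [0, 1] range
    log_term = math.log(3.0 / delta)
    variance = p_hat * (1.0 - p_hat)  # plug-in variance of a Bernoulli
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * log_term / n

# Confidence set for one element of a factored transition function:
# all probabilities q with |q - p_hat| <= radius, clipped to [0, 1].
p_hat, n, delta = 0.4, 200, 0.05
radius = bernstein_radius(p_hat, n, delta)
lower, upper = max(0.0, p_hat - radius), min(1.0, p_hat + radius)
print(lower, upper)
```

Because the radius scales with the plug-in variance `p_hat * (1 - p_hat)`, such Bernstein-type sets shrink faster for near-deterministic transitions than Hoeffding-style sets, which is the kind of refinement that tightens the per-factor dependence in the regret analysis.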