Multi-Armed Bandits for Correlated Markovian Environments with Smoothed Reward Feedback
We study a multi-armed bandit problem in a dynamic environment where arm
rewards evolve in a correlated fashion according to a Markov chain. Unlike
much of the work on related problems, in our formulation a learning
algorithm does not have access to either a priori information or observations
of the state of the Markov chain and only observes smoothed reward feedback
following time intervals we refer to as epochs. We demonstrate that existing
methods such as UCB and ε-greedy can suffer linear regret in such
an environment. Employing mixing-time bounds on Markov chains, we develop
algorithms called EpochUCB and EpochGreedy that draw inspiration from the
aforementioned methods, yet which admit sublinear regret guarantees for the
problem formulation. Our proposed algorithms proceed in epochs in which an arm
is played repeatedly for a number of iterations that grows linearly as a
function of the number of times an arm has been played in the past. We analyze
these algorithms under two types of smoothed reward feedback at the end of each
epoch: a reward that is the discount-average of the discounted rewards within
an epoch, and a reward that is the time-average of the rewards within an epoch.

Comment: Significant revision of prior version, including a deeper discussion of related work, gap-independent regret bounds, and regret bounds for discounted reward.
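The epoch structure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an epoch-length slope of 1 (epoch length equal to one plus the number of past plays of the arm), uses time-averaged reward feedback, and substitutes i.i.d. Bernoulli arms for the paper's correlated Markovian rewards.

```python
import math
import random

def epoch_ucb(means, horizon, seed=0):
    """Sketch of an EpochUCB-style algorithm: each arm selection lasts an
    epoch whose length grows linearly with how often that arm was chosen
    before, and only the time-averaged reward over the epoch is observed.

    `means` are Bernoulli arm means, a simplifying stand-in for the
    paper's Markovian reward processes."""
    rng = random.Random(seed)
    k = len(means)
    plays = [0] * k   # number of epochs each arm has been selected
    est = [0.0] * k   # running mean of epoch-averaged rewards per arm
    t = 0             # total iterations consumed so far
    while t < horizon:
        if min(plays) == 0:
            arm = plays.index(0)  # select each arm once to initialize
        else:
            n = sum(plays)
            arm = max(range(k),
                      key=lambda a: est[a] + math.sqrt(2 * math.log(n) / plays[a]))
        epoch_len = plays[arm] + 1  # linear epoch growth (assumed slope 1)
        # Only the time-average of rewards within the epoch is revealed.
        avg = sum(rng.random() < means[arm] for _ in range(epoch_len)) / epoch_len
        plays[arm] += 1
        est[arm] += (avg - est[arm]) / plays[arm]
        t += epoch_len
    return plays
```

Over a long horizon the index rule should concentrate epochs on the better arm, e.g. `epoch_ucb([0.9, 0.1], 5000)` selects the first arm in most epochs.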
Decompensation during chronic respiratory failure in children
Optimizing the pre-hospital management of hemorrhagic limb wounds in children