Bounded regret in stochastic multi-armed bandits

Authors: Bubeck, Sébastien; Perchet, Vianney; Rigollet, Philippe
Publication date: 01/02/2013

We study the stochastic multi-armed bandit problem when one knows the value $\mu^{(\star)}$ of an optimal arm, as well as a positive lower bound on the smallest positive gap $\Delta$. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows $\Delta$, and bounded regret of order $1/\Delta$ is not possible if one only knows $\mu^{(\star)}$.
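The quantities named in this abstract can be made concrete with a small sketch. It assumes the standard pseudo-regret definition, $\sum_i (\mu^{(\star)} - \mu_i)\, n_i$, where $n_i$ is the number of pulls of arm $i$. The function names and the example means are hypothetical and are not taken from the paper, and the policy the authors propose is not reproduced here.

```python
from typing import Sequence


def smallest_positive_gap(means: Sequence[float]) -> float:
    """Return the smallest positive gap between the best mean and any other arm.

    Any positive number at or below this value is a valid lower bound
    on the gap in the sense used by the abstract.
    """
    best = max(means)  # value of an optimal arm
    gaps = [best - m for m in means if best - m > 0]
    if not gaps:
        raise ValueError("all arms are optimal, so there is no positive gap")
    return min(gaps)


def pseudo_regret(means: Sequence[float], pull_counts: Sequence[int]) -> float:
    """Pseudo-regret of a play history summarized by per-arm pull counts."""
    if len(means) != len(pull_counts):
        raise ValueError("means and pull_counts must have the same length")
    best = max(means)
    # Each pull of arm i costs (best - means[i]) in expectation.
    return sum((best - m) * n for m, n in zip(means, pull_counts))


if __name__ == "__main__":
    example_means = [0.5, 0.25, 0.0]  # hypothetical arm means
    print(smallest_positive_gap(example_means))
    print(pseudo_regret(example_means, [10, 4, 2]))
```

Under this definition, a regret that is uniformly bounded over time means the total number of suboptimal pulls stays bounded in expectation however long the horizon is.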

arXiv.org e-Print Archive

Princeton University Open Access Repository

Nonparametric Stochastic Contextual Bandits

Authors: Guan, Melody Y.; Jiang, Heinrich
Publication date: 05/01/2018

We analyze the $K$-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $\widetilde{O}\Big(T^{\frac{1+D}{2+D}}\Big)$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement ($k$NN-UCB). We then give global intrinsic-dimension-dependent and ambient-dimension-independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing multi-armed bandit approaches for both simulated tasks and MNIST image classification. Comment: AAAI 201
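As a worked illustration of the regret rate quoted in this abstract, the sketch below evaluates the exponent $\frac{1+D}{2+D}$ for a few context dimensions. It ignores the logarithmic factors hidden by $\widetilde{O}$, and the function name is a hypothetical helper, not something defined in the paper.

```python
from fractions import Fraction


def regret_exponent(context_dim: int) -> Fraction:
    """Exponent of T in the stated rate T^((1 + D) / (2 + D))."""
    if context_dim < 0:
        raise ValueError("context dimension must be non-negative")
    return Fraction(1 + context_dim, 2 + context_dim)


if __name__ == "__main__":
    for d in (1, 2, 10):
        # The exponent stays below 1 (sublinear regret) but approaches 1 as D grows.
        print(d, regret_exponent(d))
```

The exponent is $2/3$ for $D = 1$, $3/4$ for $D = 2$, and $11/12$ for $D = 10$, so the rate remains sublinear in $T$ but weakens as the context dimension grows.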

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret

Authors: Hu, Yichun; Kallus, Nathan
Publication date: 05/06/2020

Dynamic treatment regimes (DTRs) are personalized, adaptive, multi-stage treatment plans that adapt treatment decisions both to an individual's initial features and to intermediate outcomes and features at each subsequent stage, which are affected by decisions in prior stages. Examples include personalized first- and second-line treatments of chronic conditions like diabetes, cancer, and depression, which adapt to patient response to first-line treatment, disease progression, and individual characteristics. While existing literature mostly focuses on estimating the optimal DTR from offline data such as from sequentially randomized trials, we study the problem of developing the optimal DTR in an online manner, where the interaction with each individual affects both our cumulative reward and our data collection for future learning. We term this the DTR bandit problem. We propose a novel algorithm that, by carefully balancing exploration and exploitation, is guaranteed to achieve rate-optimal regret when the transition and reward models are linear. We demonstrate our algorithm and its benefits both in synthetic experiments and in a case study of adaptive treatment of major depressive disorder using real-world data.

arXiv.org e-Print Archive