
    Optimal and Greedy Algorithms for Multi-Armed Bandits with Many Arms

    We characterize Bayesian regret in a stochastic multi-armed bandit problem with a large but finite number of arms. In particular, we assume the number of arms $k$ is $T^{\alpha}$, where $T$ is the time horizon and $\alpha \in (0,1)$. We consider a Bayesian setting where the reward distribution of each arm is drawn independently from a common prior, and provide a complete analysis of expected regret with respect to this prior. Our results exhibit a sharp distinction around $\alpha = 1/2$. When $\alpha < 1/2$, the fundamental lower bound on regret is $\Omega(k)$, and it is achieved by a standard UCB algorithm. When $\alpha > 1/2$, the fundamental lower bound on regret is $\Omega(\sqrt{T})$, and it is achieved by an algorithm that first subsamples $\sqrt{T}$ arms uniformly at random, then runs UCB on just this subset. Interestingly, we also find that a sufficiently large number of arms allows the decision-maker to benefit from "free" exploration if she simply uses a greedy algorithm. In particular, this greedy algorithm exhibits a regret of $\tilde{O}(\max(k, T/\sqrt{k}))$, which translates to a sublinear (though not optimal) regret in the time horizon. We show empirically that this is because the greedy algorithm rapidly disposes of underperforming arms, a beneficial trait in the many-armed regime. Technically, our analysis of the greedy algorithm involves a novel application of the Lundberg inequality, an upper bound for the ruin probability of a random walk; this approach may be of independent interest.
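
    A minimal Python sketch of the subsample-then-UCB idea described in this abstract: draw roughly $\sqrt{T}$ arms uniformly at random and run a standard UCB1 rule on that subset only. The function name, the reward interface, and the UCB1 bonus are illustrative assumptions, not the paper's exact specification.

    import numpy as np

    def subsampled_ucb(reward_fns, T, rng=None):
        # reward_fns: one callable per arm, each returning a stochastic reward in [0, 1].
        # Subsample ~sqrt(T) arms uniformly at random, then run UCB1 on the subset only.
        rng = np.random.default_rng() if rng is None else rng
        k = len(reward_fns)
        m = min(k, int(np.ceil(np.sqrt(T))))             # size of the random subset
        subset = rng.choice(k, size=m, replace=False)    # the subsampled arms
        counts = np.zeros(m)
        means = np.zeros(m)
        total = 0.0
        for t in range(T):
            if t < m:                                    # pull each subset arm once
                i = t
            else:                                        # standard UCB1 index on the subset
                i = int(np.argmax(means + np.sqrt(2.0 * np.log(t + 1) / counts)))
            r = reward_fns[subset[i]]()
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]       # running-mean update
            total += r
        return total

    For instance, to mimic the Bayesian setting of the abstract one could pass Bernoulli arms whose means are drawn i.i.d. from a Uniform(0, 1) prior; this setup is only an example, not the paper's experimental design.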

    Lifelong Learning in Multi-Armed Bandits

    Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long-standing machine learning problem. In this paper, we study the problem in the multi-armed bandit framework with the objective of minimizing the total regret incurred over a series of tasks. While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution, which may change over time. We specifically focus on confidence-interval tuning of UCB algorithms. We propose a bandit-over-bandit approach with greedy algorithms and perform extensive experimental evaluations in both stationary and non-stationary environments. We further apply our solution to the mortal bandit problem, showing empirical improvement over previous work.
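
    As a rough illustration of the confidence-interval tuning and bandit-over-bandit ideas above, the sketch below runs an inner UCB whose exploration width is chosen per task by an outer greedy rule over a small candidate grid. The Gaussian reward model, the grid of widths, and the function names are assumptions made for this example, not the paper's construction.

    import numpy as np

    def ucb_run(arm_means, T, c, rng):
        # Inner UCB with a tunable confidence width c; returns the total reward on one task.
        k = len(arm_means)
        counts = np.zeros(k)
        means = np.zeros(k)
        total = 0.0
        for t in range(T):
            if t < k:                                    # pull each arm once
                i = t
            else:
                i = int(np.argmax(means + c * np.sqrt(np.log(t + 1) / counts)))
            r = rng.normal(arm_means[i], 1.0)            # Gaussian rewards (an assumption)
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
            total += r
        return total

    def bandit_over_bandit(task_sampler, n_tasks, T, widths=(0.1, 0.5, 1.0, 2.0), rng=None):
        # Outer greedy meta-bandit over candidate confidence widths: try each width once,
        # then reuse the width with the best average per-task reward on every new task.
        rng = np.random.default_rng() if rng is None else rng
        pulls = np.zeros(len(widths))
        avg = np.zeros(len(widths))
        rewards = []
        for task in range(n_tasks):
            j = task if task < len(widths) else int(np.argmax(avg))
            r = ucb_run(task_sampler(rng), T, widths[j], rng)
            pulls[j] += 1
            avg[j] += (r - avg[j]) / pulls[j]
            rewards.append(r)
        return rewards

    Here task_sampler(rng) is assumed to return the arm means of one task drawn from a prior, e.g. lambda rng: rng.uniform(0, 1, size=10), matching the abstract's "bandit instances drawn from some prior distribution."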

    Toward Better Use of Data in Linear Bandits

    In this paper, we study the well-known stochastic linear bandit problem, in which a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a horizon of length $T$. We first introduce a general analysis framework and a family of rate-optimal algorithms for the problem. We show that this family includes well-known algorithms such as optimism in the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS) as special cases. The proposed analysis technique directly captures the complexity of uncertainty in the action sets, which we show is tied to the regret analysis of any policy. This insight allows us to design a new rate-optimal policy, called Sieved-Greedy (SG), that reduces the over-exploration problem in existing algorithms. SG uses data to discard the actions with relatively low uncertainty and then chooses greedily among the remaining actions. In addition to proving that SG is theoretically rate-optimal, our empirical simulations show that SG significantly outperforms existing benchmarks such as greedy, OFUL, and TS. Moreover, our analysis technique yields a number of new results, such as poly-logarithmic (in $T$) regret bounds for OFUL and TS under a generalized gap assumption and a margin condition, as in the literature on contextual bandits. We also improve the regret bounds of these algorithms for the sub-class of $k$-armed contextual bandit problems by a factor of $\sqrt{k}$.
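
    Based only on the one-line description of Sieved-Greedy in this abstract, a sketch of the sieve-then-greedy step for a linear bandit might look as follows: estimate the parameter by ridge regression, discard actions whose uncertainty $\|x\|_{V^{-1}}$ is relatively low, and pick greedily among the survivors. The threshold rule (a fraction kappa of the largest uncertainty), the simulated reward model, and all names are illustrative assumptions, not the paper's algorithm.

    import numpy as np

    def sieved_greedy_sketch(action_sets, theta_star, T, lam=1.0, kappa=0.5, rng=None):
        # action_sets: callable t -> array of shape (n_actions, d) of feature vectors.
        # theta_star:  true parameter, used here only to simulate noisy linear rewards.
        # kappa:       sieve threshold (assumption): keep actions whose uncertainty is at
        #              least kappa times the largest uncertainty in the current action set.
        rng = np.random.default_rng() if rng is None else rng
        d = len(theta_star)
        V = lam * np.eye(d)                              # ridge-regularized design matrix
        b = np.zeros(d)
        total = 0.0
        for t in range(T):
            X = action_sets(t)
            V_inv = np.linalg.inv(V)
            theta_hat = V_inv @ b                        # ridge-regression estimate
            unc = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))   # ||x||_{V^{-1}} per action
            keep = unc >= kappa * unc.max()              # sieve: drop low-uncertainty actions
            est = X @ theta_hat
            est[~keep] = -np.inf                         # discarded actions are never chosen
            i = int(np.argmax(est))                      # greedy choice among the survivors
            x = X[i]
            r = float(x @ theta_star) + rng.normal()     # noisy linear reward
            V += np.outer(x, x)
            b += r * x
            total += r
        return total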