3 research outputs found
Optimal and Greedy Algorithms for Multi-Armed Bandits with Many Arms
We characterize Bayesian regret in a stochastic multi-armed bandit problem
with a large but finite number of arms. In particular, we assume the number of
arms is $k = n^\beta$, where $n$ is the time-horizon and $\beta$ is in
$(0,1)$. We consider a Bayesian setting where the reward distribution of each
arm is drawn independently from a common prior, and provide a complete analysis
of expected regret with respect to this prior. Our results exhibit a sharp
distinction around $\beta = 1/2$. When $\beta < 1/2$, the fundamental lower
bound on regret is $\Omega(k)$; and it is achieved by a standard UCB algorithm.
When $\beta > 1/2$, the fundamental lower bound on regret is $\Omega(\sqrt{n})$,
and it is achieved by an algorithm that first subsamples $\sqrt{n}$ arms
uniformly at random, then runs UCB on just this subset.
Interestingly, we also find that a sufficiently large number of arms allows the
decision-maker to benefit from "free" exploration if she simply uses a greedy
algorithm. In particular, this greedy algorithm exhibits a regret of
$\tilde{O}(\max(k, n/\sqrt{k}))$, which translates to a {\em sublinear} (though
not optimal) regret in the time horizon. We show empirically that this is
because the greedy algorithm rapidly disposes of underperforming arms, a
beneficial trait in the many-armed regime. Technically, our analysis of the
greedy algorithm involves a novel application of the Lundberg inequality, an
upper bound for the ruin probability of a random walk; this approach may be of
independent interest.
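To make the algorithms above concrete, here is a minimal Python sketch (our illustration, not the authors' code; the Bernoulli rewards, uniform prior, UCB constant, and the choice $\beta = 0.7$ are assumptions) of standard UCB, the subsampled variant that keeps only $\sqrt{n}$ random arms, and the greedy rule:

```python
import numpy as np

def ucb_regret(means, n, rng, arms=None):
    """UCB1 restricted to the index set `arms` (default: all arms);
    returns cumulative regret against the best arm overall."""
    arms = np.arange(len(means)) if arms is None else np.asarray(arms)
    best = means.max()
    pulls = np.zeros(len(arms))
    total = np.zeros(len(arms))
    regret = 0.0
    for t in range(1, n + 1):
        if t <= len(arms):
            i = t - 1                                   # try each arm once
        else:
            i = int(np.argmax(total / pulls + np.sqrt(2 * np.log(t) / pulls)))
        a = arms[i]
        total[i] += rng.binomial(1, means[a])           # Bernoulli reward
        pulls[i] += 1
        regret += best - means[a]
    return regret

def ss_ucb_regret(means, n, rng):
    """Subsampled UCB: draw ~sqrt(n) arms uniformly at random, run UCB on them."""
    m = min(len(means), int(np.ceil(np.sqrt(n))))
    subset = rng.choice(len(means), size=m, replace=False)
    return ucb_regret(means, n, rng, arms=subset)

def greedy_regret(means, n, rng):
    """Greedy: pull each arm once, then always pull the empirically best arm;
    arms whose early rewards are poor are effectively never revisited."""
    k = len(means)
    pulls, total, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(1, n + 1):
        a = t - 1 if t <= k else int(np.argmax(total / pulls))
        total[a] += rng.binomial(1, means[a])
        pulls[a] += 1
        regret += means.max() - means[a]
    return regret

rng = np.random.default_rng(0)
n = 10_000
k = int(n ** 0.7)                    # many-armed regime: k = n^beta, beta > 1/2
means = rng.uniform(size=k)          # arm means drawn i.i.d. from a common prior
for alg in (ucb_regret, ss_ucb_regret, greedy_regret):
    print(alg.__name__, round(alg(means, n, rng), 1))
```

The greedy rule in this sketch never revisits an arm whose running mean has fallen behind the leader's, which is exactly the rapid disposal of underperforming arms described above.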
Lifelong Learning in Multi-Armed Bandits
Continuously learning and leveraging the knowledge accumulated from prior
tasks in order to improve future performance is a long-standing machine
learning problem. In this paper, we study the problem in the multi-armed bandit
framework, with the objective of minimizing the total regret incurred over a
series of tasks. While most bandit algorithms are designed to have a low
worst-case regret, we examine here the average regret over bandit instances
drawn from some prior distribution which may change over time. We specifically
focus on confidence interval tuning of UCB algorithms. We propose a
bandit-over-bandit approach with greedy algorithms and perform extensive
experimental evaluations in both stationary and non-stationary environments. We
further apply our solution to the mortal bandit problem, showing empirical
improvement over previous work.
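A minimal sketch of one such bandit-over-bandit scheme (our illustration under assumptions: the candidate confidence scales, Beta task prior, and per-task horizon below are not from the paper): an outer greedy learner selects a UCB confidence scale for each task and receives that task's total reward as feedback.

```python
import numpy as np

def run_ucb(means, horizon, c, rng):
    """One task: UCB whose confidence width is scaled by c; returns total reward."""
    k = len(means)
    pulls, total = np.zeros(k), np.zeros(k)
    reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1                                  # initialize every arm
        else:
            a = int(np.argmax(total / pulls + c * np.sqrt(np.log(t) / pulls)))
        r = rng.binomial(1, means[a])
        total[a] += r
        pulls[a] += 1
        reward += r
    return reward

rng = np.random.default_rng(1)
widths = [0.1, 0.5, 1.0, 2.0]             # candidate confidence scales (assumed)
sums, counts = np.zeros(len(widths)), np.zeros(len(widths))
for task in range(200):                   # a series of task instances
    means = rng.beta(2, 5, size=10)       # illustrative prior over instances
    j = task if task < len(widths) else int(np.argmax(sums / counts))
    r = run_ucb(means, 500, widths[j], rng)     # inner bandit: UCB on the task
    sums[j] += r                                # outer bandit: greedy over widths
    counts[j] += 1
print("scale chosen most often:", widths[int(np.argmax(counts))])
```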
Toward Better Use of Data in Linear Bandits
In this paper, we study the well-known stochastic linear bandit problem where
a decision-maker sequentially chooses among a set of given actions, observes
their noisy rewards, and aims to maximize her cumulative expected reward over a
horizon of length $T$. We first introduce a general analysis framework and a
family of rate-optimal algorithms for the problem. We show that
this family of algorithms includes well-known algorithms such as optimism in
the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS) as
special cases. The proposed analysis technique directly captures the complexity
of uncertainty in the action sets, which we show is tied to the regret of any
policy. This insight allows us to design a new rate-optimal policy, called
Sieved-Greedy (SG), that reduces the over-exploration problem in existing
algorithms. SG uses the data to discard actions with relatively low uncertainty
and then chooses greedily among the remaining actions. In addition to proving
that SG is rate-optimal, we show in empirical simulations that SG significantly
outperforms existing benchmarks such as greedy, OFUL, and TS. Moreover, our
analysis technique yields a number of new results, such as poly-logarithmic (in
$T$) regret bounds for OFUL and TS under a generalized gap assumption and a
margin condition, as in the literature on contextual bandits. We also improve
the regret bounds of these algorithms for the sub-class of $k$-armed contextual
bandit problems by a factor.
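As a concrete illustration of the sieve-then-exploit idea, here is a minimal sketch of one round of such a policy (our reading of the description above, not the paper's exact SG policy; the ridge-regression estimate, the elliptical-norm uncertainty $\|x\|_{V^{-1}}$, and the sieve level `kappa` are assumptions):

```python
import numpy as np

def sieved_greedy_step(X, V, b, kappa=0.5):
    """One round of a sieved-greedy style rule for a linear bandit.

    X : (num_actions, d) array of available action features.
    V : (d, d) regularized design matrix, V = lambda*I + sum x x^T.
    b : (d,) response vector, b = sum r * x.
    kappa : sieve level in (0, 1] (assumed; not from the paper).
    """
    theta_hat = np.linalg.solve(V, b)                     # ridge estimate
    V_inv = np.linalg.inv(V)
    unc = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))  # ||x||_{V^{-1}}
    keep = unc >= kappa * unc.max()        # sieve: drop low-uncertainty actions
    est = X @ theta_hat
    est[~keep] = -np.inf                   # greedy choice among survivors
    return int(np.argmax(est))

# Tiny simulation: d-dimensional actions, unknown theta, Gaussian noise.
rng = np.random.default_rng(2)
d, n, num_actions = 5, 2000, 50
theta = rng.normal(size=d) / np.sqrt(d)
V, b = np.eye(d), np.zeros(d)              # lambda = 1 regularization
regret = 0.0
for t in range(n):
    X = rng.normal(size=(num_actions, d))  # fresh action set each round
    a = sieved_greedy_step(X, V, b)
    r = X[a] @ theta + rng.normal()
    V += np.outer(X[a], X[a])
    b += r * X[a]
    regret += (X @ theta).max() - X[a] @ theta
print("cumulative regret:", round(regret, 1))
```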