
    Optimal and Greedy Algorithms for Multi-Armed Bandits with Many Arms

    We characterize Bayesian regret in a stochastic multi-armed bandit problem with a large but finite number of arms. In particular, we assume the number of arms $k$ is $T^{\alpha}$, where $T$ is the time horizon and $\alpha \in (0,1)$. We consider a Bayesian setting where the reward distribution of each arm is drawn independently from a common prior, and provide a complete analysis of expected regret with respect to this prior. Our results exhibit a sharp distinction around $\alpha = 1/2$. When $\alpha < 1/2$, the fundamental lower bound on regret is $\Omega(k)$, and it is achieved by a standard UCB algorithm. When $\alpha > 1/2$, the fundamental lower bound on regret is $\Omega(\sqrt{T})$, and it is achieved by an algorithm that first subsamples $\sqrt{T}$ arms uniformly at random, then runs UCB on just this subset. Interestingly, we also find that a sufficiently large number of arms allows the decision-maker to benefit from "free" exploration if she simply uses a greedy algorithm. In particular, this greedy algorithm exhibits a regret of $\tilde{O}(\max(k, T/\sqrt{k}))$, which translates to a sublinear (though not optimal) regret in the time horizon. We show empirically that this is because the greedy algorithm rapidly disposes of underperforming arms, a beneficial trait in the many-armed regime. Technically, our analysis of the greedy algorithm involves a novel application of the Lundberg inequality, an upper bound for the ruin probability of a random walk; this approach may be of independent interest.
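
    A minimal Python sketch of the subsample-then-UCB idea described in this abstract: draw roughly $\sqrt{T}$ arms uniformly at random and run a standard UCB1 rule on that subset only. The function name, the reward interface, and the UCB1 bonus are illustrative assumptions, not the paper's exact specification.

    import numpy as np

    def subsampled_ucb(reward_fns, T, rng=None):
        # reward_fns: one callable per arm, each returning a stochastic reward in [0, 1].
        # Subsample ~sqrt(T) arms uniformly at random, then run UCB1 on the subset only.
        rng = np.random.default_rng() if rng is None else rng
        k = len(reward_fns)
        m = min(k, int(np.ceil(np.sqrt(T))))             # size of the random subset
        subset = rng.choice(k, size=m, replace=False)    # the subsampled arms
        counts = np.zeros(m)
        means = np.zeros(m)
        total = 0.0
        for t in range(T):
            if t < m:                                    # pull each subset arm once
                i = t
            else:                                        # standard UCB1 index on the subset
                i = int(np.argmax(means + np.sqrt(2.0 * np.log(t + 1) / counts)))
            r = reward_fns[subset[i]]()
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]       # running-mean update
            total += r
        return total

    For instance, to mimic the Bayesian setting of the abstract one could pass Bernoulli arms whose means are drawn i.i.d. from a Uniform(0, 1) prior; this setup is only an example, not the paper's experimental design.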

    Lifelong Learning in Multi-Armed Bandits

    Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long-standing machine learning problem. In this paper, we study the problem in the multi-armed bandit framework with the objective of minimizing the total regret incurred over a series of tasks. While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution, which may change over time. We specifically focus on confidence-interval tuning of UCB algorithms. We propose a bandit-over-bandit approach with greedy algorithms and perform extensive experimental evaluations in both stationary and non-stationary environments. We further apply our solution to the mortal bandit problem, showing empirical improvement over previous work.
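
    As a rough illustration of the confidence-interval tuning and bandit-over-bandit ideas above, the sketch below runs an inner UCB whose exploration width is chosen per task by an outer greedy rule over a small candidate grid. The Gaussian reward model, the grid of widths, and the function names are assumptions made for this example, not the paper's construction.

    import numpy as np

    def ucb_run(arm_means, T, c, rng):
        # Inner UCB with a tunable confidence width c; returns the total reward on one task.
        k = len(arm_means)
        counts = np.zeros(k)
        means = np.zeros(k)
        total = 0.0
        for t in range(T):
            if t < k:                                    # pull each arm once
                i = t
            else:
                i = int(np.argmax(means + c * np.sqrt(np.log(t + 1) / counts)))
            r = rng.normal(arm_means[i], 1.0)            # Gaussian rewards (an assumption)
            counts[i] += 1
            means[i] += (r - means[i]) / counts[i]
            total += r
        return total

    def bandit_over_bandit(task_sampler, n_tasks, T, widths=(0.1, 0.5, 1.0, 2.0), rng=None):
        # Outer greedy meta-bandit over candidate confidence widths: try each width once,
        # then reuse the width with the best average per-task reward on every new task.
        rng = np.random.default_rng() if rng is None else rng
        pulls = np.zeros(len(widths))
        avg = np.zeros(len(widths))
        rewards = []
        for task in range(n_tasks):
            j = task if task < len(widths) else int(np.argmax(avg))
            r = ucb_run(task_sampler(rng), T, widths[j], rng)
            pulls[j] += 1
            avg[j] += (r - avg[j]) / pulls[j]
            rewards.append(r)
        return rewards

    Here task_sampler(rng) is assumed to return the arm means of one task drawn from a prior, e.g. lambda rng: rng.uniform(0, 1, size=10), matching the abstract's "bandit instances drawn from some prior distribution."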

    Toward Better Use of Data in Linear Bandits

    In this paper, we study the well-known stochastic linear bandit problem, in which a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a horizon of length $T$. We first introduce a general analysis framework and a family of rate-optimal algorithms for the problem. We show that this family includes well-known algorithms such as optimism in the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS) as special cases. The proposed analysis technique directly captures the complexity of uncertainty in the action sets, which we show is tied to the regret analysis of any policy. This insight allows us to design a new rate-optimal policy, called Sieved-Greedy (SG), that reduces the over-exploration problem in existing algorithms. SG uses data to discard the actions with relatively low uncertainty and then chooses greedily among the remaining actions. In addition to proving that SG is theoretically rate-optimal, our empirical simulations show that SG significantly outperforms existing benchmarks such as greedy, OFUL, and TS. Moreover, our analysis technique yields a number of new results, such as poly-logarithmic (in $T$) regret bounds for OFUL and TS under a generalized gap assumption and a margin condition, as in the literature on contextual bandits. We also improve the regret bounds of these algorithms for the sub-class of $k$-armed contextual bandit problems by a factor of $\sqrt{k}$.
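
    Based only on the one-line description of Sieved-Greedy in this abstract, a sketch of the sieve-then-greedy step for a linear bandit might look as follows: estimate the parameter by ridge regression, discard actions whose uncertainty $\|x\|_{V^{-1}}$ is relatively low, and pick greedily among the survivors. The threshold rule (a fraction kappa of the largest uncertainty), the simulated reward model, and all names are illustrative assumptions, not the paper's algorithm.

    import numpy as np

    def sieved_greedy_sketch(action_sets, theta_star, T, lam=1.0, kappa=0.5, rng=None):
        # action_sets: callable t -> array of shape (n_actions, d) of feature vectors.
        # theta_star:  true parameter, used here only to simulate noisy linear rewards.
        # kappa:       sieve threshold (assumption): keep actions whose uncertainty is at
        #              least kappa times the largest uncertainty in the current action set.
        rng = np.random.default_rng() if rng is None else rng
        d = len(theta_star)
        V = lam * np.eye(d)                              # ridge-regularized design matrix
        b = np.zeros(d)
        total = 0.0
        for t in range(T):
            X = action_sets(t)
            V_inv = np.linalg.inv(V)
            theta_hat = V_inv @ b                        # ridge-regression estimate
            unc = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))   # ||x||_{V^{-1}} per action
            keep = unc >= kappa * unc.max()              # sieve: drop low-uncertainty actions
            est = X @ theta_hat
            est[~keep] = -np.inf                         # discarded actions are never chosen
            i = int(np.argmax(est))                      # greedy choice among the survivors
            x = X[i]
            r = float(x @ theta_star) + rng.normal()     # noisy linear reward
            V += np.outer(x, x)
            b += r * x
            total += r
        return total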