
    Ballooning Multi-Armed Bandits

    In this paper, we introduce Ballooning Multi-Armed Bandits (BL-MAB), a novel extension of the classical stochastic MAB model. In the BL-MAB model, the set of available arms grows (or balloons) over time. In contrast to the classical MAB setting, where the regret is computed with respect to the best arm overall, the regret in the BL-MAB setting is computed with respect to the best arm available at each time. We first observe that the existing stochastic MAB algorithms result in linear regret for the BL-MAB model. We prove that, if the best arm is equally likely to arrive at any time instant, sub-linear regret cannot be achieved. Next, we show that if the best arm is more likely to arrive in the early rounds, one can achieve sub-linear regret. Our proposed algorithm determines (1) the fraction of the time horizon for which newly arriving arms should be explored and (2) the sequence of arm pulls in the exploitation phase from among the explored arms. Under reasonable assumptions on the arrival distribution of the best arm, stated in terms of the thinness of the distribution's tail, we prove that the proposed algorithm achieves sub-linear instance-independent regret. We further quantify the explicit dependence of the regret on the arrival distribution parameters. We reinforce our theoretical findings with extensive simulation results. We conclude by showing that our algorithm would achieve sub-linear regret even if (a) the distributional parameters are not exactly known but are obtained using a reasonable learning mechanism, or (b) the best arm is not more likely to arrive early, but a large fraction of arms is likely to arrive relatively early.
    Comment: A full version of this paper is accepted in the Artificial Intelligence Journal (AIJ), Elsevier. A preliminary version was published as an extended abstract in AAMAS 2020: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020.
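
    As a rough illustration of the explore-then-exploit structure described in the abstract, the Python sketch below explores arms only if they arrive within an assumed fraction of the horizon and afterwards exploits the empirically best explored arm. The arrival model and the parameters `explore_frac` and `pulls_per_new_arm` are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

# Toy sketch of the explore-then-exploit structure described above. The arrival
# model and all parameter values are illustrative assumptions, not the paper's.
rng = np.random.default_rng(0)
T = 5_000                # time horizon
explore_frac = 0.3       # assumed fraction of the horizon reserved for exploring new arms
pulls_per_new_arm = 10   # assumed per-arm exploration budget

true_means, counts, sums = [], [], []   # statistics for arms that have arrived so far
for t in range(T):
    # New arms "balloon" in over time; here arrivals are more likely in early rounds.
    if rng.random() < 5.0 / (t + 5):
        true_means.append(rng.random()); counts.append(0); sums.append(0.0)
    if not true_means:
        continue
    emp = [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]
    if t < explore_frac * T:
        # Exploration phase: give each arm that has arrived so far its exploration budget.
        under_explored = [i for i, c in enumerate(counts) if c < pulls_per_new_arm]
        arm = under_explored[0] if under_explored else int(np.argmax(emp))
    else:
        # Exploitation phase: arms arriving from now on are ignored;
        # keep pulling the empirically best arm among those already explored.
        arm = int(np.argmax(emp))
    reward = rng.binomial(1, true_means[arm])
    counts[arm] += 1; sums[arm] += reward
```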

    Monitoring and control of stochastic systems


    Optimal Learning with Non-Gaussian Rewards

    In this dissertation, we study sequential Bayesian learning problems modeled under non-Gaussian distributions. We focus on a class of problems called the multi-armed bandit problem and study its optimal learning strategy, the Gittins index policy. The Gittins index is computationally intractable, and approximation methods have been developed for Gaussian reward problems. We construct a novel theoretical and computational framework for the Gittins index under non-Gaussian rewards. By interpolating the rewards using continuous-time conditional Lévy processes, we recast the optimal stopping problems that characterize Gittins indices into free-boundary partial integro-differential equations (PIDEs). We also provide additional structural properties and numerical illustrations on how our approach can be used to approximate the Gittins index.
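
    As a simple illustration of what a Gittins index computation looks like in the basic Bernoulli-reward case, the sketch below approximates the index via the retirement ("calibration") formulation with backward induction over a truncated horizon. It is a textbook-style approximation under assumed discount factor, horizon, and tolerance, not the dissertation's Lévy-process/PIDE framework.

```python
def gittins_index_bernoulli(a, b, gamma=0.9, horizon=50, tol=1e-3):
    """Approximate the Gittins index of a Bernoulli arm with Beta(a, b) posterior.

    Retirement formulation: the index is the retirement rate lam at which stopping
    and continuing are equally attractive; the optimal stopping value is computed
    by backward induction over a truncated horizon. Illustrative sketch only --
    not the dissertation's Levy-process/PIDE method.
    """
    def value_of_continuing(lam):
        retire = lam / (1.0 - gamma)
        # Terminal layer: after `horizon` further pulls, the arm must be retired.
        V_next = {(s, f): retire
                  for s in range(horizon + 1) for f in range(horizon + 1 - s)}
        for t in range(horizon - 1, -1, -1):
            V = {}
            for s in range(t + 1):
                for f in range(t + 1 - s):
                    p = (a + s) / (a + s + b + f)   # posterior mean after s successes, f failures
                    cont = p + gamma * (p * V_next[(s + 1, f)]
                                        + (1 - p) * V_next[(s, f + 1)])
                    V[(s, f)] = max(retire, cont)
            V_next = V
        return V_next[(0, 0)]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if value_of_continuing(lam) > lam / (1.0 - gamma) + 1e-12:
            lo = lam    # continuing beats retiring, so the index exceeds lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

# Example: an arm with a uniform Beta(1, 1) prior.
print(round(gittins_index_bernoulli(1, 1), 3))
```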

    Competing Against Adaptive Agents by Minimizing Counterfactual Notions of Regret

    Online learning, or sequential decision making, is formally defined as a repeated game between an adversary and a player. At every round of the game, the player chooses an action from a fixed action set and the adversary reveals a reward/loss for the action played. The goal of the player is to maximize the cumulative reward of her actions. The rewards/losses could be sampled from an unknown distribution, or other, less restrictive assumptions can be made. The standard measure of performance is the cumulative regret, that is, the difference between the cumulative reward of the player and the best achievable reward of a fixed action, or more generally a fixed policy, on the observed reward sequence. For adversaries that are oblivious to the player's strategy, regret is a meaningful measure. However, the adversary is usually adaptive: in healthcare, a patient will respond to the treatments given, and for self-driving cars, other traffic will react to the behavior of the autonomous agent. In such settings the notion of regret is hard to interpret, as the best action in hindsight might not be the best action overall, given the behavior of the adversary. To resolve this problem, a new notion called policy regret is introduced. Policy regret is fundamentally different from other forms of regret in that it is counterfactual in nature: the player competes against all other policies, whose reward is calculated by taking into account how the adversary would have behaved had the player followed that policy. This thesis studies policy regret in a partial (bandit) feedback environment, beyond the worst-case setting, by leveraging additional structure such as stochasticity/stability of the adversary or additional feedback.
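
    To make the counterfactual nature of policy regret concrete, the toy Python simulation below (not taken from the thesis) pits a player against a hypothetical memory-one adversary whose reward for the current action depends on the player's previous action. Policy regret is measured by replaying each constant policy against the adversary's reaction to that policy, rather than against the reward sequence the player actually observed.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 10_000

def adversary_reward(action, prev_action, rng):
    # Hypothetical memory-one adversary: it rewards "loyalty", paying a bonus
    # whenever the player repeats the previous action.
    base = np.array([0.20, 0.25, 0.30])
    bonus = 0.5 if action == prev_action else 0.0
    return rng.binomial(1, base[action] + bonus)

# Player: epsilon-greedy bandit strategy (it switches arms while exploring).
eps = 0.1
counts, sums = np.zeros(K), np.zeros(K)
prev, player_total = -1, 0.0
for t in range(T):
    if rng.random() < eps or counts.min() == 0:
        a = int(rng.integers(K))              # explore
    else:
        a = int(np.argmax(sums / counts))     # exploit
    r = adversary_reward(a, prev, rng)
    counts[a] += 1; sums[a] += r; player_total += r; prev = a

# Counterfactual replay: the reward each constant policy would have earned had
# the adversary been reacting to *that* policy for all T rounds.
constant_policy_totals = []
for a in range(K):
    prev_cf, total_cf = -1, 0.0
    for t in range(T):
        total_cf += adversary_reward(a, prev_cf, rng)
        prev_cf = a
    constant_policy_totals.append(total_cf)

policy_regret = max(constant_policy_totals) - player_total
print(f"policy regret over {T} rounds: {policy_regret:.0f}")
```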