11 research outputs found

    Nonstochastic Multiarmed Bandits with Unrestricted Delays

    We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT + D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $O(\sqrt{KT\ln K} + |S_\Theta| + \sqrt{D_{\bar\Theta}\ln K})$, where $|S_\Theta|$ is the number of observations with delay exceeding a threshold $\Theta$, and $D_{\bar\Theta}$ is the total delay of observations with delay below $\Theta$. The bound relaxes to $O(\sqrt{KT\ln K} + \sqrt{D\ln K})$, but we also provide examples where $D_{\bar\Theta} \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.
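
    The skipping wrapper is easy to picture in code. Below is a minimal, illustrative sketch of "delayed" Exp3 combined with a fixed skipping threshold, assuming losses in [0, 1] and delays revealed when feedback arrives; the function name, the fixed threshold, and the bookkeeping conventions are assumptions for illustration, not the paper's exact algorithm or tuning.

```python
import numpy as np

def delayed_exp3_with_skipping(loss_fn, delays, K, T, eta, threshold):
    """Sketch of Exp3 under delayed feedback with a skipping wrapper.

    loss_fn(t, arm) returns the loss in [0, 1] of pulling `arm` at round t;
    delays[t] is the delay of round t's feedback. Rounds whose delay exceeds
    `threshold` are skipped, i.e. their feedback is never used.
    """
    rng = np.random.default_rng(0)
    cum_est = np.zeros(K)       # accumulated importance-weighted loss estimates
    pending = {}                # arrival round -> list of (arm, prob, loss) tuples
    total_loss = 0.0

    for t in range(T):
        # fold in every piece of feedback that becomes observable at round t
        for arm, prob, loss in pending.pop(t, []):
            cum_est[arm] += loss / prob
        # exponential-weights distribution over the K arms
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        loss = loss_fn(t, arm)
        total_loss += loss
        if delays[t] <= threshold:   # skipping wrapper: drop overly delayed rounds
            pending.setdefault(t + delays[t] + 1, []).append((arm, p[arm], loss))
    return total_loss
```

    For example, delayed_exp3_with_skipping(lambda t, a: float(a != t % 3), delays=[5] * 1000, K=3, T=1000, eta=0.05, threshold=50) runs the sketch on a toy loss sequence with a constant delay of 5 rounds.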

    An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays

    We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $\mathcal{O}(\sqrt{kn}+\sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves $\mathcal{O}(\sqrt{kn}+\min_{S}|S|+\sqrt{D_{\bar S}\log(k)})$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar S = [n]\setminus S$ are the counted rounds, and $D_{\bar S}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
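
    As a rough illustration of the hybrid-regularizer idea, the sketch below performs a single FTRL step over the probability simplex with a regularizer combining the 1/2-Tsallis entropy and the negative entropy, solved numerically with SciPy. The learning rates eta and gamma, the numerical solver, and the function name are assumptions for illustration; they are not the paper's tuning, and in the delayed setting the cumulative loss estimates would contain only feedback that has already arrived.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_hybrid_step(cum_loss_est, eta, gamma):
    """One FTRL step with a hybrid regularizer (1/2-Tsallis entropy plus
    negative entropy) over the probability simplex, solved numerically.
    Illustrative sketch only, not the paper's exact algorithm or tuning.
    """
    cum_loss_est = np.asarray(cum_loss_est, dtype=float)
    k = len(cum_loss_est)

    def objective(x):
        x = np.clip(x, 1e-12, 1.0)
        tsallis = -np.sum(np.sqrt(x)) / eta          # drives the sqrt(kn) term
        neg_entropy = np.sum(x * np.log(x)) / gamma  # drives the delay-dependent term
        return float(cum_loss_est @ x + tsallis + neg_entropy)

    x0 = np.full(k, 1.0 / k)
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(1e-12, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
    x = np.clip(res.x, 1e-12, None)
    return x / x.sum()
```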

    Gradient-free Online Learning in Games with Delayed Rewards

    Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium (NE) with probability 1, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
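
    The priority-queue mechanism lends itself to a short sketch. The class below buffers delayed, possibly out-of-order payoffs in a min-heap keyed by the round they refer to, and updates a continuous action with a one-point, payoff-based gradient estimate; the class name, step size, perturbation radius, and box projection are illustrative assumptions rather than the paper's exact policy.

```python
import heapq
import numpy as np

class DelayedPayoffLearner:
    """Gradient-free learner that buffers delayed payoffs in a priority queue.
    Illustrative sketch only, not the exact policy from the paper.
    """

    def __init__(self, dim, step=0.01, perturb=0.1, seed=0):
        self.x = np.zeros(dim)                 # current base action
        self.step = step                       # gradient step size
        self.perturb = perturb                 # perturbation radius
        self.rng = np.random.default_rng(seed)
        self.queue = []                        # min-heap of (round, payoff, direction)

    def act(self):
        """Play a randomly perturbed action; return it with its direction."""
        u = self.rng.normal(size=self.x.shape)
        u /= np.linalg.norm(u)
        return self.x + self.perturb * u, u

    def receive(self, round_idx, payoff, direction):
        """Store feedback as it arrives, however late or out of order."""
        heapq.heappush(self.queue, (round_idx, payoff, tuple(direction)))

    def update(self):
        """Consume all buffered feedback, oldest round first."""
        while self.queue:
            _, payoff, direction = heapq.heappop(self.queue)
            # one-point estimate of the payoff gradient at the base action
            grad_est = (len(self.x) / self.perturb) * payoff * np.asarray(direction)
            self.x = np.clip(self.x + self.step * grad_est, -1.0, 1.0)
```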

    Banker Online Mirror Descent

    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
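
    For context, the sketch below shows only the vanilla OMD update with the negative-entropy mirror map (a multiplicative-weights step) that delayed-feedback frameworks of this kind build on; the "banker" bookkeeping that manages learning rates across delayed rounds is not reproduced here, and the function name is an illustrative assumption.

```python
import numpy as np

def omd_entropy_step(p, loss_est, eta):
    """One OMD step with the negative-entropy mirror map on the simplex,
    i.e. a multiplicative-weights update. Baseline illustration only."""
    w = np.asarray(p) * np.exp(-eta * np.asarray(loss_est))
    return w / w.sum()
```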

    Learning Adversarial Markov Decision Processes with Delayed Feedback

    Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback. Comment: AAAI 202
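
    The delayed-feedback schedule described above reduces to simple bookkeeping. In the sketch below, run_episode and policy_update are assumed, user-supplied callables, and the loop only models when feedback becomes observable; it is not the paper's algorithm.

```python
from collections import defaultdict

def run_delayed_episodes(num_episodes, delays, run_episode, policy_update, policy):
    """Run episodes where the costs/trajectory of episode k are revealed only
    at the end of episode k + delays[k]. Bookkeeping sketch only.
    """
    arrivals = defaultdict(list)              # episode index -> feedback arriving then
    for k in range(num_episodes):
        feedback = run_episode(policy, k)     # episode k's trajectory/costs, not yet observed
        arrivals[k + delays[k]].append((k, feedback))
        for _, fb in arrivals.pop(k, []):     # feedback that becomes observable now
            policy = policy_update(policy, fb)   # e.g. a policy-optimization step
    return policy
```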