11 research outputs found
Nonstochastic Multiarmed Bandits with Unrestricted Delays
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT+D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $O(\min_{\beta}(|S_\beta| + \beta \ln K + (KT + D_\beta)/\beta))$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O(\sqrt{(KT+D)\ln K})$, but we also provide examples where $D_\beta \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.
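To make the skipping wrapper concrete, here is a minimal self-contained sketch of "delayed" Exp3 with a skip threshold. The interface (loss_fn, delay_fn), the fixed learning rate eta, and the threshold beta are illustrative assumptions, not the paper's implementation; the paper tunes the learning rate from $K$, $T$, and $D$ and selects the threshold via its doubling scheme.

```python
import heapq
import numpy as np

def delayed_exp3_skipping(K, T, loss_fn, delay_fn, eta, beta, seed=0):
    """Minimal sketch of 'delayed' Exp3 wrapped with a skipping rule.

    Hypothetical interface (assumptions, not from the paper):
      loss_fn(t, a)  -> loss of arm a at round t, in [0, 1]
      delay_fn(t)    -> delay d_t, assumed observable at action time
      eta            -> fixed learning rate (the paper tunes it via K, T, D)
      beta           -> threshold: observations with d_t > beta are skipped
    """
    rng = np.random.default_rng(seed)
    L = np.zeros(K)          # cumulative importance-weighted loss estimates
    pending = []             # min-heap of (arrival_round, arm, loss_estimate)
    total_loss = 0.0

    for t in range(T):
        # Apply every loss estimate whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, a, lhat = heapq.heappop(pending)
            L[a] += lhat

        # Exp3 sampling distribution over the K arms.
        w = np.exp(-eta * (L - L.min()))   # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)

        loss = loss_fn(t, a)
        total_loss += loss
        d = int(delay_fn(t))
        if d <= beta:                      # skipping wrapper: drop long delays
            heapq.heappush(pending, (t + d, a, loss / p[a]))

    return total_loss
```

For example, `delayed_exp3_skipping(K=5, T=10_000, loss_fn=lambda t, a: float(a != 0), delay_fn=lambda t: 3, eta=0.01, beta=100)` exercises the bounded-delay case, where nothing is ever skipped.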
An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $\mathcal{O}(\sqrt{kn} + \sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves $\mathcal{O}(\sqrt{kn} + \min_{S}(|S| + \sqrt{D_{\bar{S}}\log(k)}))$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar{S} = [n] \setminus S$ are the counted rounds, and $D_{\bar{S}}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
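As a rough numerical illustration of an FTRL step with a hybrid regularizer, the sketch below computes $x = \arg\min_{x \in \Delta} \langle \hat{L}, x \rangle + \Phi(x)$ over the simplex for $\Phi(x) = -(2/\eta)\sum_i \sqrt{x_i} + (1/\gamma)\sum_i x_i \log x_i$. The specific form of $\Phi$, the fixed $\eta$ and $\gamma$, and the bisection solver are assumptions for illustration only; the paper's algorithm uses carefully tuned time-varying parameters and is analyzed in closed form.

```python
import numpy as np

def ftrl_hybrid_step(Lhat, eta, gamma, iters=100):
    """One FTRL step over the probability simplex with the hybrid regularizer
        Phi(x) = -(2/eta) * sum_i sqrt(x_i) + (1/gamma) * sum_i x_i*log(x_i),
    i.e. x = argmin_{x in simplex} <Lhat, x> + Phi(x), for a NumPy array Lhat.
    Solved by bisection on the Lagrange multiplier (a numerical sketch, not
    the paper's closed-form analysis)."""

    def x_of(lmbda):
        # Stationarity: 1/(eta*sqrt(x_i)) - (log(x_i)+1)/gamma = Lhat_i - lmbda.
        # The left-hand side is strictly decreasing in x_i, so bisect per i.
        c = Lhat - lmbda
        x = np.empty_like(c)
        for i, ci in enumerate(c):
            lo, hi = 1e-16, 1e6
            for _ in range(iters):
                mid = np.sqrt(lo * hi)   # bisect in log space
                lhs = 1.0 / (eta * np.sqrt(mid)) - (np.log(mid) + 1.0) / gamma
                lo, hi = (mid, hi) if lhs > ci else (lo, mid)
            x[i] = np.sqrt(lo * hi)
        return x

    # sum(x_of(lambda)) is increasing in lambda; bisect until it equals 1.
    lo, hi = Lhat.min() - 1e6, Lhat.max() + 1e6
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if x_of(mid).sum() < 1.0 else (lo, mid)
    return x_of(0.5 * (lo + hi))
```

For instance, `ftrl_hybrid_step(np.array([0.3, 1.2, 0.7]), eta=0.1, gamma=0.1)` returns a probability vector that puts most mass on the low-loss first arm.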
Banker Online Mirror Descent
We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\widetilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\widetilde{O}(\mathrm{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
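For context, the sketch below shows the vanilla delayed-feedback OMD baseline that a framework like Banker-OMD refines: an entropy mirror map over the simplex, with gradients replayed on arrival. This is explicitly not the Banker mechanism (which settles per-round scaled OMD instances through a "banker"); the interface, fixed step size, and full-information feedback are all assumptions for illustration.

```python
import numpy as np

def delayed_omd_entropy(T, K, grad_fn, delay_fn, eta=0.1):
    """Vanilla delayed-feedback OMD baseline (NOT the Banker mechanism):
    entropy mirror map over the simplex, gradients applied on arrival.
    Assumed interface:
      grad_fn(t, x) -> loss gradient at the point played in round t
      delay_fn(t)   -> delay (in rounds) before round t's gradient arrives
    """
    x = np.full(K, 1.0 / K)
    inbox = {}                              # arrival_round -> gradients
    for t in range(T):
        for g in inbox.pop(t, []):          # gradients arriving this round
            x = x * np.exp(-eta * g)        # entropy-OMD multiplicative step
            x /= x.sum()
        g = grad_fn(t, x)                   # play x; gradient is delayed
        inbox.setdefault(t + 1 + int(delay_fn(t)), []).append(g)
    return x
```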
Gradient-free Online Learning in Games with Delayed Rewards
Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium with probability 1, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
Comment: 26 pages, 4 figures; to appear in ICML 2020
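The priority-queue idea can be pictured in a single-agent caricature: payoffs arrive with arbitrary delays, are staged in a queue keyed by their originating round, and are consumed in round order to drive a one-point (payoff-based) gradient step. The 1-D action set, interface, and step sizes below are assumptions for illustration, not the paper's multi-player construction.

```python
import heapq
import numpy as np

def delayed_one_point_ascent(T, reward_fn, delay_fn, step=0.05, radius=0.1, seed=0):
    """Single-agent caricature (assumed interface, not the paper's code) of
    gradient-free learning with delayed, out-of-order rewards on [0, 1].

      reward_fn(t, a) -> payoff of action a played at round t
      delay_fn(t)     -> delay (in rounds) before round t's payoff arrives

    Arriving payoffs are staged in a priority queue keyed by originating
    round, so updates are applied in round order even when rewards arrive
    out of order."""
    rng = np.random.default_rng(seed)
    x = 0.5                          # base action, kept radius away from 0/1
    ready = []                       # min-heap of (origin_round, z, payoff)
    inbox = {}                       # arrival_round -> staged entries
    next_up = 0                      # next origin round eligible for an update

    for t in range(T):
        for entry in inbox.pop(t, []):            # payoffs landing this round
            heapq.heappush(ready, entry)
        while ready and ready[0][0] == next_up:   # consume in round order
            _, z, r = heapq.heappop(ready)
            grad_est = r * z / radius             # one-point gradient estimate
            x = min(max(x + step * grad_est, radius), 1 - radius)
            next_up += 1

        z = rng.choice([-1.0, 1.0])   # perturbation direction (1-D sphere)
        a = x + radius * z            # played (perturbed) action
        d = int(delay_fn(t))
        inbox.setdefault(t + 1 + d, []).append((t, z, reward_fn(t, a)))
    return x
```

Running, say, `delayed_one_point_ascent(5000, lambda t, a: -(a - 0.7) ** 2, lambda t: 5)` drives the base action toward the payoff maximizer at 0.7 despite the delayed feedback.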
Learning Adversarial Markov Decision Processes with Delayed Feedback
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_k d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
Comment: AAAI 2022
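A bare-bones caricature of the bookkeeping: play episode $k$ with the current policy, stash its trajectory until it arrives $d^k$ episodes later, and only then apply a policy-optimization update. The softmax REINFORCE-style update and the interface below are illustrative assumptions; the paper's algorithms additionally handle unknown transitions and bandit feedback via confidence sets, which are omitted here.

```python
import numpy as np

def delayed_policy_optimization(run_episode, K, H, S, A, delay_fn, lr=0.5, seed=0):
    """Caricature of policy optimization with delayed episode feedback.
    Assumed interface (not from the paper):
      run_episode(act) -> list of (state, action, cost) of length H, where
                          act(h, s) samples an action at step h in state s
      delay_fn(k)      -> delay d_k of episode k's feedback
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((H, S, A))            # softmax policy logits
    inbox = {}                             # arrival episode -> trajectories

    def policy(h, s):
        z = np.exp(theta[h, s] - theta[h, s].max())
        return z / z.sum()

    for k in range(K):
        # Apply updates for every episode whose feedback has now arrived.
        for traj in inbox.pop(k, []):
            G = 0.0
            for h in range(H - 1, -1, -1):
                s, a, c = traj[h]
                G += c                       # cost-to-go from step h
                p = policy(h, s)
                grad = -p                    # grad of log pi(a|s): e_a - pi
                grad[a] += 1.0
                theta[h, s] -= lr * G * grad # descend on costs (REINFORCE-style)
        # Play episode k with the current policy; feedback arrives later.
        traj = run_episode(lambda h, s: rng.choice(A, p=policy(h, s)))
        inbox.setdefault(k + 1 + int(delay_fn(k)), []).append(traj)
    return theta
```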