11 research outputs found
Nonstochastic Multiarmed Bandits with Unrestricted Delays
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT+D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $O(\min_{\beta}(|S_\beta| + \beta \ln K + (KT + D_\beta)/\beta))$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O(\sqrt{(KT+D)\ln K})$, but we also provide examples where $D_\beta \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.
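To make the skipping wrapper concrete, here is a minimal self-contained sketch of "delayed" Exp3 with a skip threshold. The interface (loss_fn, delay_fn), the fixed learning rate eta, and the threshold beta are illustrative assumptions, not the paper's implementation; the paper tunes the learning rate from $K$, $T$, and $D$ and selects the threshold via its doubling scheme.

```python
import heapq
import numpy as np

def delayed_exp3_skipping(K, T, loss_fn, delay_fn, eta, beta, seed=0):
    """Minimal sketch of 'delayed' Exp3 wrapped with a skipping rule.

    Hypothetical interface (assumptions, not from the paper):
      loss_fn(t, a)  -> loss of arm a at round t, in [0, 1]
      delay_fn(t)    -> delay d_t, assumed observable at action time
      eta            -> fixed learning rate (the paper tunes it via K, T, D)
      beta           -> threshold: observations with d_t > beta are skipped
    """
    rng = np.random.default_rng(seed)
    L = np.zeros(K)          # cumulative importance-weighted loss estimates
    pending = []             # min-heap of (arrival_round, arm, loss_estimate)
    total_loss = 0.0

    for t in range(T):
        # Apply every loss estimate whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, a, lhat = heapq.heappop(pending)
            L[a] += lhat

        # Exp3 sampling distribution over the K arms.
        w = np.exp(-eta * (L - L.min()))   # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)

        loss = loss_fn(t, a)
        total_loss += loss
        d = int(delay_fn(t))
        if d <= beta:                      # skipping wrapper: drop long delays
            heapq.heappush(pending, (t + d, a, loss / p[a]))

    return total_loss
```

For example, `delayed_exp3_skipping(K=5, T=10_000, loss_fn=lambda t, a: float(a != 0), delay_fn=lambda t: 3, eta=0.01, beta=100)` exercises the bounded-delay case, where nothing is ever skipped.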
An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $\mathcal{O}(\sqrt{kn} + \sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves $\mathcal{O}(\sqrt{kn} + \min_{S}(|S| + \sqrt{D_{\bar{S}}\log(k)}))$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar{S} = [n] \setminus S$ are the counted rounds, and $D_{\bar{S}}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
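As a rough numerical illustration of an FTRL step with a hybrid regularizer, the sketch below computes $x = \arg\min_{x \in \Delta} \langle \hat{L}, x \rangle + \Phi(x)$ over the simplex for $\Phi(x) = -(2/\eta)\sum_i \sqrt{x_i} + (1/\gamma)\sum_i x_i \log x_i$. The specific form of $\Phi$, the fixed $\eta$ and $\gamma$, and the bisection solver are assumptions for illustration only; the paper's algorithm uses carefully tuned time-varying parameters and is analyzed in closed form.

```python
import numpy as np

def ftrl_hybrid_step(Lhat, eta, gamma, iters=100):
    """One FTRL step over the probability simplex with the hybrid regularizer
        Phi(x) = -(2/eta) * sum_i sqrt(x_i) + (1/gamma) * sum_i x_i*log(x_i),
    i.e. x = argmin_{x in simplex} <Lhat, x> + Phi(x), for a NumPy array Lhat.
    Solved by bisection on the Lagrange multiplier (a numerical sketch, not
    the paper's closed-form analysis)."""

    def x_of(lmbda):
        # Stationarity: 1/(eta*sqrt(x_i)) - (log(x_i)+1)/gamma = Lhat_i - lmbda.
        # The left-hand side is strictly decreasing in x_i, so bisect per i.
        c = Lhat - lmbda
        x = np.empty_like(c)
        for i, ci in enumerate(c):
            lo, hi = 1e-16, 1e6
            for _ in range(iters):
                mid = np.sqrt(lo * hi)   # bisect in log space
                lhs = 1.0 / (eta * np.sqrt(mid)) - (np.log(mid) + 1.0) / gamma
                lo, hi = (mid, hi) if lhs > ci else (lo, mid)
            x[i] = np.sqrt(lo * hi)
        return x

    # sum(x_of(lambda)) is increasing in lambda; bisect until it equals 1.
    lo, hi = Lhat.min() - 1e6, Lhat.max() + 1e6
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if x_of(mid).sum() < 1.0 else (lo, mid)
    return x_of(0.5 * (lo + hi))
```

For instance, `ftrl_hybrid_step(np.array([0.3, 1.2, 0.7]), eta=0.1, gamma=0.1)` returns a probability vector that puts most mass on the low-loss first arm.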
Banker Online Mirror Descent
We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\widetilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\widetilde{O}(\mathrm{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
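For context, the sketch below shows the vanilla delayed-feedback OMD baseline that a framework like Banker-OMD refines: an entropy mirror map over the simplex, with gradients replayed on arrival. This is explicitly not the Banker mechanism (which settles per-round scaled OMD instances through a "banker"); the interface, fixed step size, and full-information feedback are all assumptions for illustration.

```python
import numpy as np

def delayed_omd_entropy(T, K, grad_fn, delay_fn, eta=0.1):
    """Vanilla delayed-feedback OMD baseline (NOT the Banker mechanism):
    entropy mirror map over the simplex, gradients applied on arrival.
    Assumed interface:
      grad_fn(t, x) -> loss gradient at the point played in round t
      delay_fn(t)   -> delay (in rounds) before round t's gradient arrives
    """
    x = np.full(K, 1.0 / K)
    inbox = {}                              # arrival_round -> gradients
    for t in range(T):
        for g in inbox.pop(t, []):          # gradients arriving this round
            x = x * np.exp(-eta * g)        # entropy-OMD multiplicative step
            x /= x.sum()
        g = grad_fn(t, x)                   # play x; gradient is delayed
        inbox.setdefault(t + 1 + int(delay_fn(t)), []).append(g)
    return x
```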
Gradient-free Online Learning in Games with Delayed Rewards
Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium with probability 1, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
Comment: 26 pages, 4 figures; to appear in ICML 2020
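The priority-queue idea can be pictured in a single-agent caricature: payoffs arrive with arbitrary delays, are staged in a queue keyed by their originating round, and are consumed in round order to drive a one-point (payoff-based) gradient step. The 1-D action set, interface, and step sizes below are assumptions for illustration, not the paper's multi-player construction.

```python
import heapq
import numpy as np

def delayed_one_point_ascent(T, reward_fn, delay_fn, step=0.05, radius=0.1, seed=0):
    """Single-agent caricature (assumed interface, not the paper's code) of
    gradient-free learning with delayed, out-of-order rewards on [0, 1].

      reward_fn(t, a) -> payoff of action a played at round t
      delay_fn(t)     -> delay (in rounds) before round t's payoff arrives

    Arriving payoffs are staged in a priority queue keyed by originating
    round, so updates are applied in round order even when rewards arrive
    out of order."""
    rng = np.random.default_rng(seed)
    x = 0.5                          # base action, kept radius away from 0/1
    ready = []                       # min-heap of (origin_round, z, payoff)
    inbox = {}                       # arrival_round -> staged entries
    next_up = 0                      # next origin round eligible for an update

    for t in range(T):
        for entry in inbox.pop(t, []):            # payoffs landing this round
            heapq.heappush(ready, entry)
        while ready and ready[0][0] == next_up:   # consume in round order
            _, z, r = heapq.heappop(ready)
            grad_est = r * z / radius             # one-point gradient estimate
            x = min(max(x + step * grad_est, radius), 1 - radius)
            next_up += 1

        z = rng.choice([-1.0, 1.0])   # perturbation direction (1-D sphere)
        a = x + radius * z            # played (perturbed) action
        d = int(delay_fn(t))
        inbox.setdefault(t + 1 + d, []).append((t, z, reward_fn(t, a)))
    return x
```

Running, say, `delayed_one_point_ascent(5000, lambda t, a: -(a - 0.7) ** 2, lambda t: 5)` drives the base action toward the payoff maximizer at 0.7 despite the delayed feedback.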
Learning Adversarial Markov Decision Processes with Delayed Feedback
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_k d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
Comment: AAAI 2022
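A bare-bones caricature of the bookkeeping: play episode $k$ with the current policy, stash its trajectory until it arrives $d^k$ episodes later, and only then apply a policy-optimization update. The softmax REINFORCE-style update and the interface below are illustrative assumptions; the paper's algorithms additionally handle unknown transitions and bandit feedback via confidence sets, which are omitted here.

```python
import numpy as np

def delayed_policy_optimization(run_episode, K, H, S, A, delay_fn, lr=0.5, seed=0):
    """Caricature of policy optimization with delayed episode feedback.
    Assumed interface (not from the paper):
      run_episode(act) -> list of (state, action, cost) of length H, where
                          act(h, s) samples an action at step h in state s
      delay_fn(k)      -> delay d_k of episode k's feedback
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((H, S, A))            # softmax policy logits
    inbox = {}                             # arrival episode -> trajectories

    def policy(h, s):
        z = np.exp(theta[h, s] - theta[h, s].max())
        return z / z.sum()

    for k in range(K):
        # Apply updates for every episode whose feedback has now arrived.
        for traj in inbox.pop(k, []):
            G = 0.0
            for h in range(H - 1, -1, -1):
                s, a, c = traj[h]
                G += c                       # cost-to-go from step h
                p = policy(h, s)
                grad = -p                    # grad of log pi(a|s): e_a - pi
                grad[a] += 1.0
                theta[h, s] -= lr * G * grad # descend on costs (REINFORCE-style)
        # Play episode k with the current policy; feedback arrives later.
        traj = run_episode(lambda h, s: rng.choice(A, p=policy(h, s)))
        inbox.setdefault(k + 1 + int(delay_fn(k)), []).append(traj)
    return theta
```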