Nonstochastic Bandits with Composite Anonymous Feedback
We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
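To make the feedback model concrete, here is a minimal Python sketch of the environment side. All names are illustrative, and the split here is random, whereas the paper allows an adversarial split: each played action's loss is divided across the next d rounds, and the player observes only the per-round sum of components.

```python
import random

def composite_anonymous_rollout(loss_fn, policy, d, T):
    """Simulate composite anonymous feedback: the loss of the action played
    at round t is split over rounds t, ..., t+d-1 (here at random; the
    paper allows an adversarial split), and the player observes only the
    per-round sum of components from previously played actions."""
    pending = [0.0] * (T + d)        # loss components scheduled per round
    observed = []
    for t in range(T):
        action = policy(observed)    # player acts on past observations only
        loss = loss_fn(t, action)    # true instantaneous loss in [0, 1]
        weights = [random.random() for _ in range(d)]
        total = sum(weights)
        for s in range(d):
            pending[t + s] += loss * weights[s] / total
        observed.append(pending[t])  # only the aggregate is revealed
    return observed
```

A trivial policy such as `lambda obs: random.randrange(K)` is enough to exercise the loop and see that individual delayed losses are never exposed.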
Nonstochastic Multiarmed Bandits with Unrestricted Delays
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $\mathcal{O}(\sqrt{(KT + D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_{\beta}\left(|S_\beta| + \beta \ln K + (KT + D_{\bar\beta})/\beta\right)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_{\bar\beta}$ is the total delay of observations with delay below $\beta$. The bound relaxes to $\mathcal{O}(\sqrt{(KT + D)\ln K})$, but we also provide examples where $D_{\bar\beta} \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.
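As a rough illustration of the "delayed" Exp3 baseline mentioned above (a sketch under an assumed interface, not the authors' pseudocode), the loop below stores the sampling probability at play time and applies the importance-weighted Exp3 update whenever a delayed loss arrives; the skipping wrapper would additionally drop rounds whose delay exceeds a threshold.

```python
import math
import random

def delayed_exp3(K, T, eta, play):
    """Sketch of "delayed" Exp3 under an assumed interface: play(t, arm)
    plays `arm` at round t and returns the list of (s, loss) pairs whose
    feedback (the loss of the arm played at round s) arrives now."""
    log_w = [0.0] * K                         # log-weights for stability
    history = {}                              # s -> (arm, prob at play time)
    for t in range(T):
        m = max(log_w)
        w = [math.exp(v - m) for v in log_w]
        z = sum(w)
        p = [wi / z for wi in w]
        arm = random.choices(range(K), weights=p)[0]
        history[t] = (arm, p[arm])
        for s, loss in play(t, arm):          # delayed feedback arriving now
            a, prob = history[s]
            log_w[a] -= eta * loss / prob     # importance-weighted update
    return p                                  # final sampling distribution
```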
An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves an $\mathcal{O}(\sqrt{kn} + \sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves an $\mathcal{O}(\sqrt{kn} + \min_{S}(|S| + \sqrt{D_{\bar{S}}\log(k)}))$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar{S} = [n] \setminus S$ are the counted rounds, and $D_{\bar{S}}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
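A minimal numerical sketch of one FTRL step with such a hybrid regularizer, here assumed to pair a 1/2-Tsallis-entropy term with a negative-entropy term, with placeholder learning rates rather than the paper's tuned schedules:

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_hybrid_step(cum_loss_est, eta, gamma):
    """One FTRL step over the simplex with a hybrid regularizer: a
    1/2-Tsallis-entropy term (scaled by 1/eta) plus a negative-entropy
    term (scaled by 1/gamma). eta and gamma are placeholder learning
    rates; the paper derives specific schedules."""
    k = len(cum_loss_est)

    def objective(x):
        x = np.clip(x, 1e-12, 1.0)                 # keep sqrt/log defined
        tsallis = -2.0 * np.sum(np.sqrt(x)) / eta
        neg_ent = np.sum(x * np.log(x)) / gamma
        return float(x @ np.asarray(cum_loss_est)) + tsallis + neg_ent

    x0 = np.full(k, 1.0 / k)
    cons = ({"type": "eq", "fun": lambda x: np.sum(x) - 1.0},)
    res = minimize(objective, x0, bounds=[(1e-12, 1.0)] * k, constraints=cons)
    return res.x                                   # sampling distribution
```

The numerical solver stands in for the closed-form or Newton-type computation an actual implementation would use; it only illustrates which objective the FTRL step minimizes.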
Banker Online Mirror Descent
We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\widetilde{\mathcal{O}}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback: delayed adversarial multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\widetilde{\mathcal{O}}(\mathrm{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
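For orientation, the classical OMD primitive that Banker-OMD generalizes is shown below as a minimal negative-entropy (exponentiated-gradient) step over the simplex; the "banker" machinery for handling delayed feedback is beyond this sketch.

```python
import math

def omd_entropy_step(x, grad, eta):
    """Classical OMD step with the negative-entropy mirror map over the
    simplex (exponentiated gradient): the new point is proportional to
    x_t * exp(-eta * grad)."""
    w = [xi * math.exp(-eta * g) for xi, g in zip(x, grad)]
    z = sum(w)
    return [wi / z for wi in w]
```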
Stochastic Submodular Bandits with Delayed Composite Anonymous Bandit Feedback
This paper investigates the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and full-bandit delayed feedback, where the delayed feedback is assumed to be composite and anonymous. In other words, the delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components. Three models of delayed feedback are studied: bounded adversarial, stochastic independent, and stochastic conditionally independent, and regret bounds are derived for each of the delay models. Ignoring the problem-dependent parameters, we show that the regret bound for all the delay models is $\tilde{O}(T^{2/3} + T^{1/3}\nu)$ for time horizon $T$, where $\nu$ is a delay parameter defined differently in the three cases, thus demonstrating an additive term in regret with delay in all three delay models. The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
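A minimal sketch of one of the three delay models, the stochastic independent one, under illustrative assumptions (uniform random splits and delays; the paper's exact distributions may differ): each round's reward is split into sub-components that arrive after independent random delays, and only per-round sums are observed.

```python
import random
from collections import defaultdict

def stochastic_independent_rollout(reward_fn, policy, max_delay, T, parts=3):
    """Sketch of composite anonymous feedback under stochastic independent
    delays: each round's reward is split into sub-components, each arriving
    after its own independent random delay; the learner observes only the
    per-round sum of whatever components arrive."""
    arriving = defaultdict(float)
    observed = []
    for t in range(T):
        action = policy(observed)
        reward = reward_fn(t, action)
        split = [random.random() for _ in range(parts)]
        total = sum(split)
        for c in split:
            delay = random.randrange(max_delay + 1)   # independent delay
            arriving[t + delay] += reward * c / total
        observed.append(arriving[t])                  # aggregate only
    return observed
```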
Dynamical Linear Bandits
In many real-world sequential decision-making problems, an action is not immediately reflected in the feedback but spreads its effects over a long time
frame. For instance, in online advertising, investing in a platform produces an
instantaneous increase of awareness, but the actual reward, i.e., a conversion,
might occur far in the future. Furthermore, whether a conversion takes place depends on how fast the awareness grows, on its vanishing effects, and on the synergy or interference with other advertising platforms. Previous work has
investigated the Multi-Armed Bandit framework with the possibility of delayed
and aggregated feedback, without a particular structure on how an action
propagates in the future, disregarding possible dynamical effects. In this
paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an
extension of the linear bandits characterized by a hidden state. When an action
is performed, the learner observes a noisy reward whose mean is a linear
function of the hidden state and of the action. Then, the hidden state evolves
according to linear dynamics, affected by the performed action too. We start by
introducing the setting, discussing the notion of optimal policy, and deriving
an expected regret lower bound. Then, we provide an optimistic regret
minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB),
that suffers an expected regret of order $\widetilde{\mathcal{O}}\left(\frac{d\sqrt{T}}{(1-\overline{\rho})^{3/2}}\right)$, where $\overline{\rho}$ is a measure of the stability of the system, and $d$ is the dimension of the action
vector. Finally, we conduct a numerical validation on a synthetic environment
and on real-world data to show the effectiveness of DynLin-UCB in comparison
with several baselines.
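A minimal simulator of the interaction protocol described above (the parameter names A, B, omega, theta are illustrative, not necessarily the paper's notation): the reward mean is linear in the hidden state and the action, and the state evolves with linear dynamics driven by the action.

```python
import numpy as np

def dlb_rollout(A, B, omega, theta, actions, noise_std=0.1, seed=0):
    """Simulate the DLB interaction: the mean reward is linear in the
    hidden state h and the action a, and h evolves with linear dynamics
    driven by a. Parameter names are illustrative."""
    rng = np.random.default_rng(seed)
    h = np.zeros(A.shape[0])                 # hidden state, never observed
    rewards = []
    for a in actions:
        r = omega @ h + theta @ a + rng.normal(0.0, noise_std)
        rewards.append(r)                    # only this noisy scalar is seen
        h = A @ h + B @ a                    # linear state evolution
    return rewards
```

A stable choice of A (spectral radius below 1) corresponds to the stability measure that governs the regret bound above.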