Nonstochastic Bandits with Composite Anonymous Feedback
We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
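To make the feedback model concrete, here is a minimal Python sketch of the environment side. All names are illustrative, and the split here is random, whereas the paper allows an adversarial split: each played action's loss is divided across the next d rounds, and the player observes only the per-round sum of components.

```python
import random

def composite_anonymous_rollout(loss_fn, policy, d, T):
    """Simulate composite anonymous feedback: the loss of the action played
    at round t is split over rounds t, ..., t+d-1 (here at random; the
    paper allows an adversarial split), and the player observes only the
    per-round sum of components from previously played actions."""
    pending = [0.0] * (T + d)        # loss components scheduled per round
    observed = []
    for t in range(T):
        action = policy(observed)    # player acts on past observations only
        loss = loss_fn(t, action)    # true instantaneous loss in [0, 1]
        weights = [random.random() for _ in range(d)]
        total = sum(weights)
        for s in range(d):
            pending[t + s] += loss * weights[s] / total
        observed.append(pending[t])  # only the aggregate is revealed
    return observed
```

A trivial policy such as `lambda obs: random.randrange(K)` is enough to exercise the loop and see that individual delayed losses are never exposed.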
Nonstochastic Multiarmed Bandits with Unrestricted Delays
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $\mathcal{O}(\sqrt{(KT + D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_{\beta}\left(|S_\beta| + \beta \ln K + (KT + D_{\bar\beta})/\beta\right)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_{\bar\beta}$ is the total delay of observations with delay below $\beta$. The bound relaxes to $\mathcal{O}(\sqrt{(KT + D)\ln K})$, but we also provide examples where $D_{\bar\beta} \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.
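As a rough illustration of the "delayed" Exp3 baseline mentioned above (a sketch under an assumed interface, not the authors' pseudocode), the loop below stores the sampling probability at play time and applies the importance-weighted Exp3 update whenever a delayed loss arrives; the skipping wrapper would additionally drop rounds whose delay exceeds a threshold.

```python
import math
import random

def delayed_exp3(K, T, eta, play):
    """Sketch of "delayed" Exp3 under an assumed interface: play(t, arm)
    plays `arm` at round t and returns the list of (s, loss) pairs whose
    feedback (the loss of the arm played at round s) arrives now."""
    log_w = [0.0] * K                         # log-weights for stability
    history = {}                              # s -> (arm, prob at play time)
    for t in range(T):
        m = max(log_w)
        w = [math.exp(v - m) for v in log_w]
        z = sum(w)
        p = [wi / z for wi in w]
        arm = random.choices(range(K), weights=p)[0]
        history[t] = (arm, p[arm])
        for s, loss in play(t, arm):          # delayed feedback arriving now
            a, prob = history[s]
            log_w[a] -= eta * loss / prob     # importance-weighted update
    return p                                  # final sampling distribution
```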
An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays
We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves an $\mathcal{O}(\sqrt{kn} + \sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves an $\mathcal{O}(\sqrt{kn} + \min_{S}(|S| + \sqrt{D_{\bar{S}}\log(k)}))$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar{S} = [n] \setminus S$ are the counted rounds, and $D_{\bar{S}}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
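A minimal numerical sketch of one FTRL step with such a hybrid regularizer, here assumed to pair a 1/2-Tsallis-entropy term with a negative-entropy term, with placeholder learning rates rather than the paper's tuned schedules:

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_hybrid_step(cum_loss_est, eta, gamma):
    """One FTRL step over the simplex with a hybrid regularizer: a
    1/2-Tsallis-entropy term (scaled by 1/eta) plus a negative-entropy
    term (scaled by 1/gamma). eta and gamma are placeholder learning
    rates; the paper derives specific schedules."""
    k = len(cum_loss_est)

    def objective(x):
        x = np.clip(x, 1e-12, 1.0)                 # keep sqrt/log defined
        tsallis = -2.0 * np.sum(np.sqrt(x)) / eta
        neg_ent = np.sum(x * np.log(x)) / gamma
        return float(x @ np.asarray(cum_loss_est)) + tsallis + neg_ent

    x0 = np.full(k, 1.0 / k)
    cons = ({"type": "eq", "fun": lambda x: np.sum(x) - 1.0},)
    res = minimize(objective, x0, bounds=[(1e-12, 1.0)] * k, constraints=cons)
    return res.x                                   # sampling distribution
```

The numerical solver stands in for the closed-form or Newton-type computation an actual implementation would use; it only illustrates which objective the FTRL step minimizes.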
Banker Online Mirror Descent
We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\widetilde{\mathcal{O}}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback: delayed adversarial multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\widetilde{\mathcal{O}}(\mathrm{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
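For orientation, the classical OMD primitive that Banker-OMD generalizes is shown below as a minimal negative-entropy (exponentiated-gradient) step over the simplex; the "banker" machinery for handling delayed feedback is beyond this sketch.

```python
import math

def omd_entropy_step(x, grad, eta):
    """Classical OMD step with the negative-entropy mirror map over the
    simplex (exponentiated gradient): the new point is proportional to
    x_t * exp(-eta * grad)."""
    w = [xi * math.exp(-eta * g) for xi, g in zip(x, grad)]
    z = sum(w)
    return [wi / z for wi in w]
```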
Stochastic Submodular Bandits with Delayed Composite Anonymous Bandit Feedback
This paper investigates the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and full-bandit delayed feedback, where the delayed feedback is assumed to be composite and anonymous. In other words, the delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components. Three models of delayed feedback are studied: bounded adversarial, stochastic independent, and stochastic conditionally independent, and regret bounds are derived for each of the delay models. Ignoring the problem-dependent parameters, we show that the regret bound for all the delay models is $\tilde{O}(T^{2/3} + T^{1/3}\nu)$ for time horizon $T$, where $\nu$ is a delay parameter defined differently in the three cases, thus demonstrating an additive term in regret with delay in all three delay models. The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
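A minimal sketch of one of the three delay models, the stochastic independent one, under illustrative assumptions (uniform random splits and delays; the paper's exact distributions may differ): each round's reward is split into sub-components that arrive after independent random delays, and only per-round sums are observed.

```python
import random
from collections import defaultdict

def stochastic_independent_rollout(reward_fn, policy, max_delay, T, parts=3):
    """Sketch of composite anonymous feedback under stochastic independent
    delays: each round's reward is split into sub-components, each arriving
    after its own independent random delay; the learner observes only the
    per-round sum of whatever components arrive."""
    arriving = defaultdict(float)
    observed = []
    for t in range(T):
        action = policy(observed)
        reward = reward_fn(t, action)
        split = [random.random() for _ in range(parts)]
        total = sum(split)
        for c in split:
            delay = random.randrange(max_delay + 1)   # independent delay
            arriving[t + delay] += reward * c / total
        observed.append(arriving[t])                  # aggregate only
    return observed
```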
Dynamical Linear Bandits
In many real-world sequential decision-making problems, an action is not immediately reflected in the feedback but spreads its effects over a long time
frame. For instance, in online advertising, investing in a platform produces an
instantaneous increase of awareness, but the actual reward, i.e., a conversion,
might occur far in the future. Furthermore, whether a conversion takes place depends on how fast the awareness grows, on its vanishing effects, and on the synergy or interference with other advertising platforms. Previous work has
investigated the Multi-Armed Bandit framework with the possibility of delayed
and aggregated feedback, without a particular structure on how an action
propagates in the future, disregarding possible dynamical effects. In this
paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an
extension of the linear bandits characterized by a hidden state. When an action
is performed, the learner observes a noisy reward whose mean is a linear
function of the hidden state and of the action. Then, the hidden state evolves
according to linear dynamics, affected by the performed action too. We start by
introducing the setting, discussing the notion of optimal policy, and deriving
an expected regret lower bound. Then, we provide an optimistic regret
minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB),
that suffers an expected regret of order $\widetilde{\mathcal{O}}\left(\frac{d\sqrt{T}}{(1-\overline{\rho})^{3/2}}\right)$, where $\overline{\rho}$ is a measure of the stability of the system, and $d$ is the dimension of the action
vector. Finally, we conduct a numerical validation on a synthetic environment
and on real-world data to show the effectiveness of DynLin-UCB in comparison
with several baselines.
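A minimal simulator of the interaction protocol described above (the parameter names A, B, omega, theta are illustrative, not necessarily the paper's notation): the reward mean is linear in the hidden state and the action, and the state evolves with linear dynamics driven by the action.

```python
import numpy as np

def dlb_rollout(A, B, omega, theta, actions, noise_std=0.1, seed=0):
    """Simulate the DLB interaction: the mean reward is linear in the
    hidden state h and the action a, and h evolves with linear dynamics
    driven by a. Parameter names are illustrative."""
    rng = np.random.default_rng(seed)
    h = np.zeros(A.shape[0])                 # hidden state, never observed
    rewards = []
    for a in actions:
        r = omega @ h + theta @ a + rng.normal(0.0, noise_std)
        rewards.append(r)                    # only this noisy scalar is seen
        h = A @ h + B @ a                    # linear state evolution
    return rewards
```

A stable choice of A (spectral radius below 1) corresponds to the stability measure that governs the regret bound above.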