Delay and Cooperation in Nonstochastic Bandits
We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce \textsc{Exp3-Coop}, a cooperative version of the \textsc{Exp3} algorithm, and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1+\tfrac{K}{N}\alpha_{\le d}(G)\bigr)(T\ln K)}$, where $\alpha_{\le d}(G)$ is the independence number of the $d$-th power of the connected communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full-information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of \textsc{Exp3-Coop}, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
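To make the cooperative update concrete, here is a minimal Python sketch of the core mechanism: an Exp3-style exponential update driven by importance-weighted loss estimates built from neighbors' delayed observations. It is a simplification, not the paper's full \textsc{Exp3-Coop} (no graph powers, no per-agent tuning), and the graph, loss sequence, and constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): K actions, N agents, T rounds,
# observations travel for d rounds before they can be used.
K, N, T, d = 5, 4, 2000, 1
eta = np.sqrt(np.log(K) / (K * T))   # standard Exp3-style learning rate

# Assumed complete communication graph: every agent is within d hops of
# every other, so each agent's "neighborhood" is all N agents.
neighborhood = list(range(N))

weights = np.ones((N, K))
losses = rng.uniform(size=(T, K))    # oblivious loss sequence, for illustration
inbox = []                           # (delivery_round, agent, action, loss, q)

for t in range(T):
    probs = weights / weights.sum(axis=1, keepdims=True)
    plays = [rng.choice(K, p=probs[v]) for v in range(N)]

    for v in range(N):
        # q[i] = P(at least one agent in v's neighborhood plays i);
        # dividing the observed loss by q[i] keeps the estimate unbiased.
        q = 1.0 - np.prod(1.0 - probs[neighborhood], axis=0)
        for i in set(plays[u] for u in neighborhood):
            inbox.append((t + d, v, i, losses[t, i], q[i]))

    # Deliver messages whose d-round delay has elapsed, then update.
    for _, v, i, loss, q_i in (m for m in inbox if m[0] == t):
        weights[v, i] *= np.exp(-eta * loss / q_i)
    inbox = [m for m in inbox if m[0] > t]
```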
Nonstochastic Multiarmed Bandits with Unrestricted Delays
We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O(\sqrt{(KT+D)\ln K})$ regret bound conjectured by Cesa-Bianchi et al. [2016] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_{\beta}\bigl(|S_\beta| + \beta\ln K + (KT+D_{\bar S_\beta})/\beta\bigr)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_{\bar S_\beta}$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O(\sqrt{(KT+D)\ln K})$, but we also provide examples where $D_{\bar S_\beta} \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.
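The skipping idea is easy to sketch. Below is a Python toy in which a single Exp3 learner receives importance-weighted feedback after arbitrary delays and simply discards any round whose delay exceeds a fixed threshold beta; the paper's doubling scheme, which tunes beta without prior knowledge, is omitted, and the constants and delay distribution are invented. Note that the sketch reads the delay at action time, matching the paper's assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

K, T, beta = 4, 5000, 50                 # actions, rounds, skip threshold (assumed)
eta = np.sqrt(np.log(K) / (K * T))
weights = np.ones(K)
delays = rng.integers(0, 500, size=T)    # unrestricted delays, known at action time
losses = rng.uniform(size=(T, K))
pending = []                             # (arrival_round, action, loss_estimate)

for t in range(T):
    probs = weights / weights.sum()
    arm = rng.choice(K, p=probs)

    # Skipping wrapper: feedback that would arrive more than beta rounds
    # late is discarded outright rather than folded into the estimates.
    if delays[t] <= beta:
        pending.append((t + delays[t], arm, losses[t, arm] / probs[arm]))

    # Apply every importance-weighted update whose feedback arrived now.
    for _, a, est in (p for p in pending if p[0] == t):
        weights[a] *= np.exp(-eta * est)
    pending = [p for p in pending if p[0] > t]
```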
Delayed Bandits: When Do Intermediate Observations Help?
We study a $K$-armed bandit with delayed feedback and intermediate observations. We consider a model where intermediate observations have the form of a finite state, which is observed immediately after taking an action, whereas the loss is observed after an adversarially chosen delay. We show that the regime of the mapping of states to losses determines the complexity of the problem, irrespective of whether the mapping of actions to states is stochastic or adversarial. If the mapping of states to losses is adversarial, then the regret rate is of order $\sqrt{(K+d)T}$ (within log factors), where $T$ is the time horizon and $d$ is a fixed delay. This matches the regret rate of a $K$-armed bandit with delayed feedback and without intermediate observations, implying that intermediate observations are not helpful. However, if the mapping of states to losses is stochastic, we show that the regret grows at a rate of $\sqrt{\bigl(K+\min\{|\mathcal{S}|,d\}\bigr)T}$ (within log factors), implying that if the number $|\mathcal{S}|$ of states is smaller than the delay, then intermediate observations help. We also provide refined high-probability regret upper bounds for non-uniform delays, together with experimental validation of our algorithms.
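For quick reference, the dichotomy above can be written as a single display (notation as in the abstract, with the matching rates read as tight up to log factors):

```latex
% Regret regimes for delayed bandits with intermediate observations
% (log factors suppressed): adversarial vs. stochastic state-to-loss maps.
\[
  R_T \;=\;
  \begin{cases}
    \widetilde{\Theta}\!\left(\sqrt{(K + d)\,T}\right)
      & \text{if states} \to \text{losses is adversarial,} \\[4pt]
    \widetilde{\Theta}\!\left(\sqrt{\bigl(K + \min\{|\mathcal{S}|,\, d\}\bigr)\,T}\right)
      & \text{if states} \to \text{losses is stochastic.}
  \end{cases}
\]
```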
Online Learning with Switching Costs and Other Adaptive Adversaries
We study the power of different types of adaptive (nonoblivious) adversaries
in the setting of prediction with expert advice, under both full-information
and bandit feedback. We measure the player's performance using a new notion of
regret, also known as policy regret, which better captures the adversary's
adaptiveness to the player's behavior. In a setting where losses are allowed to
drift, we characterize, in a nearly complete manner, the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $\widetilde{\Theta}(T^{2/3})$. Interestingly, this rate is significantly worse than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $\widetilde{\Theta}(T^{2/3})$ regret even in the full-information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies.
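To fix notation (ours, not the abstract's): with actions $x_1,\dots,x_T$, losses $\ell_t$, and a unit cost per action switch, the switching-cost regret and the resulting rate separation read:

```latex
% Switching-cost regret (unit cost per action switch) and the resulting
% minimax rate separation between bandit and full-information feedback.
\[
  R_T \;=\; \sum_{t=1}^{T} \ell_t(x_t)
          \;+\; \sum_{t=2}^{T} \mathbf{1}\{x_t \neq x_{t-1}\}
          \;-\; \min_{x}\sum_{t=1}^{T} \ell_t(x),
  \qquad
  R_T^{\text{bandit}} = \widetilde{\Theta}\bigl(T^{2/3}\bigr),
  \quad
  R_T^{\text{full}} = \Theta\bigl(\sqrt{T}\bigr).
\]
```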
Dynamical Linear Bandits
In many real-world sequential decision-making problems, an action does not
immediately reflect on the feedback and spreads its effects over a long time
frame. For instance, in online advertising, investing in a platform produces an
instantaneous increase of awareness, but the actual reward, i.e., a conversion,
might occur far in the future. Furthermore, whether a conversion takes place
depends on how fast the awareness grows, its vanishing effects, and the synergy or interference with other advertising platforms. Previous work has
investigated the Multi-Armed Bandit framework with the possibility of delayed
and aggregated feedback, without a particular structure on how an action
propagates in the future, disregarding possible dynamical effects. In this
paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an
extension of the linear bandits characterized by a hidden state. When an action
is performed, the learner observes a noisy reward whose mean is a linear
function of the hidden state and of the action. Then, the hidden state evolves
according to linear dynamics, affected by the performed action too. We start by
introducing the setting, discussing the notion of optimal policy, and deriving
an expected regret lower bound. Then, we provide an optimistic regret
minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB),
that suffers an expected regret of order $\widetilde{\mathcal{O}}\bigl(\frac{d\sqrt{T}}{(1-\overline{\rho})^{3/2}}\bigr)$, where $\overline{\rho}$ is a measure of the stability of the system and $d$ is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of DynLin-UCB in comparison with several baselines.
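The feedback model is easy to simulate. Here is an illustrative Python sketch of the hidden-state dynamics described above; the matrices, dimensions, and noise level are invented, and the learner's action choice is a random placeholder standing in for DynLin-UCB.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, T = 3, 2, 100            # action dim, hidden-state dim, rounds (assumed)
A = 0.8 * np.eye(n)            # stable hidden dynamics: spectral radius < 1
B = rng.normal(size=(n, d))    # how actions feed into the hidden state
theta = rng.normal(size=n)     # reward weight on the hidden state
omega = rng.normal(size=d)     # reward weight on the action itself

x = np.zeros(n)                # hidden state, never observed by the learner
for t in range(T):
    u = rng.uniform(-1, 1, size=d)
    u /= max(1.0, float(np.linalg.norm(u)))     # keep actions in the unit ball
    # Noisy reward: mean is linear in the hidden state and in the action.
    reward = theta @ x + omega @ u + rng.normal(0.0, 0.1)
    x = A @ x + B @ u                           # state evolves with the action
```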
Regret-Minimization Algorithms for Multi-Agent Cooperative Learning Systems
A Multi-Agent Cooperative Learning (MACL) system is an artificial
intelligence (AI) system where multiple learning agents work together to
complete a common task. Recent empirical success of MACL systems in various
domains (e.g. traffic control, cloud computing, robotics) has sparked active
research into the design and analysis of MACL systems for sequential decision-making problems. One important metric of a learning algorithm for decision-making problems is its regret, i.e., the difference between the highest achievable reward and the actual reward that the algorithm gains. The design and development of MACL systems with low-regret learning algorithms can create huge economic value. In this thesis, I analyze MACL systems for different sequential decision-making problems. Concretely, Chapters 3 and 4 investigate cooperative multi-agent multi-armed bandit problems, with full-information or bandit feedback, in which multiple learning agents can exchange information through a communication network and each agent can only observe the rewards of the actions it chooses. Chapter 5 considers the communication-regret trade-off for online convex optimization in the distributed setting. Chapter 6 discusses how to form highly productive teams of agents based on their unknown but fixed types using adaptive incremental matchings. For the above problems, I present regret lower bounds for feasible learning algorithms and provide efficient algorithms that achieve these bounds. The regret bounds I present in Chapters 3, 4, and 5 quantify how the regret depends on the connectivity of the communication network and the communication delay, thus giving useful guidance on the design of communication protocols in MACL systems.
Comment: Thesis submitted to the London School of Economics and Political Science for a PhD in Statistics