A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize work on the online statistical learning
paradigm known as the multi-armed bandit, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of the multi-armed bandit, then a taxonomic
scheme of complications to that model, relating each complication to a
specific requirement or consideration of the experiment-design context.
Finally, we present a table of known upper bounds on regret for all studied
algorithms, providing both perspectives for future theoretical work and a
decision-making tool for practitioners looking for theoretical guarantees.
Comment: 49 pages, 1 figure
Efficient Learning in Large-Scale Combinatorial Semi-Bandits
A stochastic combinatorial semi-bandit is an online learning problem where at
each step a learning agent chooses a subset of ground items subject to
combinatorial constraints, and then observes stochastic weights of these items
and receives their sum as a payoff. In this paper, we consider efficient
learning in large-scale combinatorial semi-bandits with linear generalization,
and as a solution, propose two learning algorithms called Combinatorial Linear
Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both
algorithms are computationally efficient as long as the offline version of the
combinatorial problem can be solved efficiently. We establish that CombLinTS
and CombLinUCB are also provably statistically efficient under reasonable
assumptions, by developing regret bounds that are independent of the problem
scale (number of items) and sublinear in time. We also evaluate CombLinTS on a
variety of problems with thousands of items. Our experiment results demonstrate
that CombLinTS is scalable, robust to the choice of algorithm parameters, and
significantly outperforms the best of our baselines.
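The semi-bandit protocol described above (choose a feasible subset, observe each chosen item's stochastic weight, collect the sum) can be sketched with a plain per-item UCB baseline. Note this is not CombLinTS or CombLinUCB: the linear generalization is omitted, and a top-m cardinality constraint stands in for a general combinatorial constraint.

```python
import math
import random

def comb_ucb(item_means, m, horizon, seed=0):
    """Combinatorial semi-bandit with a top-m oracle and per-item UCB indices."""
    rng = random.Random(seed)
    n = len(item_means)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        def index(i):
            if counts[i] == 0:
                return float("inf")   # try every item at least once
            return sums[i] / counts[i] + math.sqrt(1.5 * math.log(t) / counts[i])
        # offline oracle: under a cardinality constraint, just take the top m
        chosen = sorted(range(n), key=index, reverse=True)[:m]
        for i in chosen:              # semi-bandit feedback: one weight per chosen item
            weight = 1.0 if rng.random() < item_means[i] else 0.0
            counts[i] += 1
            sums[i] += weight
    return counts

counts = comb_ucb([0.9, 0.8, 0.3, 0.2, 0.1], m=2, horizon=3000)
```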
Contextual Bandits with Random Projection
Contextual bandits with linear payoffs, which are also known as linear
bandits, provide a powerful alternative for solving practical problems of
sequential decisions, e.g., online advertisements. In the era of big data,
contextual data usually tend to be high-dimensional, which leads to new
challenges for traditional linear bandits mostly designed for the setting of
low-dimensional contextual data. Due to the curse of dimensionality, most
current bandit algorithms face two challenges: high time complexity, and
extremely large upper regret bounds with high-dimensional data. In this paper,
to attack both challenges effectively, we develop an algorithm of Contextual Bandits via
RAndom Projection (\texttt{CBRAP}) in the setting of linear payoffs, which
works especially for high-dimensional contextual data. The proposed
\texttt{CBRAP} algorithm is time-efficient and flexible, because it enables
players to choose an arm in a low-dimensional space, and relaxes the sparsity
assumption of a constant number of non-zero components made in previous work.
Moreover, we provide a linear upper regret bound for the proposed algorithm,
which is associated with the reduced dimension.
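The dimensionality-reduction step at the heart of this approach, projecting high-dimensional contexts into a low-dimensional space with a random matrix, can be sketched as follows (the Gaussian projection and the dimensions are illustrative assumptions; this is not the full CBRAP algorithm):

```python
import random

def random_projection_matrix(d_high, d_low, seed=0):
    """Gaussian random projection with entries N(0, 1/d_low): squared norms
    are approximately preserved, in the Johnson-Lindenstrauss style."""
    rng = random.Random(seed)
    scale = 1.0 / d_low ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_high)] for _ in range(d_low)]

def project(matrix, x):
    """Map a high-dimensional context x to the low-dimensional space."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

A = random_projection_matrix(d_high=1000, d_low=20)
context = [1.0] * 1000        # a toy high-dimensional context
low = project(A, context)     # the bandit can now work in 20 dimensions
```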
Small-loss bounds for online learning with partial information
We consider the problem of adversarial (non-stochastic) online learning with
partial information feedback, where at each round, a decision maker selects an
action from a finite set of alternatives. We develop a black-box approach for
such problems where the learner observes as feedback only losses of a subset of
the actions that includes the selected action. When losses of actions are
non-negative, under the graph-based feedback model introduced by Mannor and
Shamir, we offer algorithms that attain the so-called "small-loss"
$\tilde{\mathcal{O}}(\sqrt{\alpha L^{\star}})$ regret bounds with high probability, where $\alpha$ is the
independence number of the graph, and $L^{\star}$ is the loss of the best
action. Prior to our work, there was no data-dependent guarantee for general
feedback graphs even for pseudo-regret (without dependence on the number of
actions, i.e. utilizing the increased information feedback). Taking advantage
of the black-box nature of our technique, we extend our results to many other
applications such as semi-bandits (including routing in networks), contextual
bandits (even with an infinite comparator class), as well as learning with
slowly changing (shifting) comparators.
In the special case of classical bandit and semi-bandit problems, we provide
optimal small-loss, high-probability guarantees of $\tilde{\mathcal{O}}(\sqrt{dL^{\star}})$
for actual regret, where $d$ is the number of
actions, answering open questions of Neu. Previous bounds for bandits and
semi-bandits were known only for pseudo-regret and only in expectation. We also
offer an optimal $\tilde{\mathcal{O}}(\sqrt{\kappa L^{\star}})$ regret guarantee for
fixed feedback graphs with clique-partition number at most $\kappa$.
Comment: An extended abstract appeared in COLT 201
Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure
We introduce efficient algorithms that achieve nearly optimal regret for
the problem of stochastic online shortest path routing with end-to-end
feedback. The setting is a natural application of the combinatorial stochastic
bandits problem, a special case of the linear stochastic bandits problem. We
show how the difficulties posed by the large-scale action set can be overcome
by exploiting the networked structure of the action set. Our approach presents a novel
connection between bandit learning and shortest path algorithms. Our main
contribution is an adaptive exploration algorithm with nearly optimal
instance-dependent regret for any directed acyclic network. We then modify it
so that nearly optimal worst case regret is achieved simultaneously. Driven by
the carefully designed Top-Two Comparison (TTC) technique, the algorithms are
efficiently implementable. We further conduct extensive numerical experiments
to show that our proposed algorithms not only achieve superior regret
performance but also drastically reduce runtime.
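The offline oracle such a routing bandit calls each round is a shortest-path computation. A minimal dynamic-programming sketch for directed acyclic networks follows; the edge costs are hypothetical stand-ins for the learner's current estimates:

```python
def dag_shortest_path(edges, topo_order, source, target):
    """Shortest path in a directed acyclic network by dynamic programming
    over a topological order; this plays the role of the offline oracle a
    routing bandit would call with its current estimated edge costs."""
    dist = {v: float("inf") for v in topo_order}
    prev = {}
    dist[source] = 0.0
    for u in topo_order:
        for v, cost in edges.get(u, []):
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                prev[v] = u
    path, v = [target], target
    while v != source:      # walk the predecessor links back to the source
        v = prev[v]
        path.append(v)
    return list(reversed(path)), dist[target]

# hypothetical estimated delays on a small DAG from s to t
edges = {"s": [("a", 1.0), ("b", 4.0)], "a": [("b", 1.0), ("t", 5.0)], "b": [("t", 1.0)]}
path, cost = dag_shortest_path(edges, ["s", "a", "b", "t"], "s", "t")
print(path, cost)  # → ['s', 'a', 'b', 't'] 3.0
```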
Cascading Bandits for Large-Scale Recommendation Problems
Most recommender systems recommend a list of items. The user examines the
list, from the first item to the last, and often chooses the first attractive
item and does not examine the rest. This type of user behavior can be modeled
by the cascade model. In this work, we study cascading bandits, an online
learning variant of the cascade model where the goal is to recommend the most
attractive items from a large set of candidate items. We propose two
algorithms for solving this problem, which are based on the idea of linear
generalization. The key idea in our solutions is that we learn a predictor of
the attraction probabilities of items from their features, as opposed to
learning the attraction probability of each item independently, as in the
existing work. This results in practical learning algorithms whose regret does
not depend on the number of items. We bound the regret of one algorithm and
comprehensively evaluate the other on a range of recommendation problems. The
algorithm performs well and outperforms all baselines.
Comment: Accepted to UAI 201
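The cascade model of user behavior described above can be simulated directly. The sketch below performs no learning; the attraction probabilities are illustrative:

```python
import random

def cascade_feedback(attraction_probs, ranked_list, rng):
    """Cascade model: the user scans the list top-down, clicks the first
    attractive item, and never examines the items after the click."""
    for position, item in enumerate(ranked_list):
        if rng.random() < attraction_probs[item]:
            return position          # index of the clicked slot
    return None                      # no click: every item was unattractive

rng = random.Random(0)
probs = [0.9, 0.1, 0.1]              # hypothetical per-item attraction probabilities
clicks_at_top = sum(
    1 for _ in range(1000) if cascade_feedback(probs, [0, 1, 2], rng) == 0
)
# clicks_at_top should be close to 1000 * 0.9, since item 0 is shown first.
```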
Combinatorial Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with
knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited
"resources" consumed by the algorithm, e.g., limited supply in dynamic pricing.
The latter allows a huge number of actions but assumes combinatorial structure
and additional feedback to make the problem tractable. We define a common
generalization, support it with several motivating examples, and design an
algorithm for it. Our regret bounds are comparable with those for BwK and
combinatorial semi-bandits.
Multi-Objective Generalized Linear Bandits
In this paper, we study the multi-objective bandits (MOB) problem, where a
learner repeatedly selects one arm to play and then receives a reward vector
consisting of multiple objectives. MOB has found many real-world applications
as varied as online recommendation and network routing. These applications
typically contain contextual information that can guide the learning process,
yet such information is ignored by most existing work. To
utilize this information, we associate each arm with a context vector and
assume the reward follows the generalized linear model (GLM). We adopt the
notion of Pareto regret to evaluate the learner's performance and develop a
novel algorithm for minimizing it. The essential idea is to apply a variant of
the online Newton step to estimate model parameters, based on which we utilize
the upper confidence bound (UCB) policy to construct an approximation of the
Pareto front, and then uniformly at random choose one arm from the approximate
Pareto front. Theoretical analysis shows that the proposed algorithm achieves
a Pareto regret sublinear in the time horizon $T$ and polynomial in the
dimension $d$ of the contexts, which matches the optimal result for the
single-objective contextual bandit problem. Numerical experiments demonstrate the
effectiveness of our method.
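The Pareto-front construction at the core of the algorithm can be illustrated with exact (rather than UCB-estimated) reward vectors; the arm values below are hypothetical:

```python
def pareto_front(reward_vectors):
    """Return indices of arms whose reward vectors are not dominated:
    arm j dominates arm i if j is >= i in every objective and > in one."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    front = []
    for i, v in enumerate(reward_vectors):
        if not any(dominates(w, v) for j, w in enumerate(reward_vectors) if j != i):
            front.append(i)
    return front

arms = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
print(pareto_front(arms))  # → [0, 1, 2]; arm 3 is dominated by arm 1
```

An algorithm in this family would then pick uniformly at random among the arms on the (approximate) front.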
Semiparametric Contextual Bandits
This paper studies semiparametric contextual bandits, a generalization of the
linear stochastic bandit problem where the reward for an action is modeled as a
linear function of known action features, confounded by a non-linear
action-independent term. We design new algorithms that achieve
$\tilde{\mathcal{O}}(d\sqrt{T})$ regret over $T$ rounds when the linear function is
$d$-dimensional, which matches the best known bounds for the simpler
unconfounded case and improves on a recent result of Greenewald et al. (2017).
Via an empirical evaluation, we show that our algorithms outperform prior
approaches when there are non-linear confounding effects on the rewards.
Technically, our algorithms use a new reward estimator inspired by
doubly-robust approaches and our proofs require new concentration inequalities
for self-normalized martingales
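The structural fact these algorithms exploit, that an action-independent confounder cancels when arm rewards are compared, can be checked in a tiny simulation (the linear parameter, features, and noise scale below are illustrative assumptions):

```python
import random

theta = [0.5, -0.2]                         # hypothetical linear parameter
feats = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # known action features

def reward(arm, confounder):
    """Semiparametric reward: a linear term plus an action-independent shift."""
    linear = sum(t * f for t, f in zip(theta, feats[arm]))
    return linear + confounder

rng = random.Random(1)
diffs = []
for _ in range(100):
    nu = rng.gauss(0.0, 5.0)                # large action-independent confounder
    diffs.append(reward("a", nu) - reward("b", nu))
# Every difference equals theta . (x_a - x_b) = 0.7: the confounder cancels.
```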
Combinatorial Cascading Bandits
We propose combinatorial cascading bandits, a class of partial monitoring
problems where at each step a learning agent chooses a tuple of ground items
subject to constraints and receives a reward if and only if the weights of all
chosen items are one. The weights of the items are binary, stochastic, and
drawn independently of each other. The agent observes the index of the first
chosen item whose weight is zero. This observation model arises in network
routing, for instance, where the learning agent may only observe the first
link in the routing path that is down and blocks the path. We propose a UCB-like
algorithm for solving our problems, CombCascade, and prove gap-dependent and
gap-free upper bounds on its $n$-step regret. Our proofs build on recent work
in stochastic combinatorial semi-bandits but also address two novel challenges
of our setting, a non-linear reward function and partial observability. We
evaluate CombCascade on two real-world problems and show that it performs well
even when our modeling assumptions are violated. We also demonstrate that our
setting requires a new learning algorithm.
Comment: Advances in Neural Information Processing Systems 2