Combinatorial Bandits Revisited
This paper investigates stochastic and adversarial combinatorial multi-armed
bandit problems. In the stochastic setting under semi-bandit feedback, we
derive a problem-specific regret lower bound, and discuss its scaling with the
dimension of the decision space. We propose ESCB, an algorithm that efficiently
exploits the structure of the problem, and we provide a finite-time analysis of its
regret. ESCB has better performance guarantees than existing algorithms, and
significantly outperforms these algorithms in practice. In the adversarial
setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with
the same regret scaling as state-of-the-art algorithms, but with lower
computational complexity for some combinatorial problems.
Comment: 30 pages, Advances in Neural Information Processing Systems 28 (NIPS 2015)
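ESCB's exact index is defined in the paper; as a rough, hedged illustration of how semi-bandit feedback lets per-arm statistics drive a combinatorial choice, here is a generic optimistic loop in the style of CUCB. The decision set, the arm means, and the constant 1.5 in the bonus are illustrative assumptions, not the paper's.

```python
import math
import random

def semi_bandit_ucb(arm_means, actions, horizon, seed=0):
    """Generic optimistic semi-bandit loop (CUCB-style sketch).

    actions: the combinatorial decision set, as tuples of arm indices.
    Each round, pick the action maximizing the sum of per-arm UCB
    indices; semi-bandit feedback reveals every chosen arm's reward.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n
    sums = [0.0] * n
    total = 0.0
    for t in range(1, horizon + 1):
        def ucb(i):
            if counts[i] == 0:
                return float("inf")  # force initial exploration
            return sums[i] / counts[i] + math.sqrt(1.5 * math.log(t) / counts[i])
        action = max(actions, key=lambda a: sum(ucb(i) for i in a))
        for i in action:  # semi-bandit: observe each chosen arm's reward
            r = 1.0 if rng.random() < arm_means[i] else 0.0
            counts[i] += 1
            sums[i] += r
            total += r
    return total
```

Enumerating `actions` explicitly is only viable for tiny decision sets; the point of algorithms like ESCB is to exploit the problem's structure rather than enumerate.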
Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure
We introduce efficient algorithms that achieve nearly optimal regret for
the problem of stochastic online shortest path routing with end-to-end
feedback. The setting is a natural application of the combinatorial stochastic
bandits problem, a special case of the linear stochastic bandits problem. We
show how the difficulties posed by the large scale action set can be overcome
by the networked structure of the action set. Our approach presents a novel
connection between bandit learning and shortest path algorithms. Our main
contribution is an adaptive exploration algorithm with nearly optimal
instance-dependent regret for any directed acyclic network. We then modify it
so that nearly optimal worst case regret is achieved simultaneously. Driven by
the carefully designed Top-Two Comparison (TTC) technique, the algorithms are
efficiently implementable. We further conduct extensive numerical experiments
to show that our proposed algorithms not only achieve superior regret
performance but also reduce the runtime drastically.
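The paper's TTC-based algorithms handle the harder end-to-end (bandit) feedback; purely as a toy illustration of the bandit/shortest-path connection, here is a replanning loop in the simpler per-edge (semi-bandit) feedback setting, feeding optimistic edge estimates to Dijkstra. The graph, delays, noise model, and confidence constant are all made up for the sketch.

```python
import heapq
import math
import random

def shortest_path(n, adj, weights):
    """Dijkstra from node 0 to node n-1; returns the path's edge ids.

    adj: dict mapping node -> list of (next_node, edge_id).
    """
    dist = [math.inf] * n
    prev = [None] * n  # prev[v] = (u, edge_id) on the best path to v
    dist[0] = 0.0
    pq = [(0.0, 0)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, e in adj.get(u, []):
            if d + weights[e] < dist[v]:
                dist[v] = d + weights[e]
                prev[v] = (u, e)
                heapq.heappush(pq, (dist[v], v))
    path, u = [], n - 1
    while u != 0:
        u, e = prev[u]
        path.append(e)
    return path[::-1]

def route_optimistically(n, adj, true_delays, horizon, seed=0):
    """Each round: plan on optimistic (lower-confidence) delay estimates,
    traverse the path, and observe every edge's delay (semi-bandit)."""
    rng = random.Random(seed)
    m = len(true_delays)
    counts, sums = [0] * m, [0.0] * m
    for t in range(1, horizon + 1):
        w = []
        for e in range(m):
            if counts[e] == 0:
                w.append(0.0)  # unexplored edges look free (optimism)
            else:
                lcb = sums[e] / counts[e] - math.sqrt(2 * math.log(t) / counts[e])
                w.append(max(0.0, lcb))
        for e in shortest_path(n, adj, w):
            counts[e] += 1
            sums[e] += true_delays[e] + rng.gauss(0.0, 0.1)
    return counts  # how often each edge was traversed
```

On a two-path DAG, the cheap path's edges end up traversed far more often than the expensive path's, which is the behavior the regret bounds quantify.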
Top-k Combinatorial Bandits with Full-Bandit Feedback
Top-k combinatorial bandits generalize multi-armed bandits: at each round, any
subset of k out of n arms may be chosen, and the sum of the chosen arms'
rewards is gained. We address full-bandit feedback, in which the agent
observes only the sum of rewards, in contrast to semi-bandit feedback, in
which the agent also observes the individual arms' rewards. We present the
Combinatorial Successive Accepts and Rejects (CSAR) algorithm, which
generalizes SAR (Bubeck et al., 2013) for top-k combinatorial bandits. Our main
contribution is an efficient sampling scheme that uses Hadamard matrices in
order to accurately estimate the individual arms' expected rewards. We discuss
two variants of the algorithm: the first minimizes the sample complexity, and
the second minimizes the regret. We also prove a lower bound on the sample
complexity, which is tight for some values of k. Finally, we run experiments
and show that our algorithm outperforms other methods.
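The role of the Hadamard matrix can be seen in a small noiseless sketch: play the subset of arms on which a Hadamard row is +1 (plus one play of the full set), observe only the subset sums, and invert using the identity H^T H = mI. The actual CSAR scheme of course handles noisy rewards via repeated sampling; this only shows why the rows make individual means recoverable from sums.

```python
def hadamard(m):
    """Sylvester construction of an m x m Hadamard matrix (m a power of 2)."""
    H = [[1]]
    while len(H) < m:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def recover_means(mu):
    """Recover per-arm means from sum-only (full-bandit) observations.

    For each Hadamard row, observe the sum over the arms where the row
    is +1; one extra observation of the full set gives the grand total.
    Then <row, mu> = 2 * (sum over +1 arms) - total, and mu follows
    from H^T H = m I.
    """
    m = len(mu)
    H = hadamard(m)
    total = sum(mu)  # one play of all m arms at once
    z = [2 * sum(mu[j] for j in range(m) if row[j] == 1) - total for row in H]
    return [sum(H[i][j] * z[i] for i in range(m)) / m for j in range(m)]
```

With m plays (plus one), each arm's mean is pinned down even though no individual reward was ever observed.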
Thompson Sampling Algorithms for Cascading Bandits
Motivated by efficient optimization for online recommender systems, we
revisit the cascading bandit model proposed by Kveton et al. (2015). While
Thompson sampling (TS) algorithms have been shown to be empirically superior to
Upper Confidence Bound (UCB) algorithms for cascading bandits, theoretical
guarantees are only known for the latter, not the former. In this paper, we
close the gap by designing and analyzing a TS algorithm, TS-Cascade, that
achieves the state-of-the-art regret bound for cascading bandits. Next, we
derive a nearly matching regret lower bound, with information-theoretic
techniques and judiciously constructed cascading bandit instances. As a
complement, we also provide a problem-dependent upper bound on the regret of
the Thompson sampling algorithm with Beta-Bernoulli update; this upper bound is
tighter than a recent derivation by Huyuk and Tekin (2019). Finally, we
consider a linear generalization of the cascading bandit model, which allows
efficient learning in large cascading bandit problem instances. We introduce a
TS algorithm, which enjoys a regret bound that depends on the dimension of the
linear model but not on the number of items. Our paper establishes the first
theoretical guarantees on TS algorithms for stochastic combinatorial bandit
models with partial feedback. Numerical experiments demonstrate the
superiority of our TS algorithms compared to existing UCB algorithms.
Comment: 54 pages, 3 figures
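The Beta-Bernoulli variant mentioned above can be sketched in a few lines: the cascade feedback rule (items examined before the click are unattractive, the clicked item is attractive, later items are unobserved) drives the posterior updates. The simulator, priors, and problem parameters here are illustrative assumptions, not the paper's analyzed algorithm.

```python
import random

def ts_cascade(click_probs, K, horizon, seed=0):
    """Thompson sampling with Beta-Bernoulli updates for cascading bandits."""
    rng = random.Random(seed)
    L = len(click_probs)
    a = [1] * L  # Beta(a[i], b[i]) posterior on item i's attraction
    b = [1] * L
    clicks = 0
    for _ in range(horizon):
        theta = [rng.betavariate(a[i], b[i]) for i in range(L)]
        ranked = sorted(range(L), key=lambda i: -theta[i])[:K]
        for i in ranked:  # user scans the list top to bottom
            if rng.random() < click_probs[i]:
                a[i] += 1  # clicked: attractive, and the user stops
                clicks += 1
                break
            b[i] += 1  # examined but skipped: unattractive
    return clicks
```

Unclicked items below a click are left untouched, which is exactly the partial-feedback structure that makes the analysis nontrivial.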
Combinatorial Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with
knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited
"resources" consumed by the algorithm, e.g., limited supply in dynamic pricing.
The latter allows a huge number of actions but assumes combinatorial structure
and additional feedback to make the problem tractable. We define a common
generalization, support it with several motivating examples, and design an
algorithm for it. Our regret bounds are comparable with those for BwK and
combinatorial semi-bandits.
DCM Bandits: Learning to Rank with Multiple Clicks
A search engine recommends to the user a list of web pages. The user examines
this list, from the first page to the last, and clicks on all attractive pages
until the user is satisfied. This behavior of the user can be described by the
dependent click model (DCM). We propose DCM bandits, an online learning variant
of the DCM where the goal is to maximize the probability of recommending
satisfactory items, such as web pages. The main challenge of our learning
problem is that we do not observe which attractive item is satisfactory. We
propose a computationally efficient learning algorithm for solving our problem,
dcmKL-UCB; derive gap-dependent upper bounds on its regret under reasonable
assumptions; and also prove a matching lower bound up to logarithmic factors.
We evaluate our algorithm on synthetic and real-world problems, and show that
it performs well even when our model is misspecified. This work presents the
first practical and regret-optimal online algorithm for learning to rank with
multiple clicks in a cascade-like click model.
Comment: Proceedings of the 33rd International Conference on Machine Learning
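dcmKL-UCB's exact index is specified in the paper; the Bernoulli KL-UCB building block it rests on can be computed by bisection, sketched below. The exploration term here is simplified to log(t) (the standard form adds a log log(t) correction), so treat the constants as illustrative.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t):
    """Largest q >= mean with count * KL(mean, q) <= log(t), by bisection."""
    target = math.log(max(t, 2)) / count
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

The index shrinks toward the empirical mean as `count` grows, which is what yields the gap-dependent (KL-based) regret bounds.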
Cascading Bandits for Large-Scale Recommendation Problems
Most recommender systems recommend a list of items. The user examines the
list, from the first item to the last, and often chooses the first attractive
item and does not examine the rest. This type of user behavior can be modeled
by the cascade model. In this work, we study cascading bandits, an online
learning variant of the cascade model where the goal is to recommend most
attractive items from a large set of candidate items. We propose two
algorithms for solving this problem, which are based on the idea of linear
generalization. The key idea in our solutions is that we learn a predictor of
the attraction probabilities of items from their features, as opposed to
learning the attraction probability of each item independently, as in the
existing work. This results in practical learning algorithms whose regret does
not depend on the number of items. We bound the regret of one algorithm and
comprehensively evaluate the other on a range of recommendation problems. The
algorithm performs well and outperforms all baselines.
Comment: Accepted to UAI 201
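The linear-generalization idea can be sketched as regularized least squares on (feature, click) pairs: one shared weight vector scores every item, so the statistics grow with the feature dimension d rather than with the number of items. This plain ridge-regression version omits the confidence bonuses the actual algorithms add, and the features, observations, and regularizer are illustrative.

```python
def gauss_solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination (small, well-conditioned A)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]

def attraction_scores(features, observations, lam=1.0):
    """Score items by x . w, with w fit by ridge regression.

    observations: (feature_vector, click) pairs pooled across all items,
    so each item's score borrows strength from every other item's data.
    """
    d = len(features[0])
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in observations:
        for i in range(d):
            b[i] += y * x[i]
            for j in range(d):
                A[i][j] += x[i] * x[j]
    w = gauss_solve(A, b)
    return [sum(xi * wi for xi, wi in zip(x, w)) for x in features]
```

An item never shown before still gets a sensible score from its features alone, which is why the regret need not scale with the catalogue size.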
Regret Bounds for Stochastic Combinatorial Multi-Armed Bandits with Linear Space Complexity
Many real-world problems face the dilemma of choosing the best K out of N
options at a given time instant. This setup can be modelled as a combinatorial
bandit that chooses K out of N arms at each time, with the aim of achieving an
efficient tradeoff between exploration and exploitation. This is the first
work on combinatorial bandits where the reward received can be a non-linear
function of the chosen arms. The direct use of a multi-armed bandit would
require choosing among N-choose-K options, making the state space large. In
this paper, we present a novel algorithm that is computationally efficient and
whose storage is linear in N. The proposed algorithm, which we call CMAB-SM,
is a divide-and-conquer-based strategy. Further, CMAB-SM achieves a regret
bound, for a time horizon T, that is sub-linear in all of the parameters T, N,
and K. Evaluation results on different reward functions and arm distributions
show significantly improved performance compared to a standard multi-armed
bandit approach with N-choose-K choices.
Comment: 32 pages
Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem
We consider the combinatorial multi-armed bandit (CMAB) problem, where the
reward function is nonlinear. In this setting, the agent chooses a batch of
arms on each round and receives feedback from each arm of the batch. The reward
that the agent aims to maximize is a function of the selected arms and their
expectations. In many applications, the reward function is highly nonlinear,
and the performance of existing algorithms relies on a global Lipschitz
constant to encapsulate the function's nonlinearity. This may lead to loose
regret bounds, since a large gradient by itself does not necessarily cause a
large regret; it does so only in regions where the uncertainty in the reward's
parameters is high. To overcome this problem, we introduce a new smoothness
criterion, which we term \emph{Gini-weighted smoothness}, that takes into
account both the nonlinearity of the reward and concentration properties of the
arms. We show that a linear dependence of the regret in the batch size in
existing algorithms can be replaced by this smoothness parameter. This, in
turn, leads to much tighter regret bounds when the smoothness parameter is
batch-size independent. For example, in the probabilistic maximum coverage
(PMC) problem, which has many applications, including influence maximization,
diverse recommendations, and more, we achieve dramatic improvements in the upper
bounds. We also prove matching lower bounds for the PMC problem and show that
our algorithm is tight, up to a logarithmic factor in the problem's parameters.
Comment: Accepted to COLT 201
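For concreteness, the PMC reward named above is a standard example of such a nonlinear reward: a target counts as covered if at least one chosen arm covers it, so the expected reward is not a sum of per-arm means. A direct computation of the expected coverage (the probabilities are illustrative):

```python
def pmc_reward(chosen, prob):
    """Expected probabilistic-maximum-coverage (PMC) reward.

    prob[i][j]: probability that choosing arm i covers target j.
    A target is covered if at least one chosen arm covers it, so the
    expected number of covered targets is a nonlinear function of the
    per-arm probabilities.
    """
    n_targets = len(prob[0])
    total = 0.0
    for j in range(n_targets):
        miss = 1.0
        for i in chosen:
            miss *= 1.0 - prob[i][j]  # all chosen arms miss target j
        total += 1.0 - miss
    return total
```

Because the product term flattens out as more arms cover a target, a single global Lipschitz constant is pessimistic, which is the motivation for the Gini-weighted criterion.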
Hedging the Drift: Learning to Optimize under Non-Stationarity
We introduce data-driven decision-making algorithms that achieve
state-of-the-art \emph{dynamic regret} bounds for non-stationary bandit
settings. These settings capture applications such as advertisement allocation,
dynamic pricing, and traffic network routing in changing environments. We show
how the difficulty posed by the (unknown \emph{a priori} and possibly
adversarial) non-stationarity can be overcome by an unconventional marriage
between stochastic and adversarial bandit learning algorithms. Our main
contribution is a general algorithmic recipe for a wide variety of
non-stationary bandit problems. Specifically, we design and analyze the
sliding-window upper confidence bound algorithm, which achieves the optimal
dynamic regret bound for each of the settings when we know the respective underlying
\emph{variation budget}, which quantifies the total amount of temporal
variation of the latent environments. Boosted by the novel bandit-over-bandit
framework that adapts to the latent changes, we can further enjoy the (nearly)
optimal dynamic regret bounds in a (surprisingly) parameter-free manner. In
addition to the classical exploration-exploitation trade-off, our algorithms
leverage the power of the "forgetting principle" in the learning processes,
which is vital in changing environments. Our extensive numerical experiments on
both synthetic and real-world online auto-loan datasets show that our proposed
algorithms achieve superior empirical performance compared to existing
algorithms.
Comment: Journal version of the AISTATS 2019 paper (available at
arXiv:1810.03024). This version fixed an error in the proof of Theorem 2 with
Assumption 4 of arXiv:2103.0575
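The "forgetting principle" via a sliding window can be sketched independently of the paper's exact algorithm: compute UCB indices from only the most recent `window` rounds, so stale observations stop influencing the choice. The drift model, window length, and bonus constant below are illustrative assumptions, not the analyzed tuning.

```python
import math
import random
from collections import deque

def sliding_window_ucb(mean_fn, n_arms, horizon, window, seed=0):
    """Sliding-window UCB sketch: only the last `window` rounds count.

    mean_fn(arm, t) returns the (possibly drifting) mean reward of an
    arm at round t.  Forgetting old observations lets the index track
    environment changes, at the cost of a larger exploration bonus.
    """
    rng = random.Random(seed)
    events = deque()  # (round, arm, reward) within the window
    total = 0.0
    for t in range(1, horizon + 1):
        while events and events[0][0] <= t - window:
            events.popleft()  # forget stale observations
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for _, a, r in events:
            counts[a] += 1
            sums[a] += r
        def index(i):
            if counts[i] == 0:
                return float("inf")
            bonus = math.sqrt(2 * math.log(min(t, window)) / counts[i])
            return sums[i] / counts[i] + bonus
        arm = max(range(n_arms), key=index)
        reward = 1.0 if rng.random() < mean_fn(arm, t) else 0.0
        events.append((t, arm, reward))
        total += reward
    return total
```

On an instance whose best arm flips halfway through, the windowed index re-identifies the new best arm within roughly one window, whereas a standard UCB would keep trusting all of its pre-change data.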