Combinatorial Bandits Revisited
This paper investigates stochastic and adversarial combinatorial multi-armed
bandit problems. In the stochastic setting under semi-bandit feedback, we
derive a problem-specific regret lower bound, and discuss its scaling with the
dimension of the decision space. We propose ESCB, an algorithm that efficiently
exploits the structure of the problem, and provide a finite-time analysis of its
regret. ESCB has better performance guarantees than existing algorithms, and
significantly outperforms these algorithms in practice. In the adversarial
setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with
the same regret scaling as state-of-the-art algorithms, but with lower
computational complexity for some combinatorial problems.
Comment: 30 pages, Advances in Neural Information Processing Systems 28 (NIPS 2015).
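As a rough illustration of the kind of optimistic index ESCB computes under semi-bandit feedback, here is a minimal Python sketch; the exploration function f(n) follows the usual ESCB-2 shape, and the brute-force enumeration over actions is an illustrative assumption for small decision spaces, not the paper's exact algorithm.

    import itertools
    import math

    # ESCB-style optimistic index for semi-bandit feedback. The exploration
    # function f(n) and its constants are assumptions, not the paper's.
    def escb_index(action, means, counts, n):
        f_n = math.log(n) + 4 * len(action) * math.log(math.log(max(n, 3)))
        bonus = math.sqrt(0.5 * f_n * sum(1.0 / max(counts[i], 1) for i in action))
        return sum(means[i] for i in action) + bonus

    def choose_action(actions, means, counts, n):
        # Play the subset of base arms with the largest optimistic index;
        # enumeration is only viable when the action set is small.
        return max(actions, key=lambda a: escb_index(a, means, counts, n))

    # Toy use: all 2-subsets of 4 base arms.
    actions = list(itertools.combinations(range(4), 2))
    print(choose_action(actions, [0.5, 0.6, 0.4, 0.7], [1, 1, 1, 1], n=10))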
Sequential Monte Carlo Bandits
In this paper we propose a flexible and efficient framework for handling
multi-armed bandits, combining sequential Monte Carlo algorithms with
hierarchical Bayesian modeling techniques. The framework naturally encompasses
restless bandits, contextual bandits, and other bandit variants under a single
inferential model. Despite the model's generality, we propose efficient Monte
Carlo algorithms to make inference scalable, based on recent developments in
sequential Monte Carlo methods. Through two simulation studies, the framework
is shown to outperform other empirical methods, while also naturally scaling to
more complex problems with which existing approaches cannot cope. Additionally,
we successfully apply our framework to online video-based advertising
recommendation, and show its increased efficacy as compared to current
state-of-the-art bandit algorithms.
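A minimal sketch of the general idea, pairing a bootstrap particle filter with Thompson sampling for a Bernoulli bandit whose success rates drift over time (a restless flavor); the random-walk transition model, particle count, and resampling scheme are illustrative assumptions rather than the paper's specific inferential framework.

    import numpy as np

    rng = np.random.default_rng(0)

    class ParticleArm:
        """SMC (bootstrap particle filter) tracker for one arm's drifting
        Bernoulli success rate."""
        def __init__(self, n_particles=500, drift=0.01):
            self.p = rng.uniform(0, 1, n_particles)  # particles over the latent rate
            self.drift = drift

        def propagate(self):
            # Assumed transition model: small Gaussian random walk, clipped to [0, 1].
            self.p = np.clip(self.p + rng.normal(0, self.drift, self.p.size), 0, 1)

        def update(self, reward):
            # Reweight particles by the Bernoulli likelihood, then resample.
            w = (self.p if reward else 1 - self.p) + 1e-12
            self.p = self.p[rng.choice(self.p.size, self.p.size, p=w / w.sum())]

    def smc_thompson_step(arms):
        # Thompson sampling: draw one particle per arm and play the argmax.
        for arm in arms:
            arm.propagate()
        return int(np.argmax([rng.choice(arm.p) for arm in arms]))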
Correlated Multi-armed Bandits with a Latent Random Source
We consider a novel multi-armed bandit framework where the rewards obtained
by pulling the arms are functions of a common latent random variable. The
correlation between arms due to the common random source can be used to design
a generalized upper-confidence-bound (UCB) algorithm that identifies certain
arms as non-competitive, and avoids exploring them. As a result, we reduce a
$K$-armed bandit problem to a $C+1$-armed problem, where the set of $C+1$ arms
includes the best arm and $C$ competitive arms. Our regret analysis shows that the
competitive arms need to be pulled $O(\log T)$ times, while the
non-competitive arms are pulled only $O(1)$ times. As a result, there
are regimes where our algorithm achieves a $O(1)$ regret as opposed
to the typical logarithmic regret scaling of multi-armed bandit algorithms. We
also evaluate lower bounds on the expected regret and prove that our
correlated-UCB algorithm achieves $O(1)$ regret whenever possible.
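A minimal sketch of the competitive/non-competitive idea: arms whose pseudo-reward with respect to the empirically best arm cannot beat that arm's mean are skipped, and UCB runs over the rest. The pseudo-reward table (derived from the latent-source model in the paper) is an assumed input here, and the confidence terms are illustrative.

    import math

    def correlated_ucb_step(means, counts, pseudo, t):
        """One step of a correlated-UCB-style rule (a sketch).

        means[k], counts[k]: empirical mean and pull count of arm k.
        pseudo[k][j]: assumed known upper bound on arm k's reward given
        that arm j was played (from the common latent random source)."""
        k_star = max(range(len(means)), key=lambda k: means[k])
        # An arm is competitive only if its pseudo-reward w.r.t. the
        # empirically best arm could still beat that arm's mean.
        competitive = [k for k in range(len(means))
                       if k == k_star or pseudo[k][k_star] >= means[k_star]]
        def ucb(k):
            return means[k] + math.sqrt(2 * math.log(t) / max(counts[k], 1))
        return max(competitive, key=ucb)

    # Toy use: arm 1's pseudo-reward w.r.t. arm 0 is below arm 0's mean,
    # so arm 1 is non-competitive and is not explored.
    print(correlated_ucb_step([0.8, 0.3], [50, 5], [[1.0, 1.0], [0.6, 0.6]], t=55))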
Bandits with adversarial scaling
We study "adversarial scaling", a multi-armed bandit model where rewards have
a stochastic and an adversarial component. Our model captures display
advertising, where the "click-through rate" can be decomposed into a (fixed
across time) arm-quality component and a non-stochastic user-relevance
component (fixed across arms). Despite the relative stochasticity of our model,
demonstrate two settings where most bandit algorithms suffer. On the positive
side, we show that two algorithms, one from the action-elimination family and
one from the mirror-descent family, are adaptive enough to be robust to
adversarial scaling. Our results shed light on the robustness of adaptive
parameter selection in stochastic bandits, which may be of independent interest.
Comment: Appeared in ICML 2020.
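A small sketch of the reward model described above, in which each round's mean reward is the product of a fixed arm quality q_a and an adversarially chosen, arm-independent scale s_t; the particular scale sequence below is purely illustrative.

    import random

    def adversarial_scaling_rewards(qualities, scales):
        """Yield per-round Bernoulli rewards where the mean reward of arm a
        at round t is q_a * s_t: q_a fixed across time, s_t in [0, 1]
        chosen by an adversary and fixed across arms."""
        for s_t in scales:
            yield [1 if random.random() < q * s_t else 0 for q in qualities]

    # Example: the adversary suppresses all click-through rates early on,
    # starving the better arm of observed reward during exploration.
    scales = [0.05] * 1000 + [1.0] * 9000
    for rewards in adversarial_scaling_rewards([0.9, 0.6], scales):
        pass  # feed `rewards` to the bandit algorithm under test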
A Unified Approach to Translate Classical Bandit Algorithms to the Structured Bandit Setting
We consider a finite-armed structured bandit problem in which mean rewards of
different arms are known functions of a common hidden parameter $\theta^*$.
Since we do not place any restrictions on these functions, the problem setting
subsumes several previously studied frameworks that assume linear or invertible
reward functions. We propose a novel approach to gradually estimate the hidden
$\theta^*$ and use the estimate together with the mean reward functions to
substantially reduce exploration of sub-optimal arms. This approach enables us
to fundamentally generalize any classic bandit algorithm including UCB and
Thompson Sampling to the structured bandit setting. We prove via regret
analysis that our proposed UCB-C and TS-C algorithms (structured bandit
versions of UCB and Thompson Sampling, respectively) pull only a subset of the
sub-optimal arms $O(\log T)$ times while the other sub-optimal arms (referred
to as non-competitive arms) are pulled $O(1)$ times. As a result, in cases
where all sub-optimal arms are non-competitive, which can happen in many
practical scenarios, the proposed algorithms achieve bounded regret. We also
conduct simulations on the Movielens recommendations dataset to demonstrate the
improvement of the proposed algorithms over existing structured bandit
algorithms.
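A minimal sketch of the translation idea for the UCB case: keep a set of hidden-parameter values consistent with the observed means, retain only arms that are optimal for some surviving value, and run UCB on that competitive set. The grid discretization and confidence widths are illustrative assumptions, not the paper's estimator.

    import math

    def ucb_c_step(mu_funcs, theta_grid, means, counts, t, conf=0.1):
        """One round of a UCB-C-style rule (sketch with a discretized theta).

        mu_funcs[k](theta): known mean-reward function of arm k.
        theta_grid: candidate hidden-parameter values (an assumption: we
        discretize instead of the paper's continuous estimation)."""
        # Keep the thetas consistent with every arm's empirical mean so far.
        plausible = [th for th in theta_grid
                     if all(abs(mu_funcs[k](th) - means[k])
                            <= conf + math.sqrt(2 * math.log(t) / max(counts[k], 1))
                            for k in range(len(means)))]
        if not plausible:
            plausible = theta_grid  # fall back to plain UCB over all arms
        # Competitive arms: optimal for at least one plausible theta.
        competitive = {max(range(len(means)), key=lambda k: mu_funcs[k](th))
                       for th in plausible}
        def ucb(k):
            return means[k] + math.sqrt(2 * math.log(t) / max(counts[k], 1))
        return max(competitive, key=ucb)

    # Toy use: two arms with mean functions theta and 1 - theta.
    funcs = [lambda th: th, lambda th: 1 - th]
    grid = [i / 20 for i in range(21)]
    print(ucb_c_step(funcs, grid, means=[0.74, 0.22], counts=[40, 10], t=50))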
Unimodal Bandits without Smoothness
We consider stochastic bandit problems with a continuous set of arms and
where the expected reward is a continuous and unimodal function of the arm. No
further assumption is made regarding the smoothness and the structure of the
expected reward function. For these problems, we propose the Stochastic
Pentachotomy (SP) algorithm, and derive finite-time upper bounds on its regret
and optimization error. In particular, we show that, for any expected reward
function $\mu$ that behaves as $\mu(x) = \mu(x^*) - C|x - x^*|^\xi$ locally
around its maximizer $x^*$ for some $\xi, C > 0$, the SP algorithm is
order-optimal. Namely, its regret and optimization error scale as
$O(\sqrt{T\log T})$ and $O(\sqrt{\log T / T})$, respectively, when the time
horizon $T$ grows large. These scalings are achieved without the knowledge of
$\xi$ and $C$. Our algorithm is based on asymptotically optimal sequential
statistical tests used to successively trim an interval that contains the best
arm with high probability. To our knowledge, the SP algorithm constitutes the
first sequential arm selection rule that achieves a regret and optimization
error scaling as $O(\sqrt{T})$ and $O(1/\sqrt{T})$, respectively, up to a
logarithmic factor for non-smooth expected reward functions, as well as for
smooth functions with unknown smoothness.
Comment: 25 pages.
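A simplified sketch of the interval-trimming idea: by unimodality, comparing rewards at interior points lets one discard a portion of the interval that cannot contain the maximizer. For brevity this uses fixed-batch trisection rather than the paper's five-point pentachotomy with optimal sequential tests.

    import random

    def stochastic_trim(sample, lo=0.0, hi=1.0, rounds=12, batch=2000):
        """Successively trim an interval containing the maximizer of a
        unimodal expected-reward function. Fixed-batch comparisons stand
        in for the paper's asymptotically optimal sequential tests."""
        for _ in range(rounds):
            # Probe two interior points (a trisection-style simplification
            # of the five-point "pentachotomy" step).
            a = lo + (hi - lo) / 3
            b = hi - (hi - lo) / 3
            mean_a = sum(sample(a) for _ in range(batch)) / batch
            mean_b = sum(sample(b) for _ in range(batch)) / batch
            # By unimodality, the maximizer cannot lie in the discarded third.
            if mean_a < mean_b:
                lo = a
            else:
                hi = b
        return (lo + hi) / 2

    # Toy use: Bernoulli rewards with non-smooth unimodal mean 1 - |x - 0.7|.
    print(stochastic_trim(lambda x: 1 if random.random() < 1 - abs(x - 0.7) else 0))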
HAMLET -- A Learning Curve-Enabled Multi-Armed Bandit for Algorithm Selection
Automated algorithm selection and hyperparameter tuning facilitates the
application of machine learning. Traditional multi-armed bandit strategies look
to the history of observed rewards to identify the most promising arms for
optimizing expected total reward in the long run. When considering limited time
budgets and computational resources, this backward view of rewards is
inappropriate, as the bandit should instead look into the future to anticipate
the highest final reward at the end of a specified time budget. This work acts
on that insight by introducing HAMLET, which extends the bandit approach with
learning curve extrapolation and computation time-awareness for selecting among
a set of machine learning algorithms. Results show that the HAMLET Variants 1-3
exhibit equal or better performance than other bandit-based algorithm selection
strategies in experiments with recorded hyperparameter tuning traces for the
majority of considered time budgets. The best performing HAMLET Variant 3
combines learning curve extrapolation with the well-known upper confidence
bound exploration bonus. That variant performs better than all non-HAMLET
policies with statistical significance at the 95% level for 1,485 runs.
Comment: 8 pages, 8 figures; IJCNN 2020: International Joint Conference on Neural Networks.
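A minimal sketch of the Variant 3 idea, extrapolating each algorithm's learning curve to the end of the budget and adding a UCB exploration bonus; the linear extrapolation and bonus form are illustrative simplifications of HAMLET's actual curve model.

    import math

    def hamlet_style_choice(histories, counts, t, budget):
        """Pick an algorithm by extrapolated final reward plus a UCB bonus.

        histories[k]: list of (time_spent, validation_score) for algorithm k.
        """
        def extrapolated(k):
            h = histories[k]
            if len(h) < 2:
                return h[-1][1] if h else 0.0
            (t0, s0), (t1, s1) = h[-2], h[-1]
            slope = (s1 - s0) / max(t1 - t0, 1e-9)
            # Project the learning curve to the end of the time budget,
            # capped at a perfect score of 1.0.
            return min(1.0, s1 + slope * (budget - t1))
        def ucb(k):
            return extrapolated(k) + math.sqrt(2 * math.log(t) / max(counts[k], 1))
        return max(range(len(histories)), key=ucb)

    # Toy use: two algorithms, one slow starter with a steeper curve.
    hist = [[(1, 0.60), (2, 0.62)], [(1, 0.40), (2, 0.55)]]
    print(hamlet_style_choice(hist, counts=[2, 2], t=4, budget=10))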
Learning Unknown Service Rates in Queues: A Multi-Armed Bandit Approach
Consider a queueing system consisting of multiple servers. Jobs arrive over
time and enter a queue for service; the goal is to minimize the size of this
queue. At each opportunity for service, at most one server can be chosen, and
at most one job can be served. Service is successful with a probability (the
service probability) that is a priori unknown for each server. An algorithm
that knows the service probabilities (the "genie") can always choose the server
of highest service probability. We study algorithms that learn the unknown
service probabilities. Our goal is to minimize queue-regret: the (expected)
difference between the queue-lengths obtained by the algorithm, and those
obtained by the "genie."
Since queue-regret cannot be larger than classical regret, results for the
standard multi-armed bandit problem give algorithms for which queue-regret
increases no more than logarithmically in time. Our paper shows surprisingly
more complex behavior. In particular, as long as the bandit algorithm's queues
have relatively long regenerative cycles, queue-regret is similar to cumulative
regret, and scales (essentially) logarithmically. However, we show that this
"early stage" of the queueing bandit eventually gives way to a "late stage",
where the optimal queue-regret scaling is . We demonstrate an algorithm
that (order-wise) achieves this asymptotic queue-regret in the late stage. Our
results are developed in a more general model that allows for multiple job
classes as well
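A small simulation sketch of queue-regret: the same arrival model is served by a learning policy and by a genie that knows the best server, and the gap in queue lengths is the quantity of interest. The Bernoulli arrival/service model follows the abstract; the UCB learner and parameters are illustrative choices.

    import math
    import random

    def simulate_queue(policy, mu, arrival=0.3, horizon=5000, seed=0):
        """Single queue, multiple servers: one arrival attempt and (if the
        queue is non-empty) one service attempt per slot."""
        rng = random.Random(seed)
        q, lengths = 0, []
        means, counts = [0.0] * len(mu), [0] * len(mu)
        for t in range(1, horizon + 1):
            q += 1 if rng.random() < arrival else 0
            if q > 0:
                k = policy(means, counts, t)
                success = rng.random() < mu[k]
                counts[k] += 1
                means[k] += (success - means[k]) / counts[k]
                q -= 1 if success else 0
            lengths.append(q)
        return lengths

    def ucb_policy(means, counts, t):
        return max(range(len(means)), key=lambda k:
                   means[k] + math.sqrt(2 * math.log(t) / max(counts[k], 1)))

    # Queue-regret at time t: gap between the learning queue and a "genie"
    # that always picks the best server (index 0 here). Average over many
    # seeds in practice; a single run is shown for brevity.
    alg = simulate_queue(ucb_policy, mu=[0.5, 0.2])
    genie = simulate_queue(lambda m, c, t: 0, mu=[0.5, 0.2])
    print(alg[-1] - genie[-1])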
Learning Sequential Channel Selection for Interference Alignment using Reconfigurable Antennas
In recent years, machine learning techniques have been explored to support,
enhance, or augment wireless systems, especially at the physical layer of the
protocol stack. Traditional ML-based approaches or optimization are often not
suitable due to algorithmic complexity, reliance on existing training data,
and/or the distributed setting. In this paper, we formulate a reconfigurable
antenna based channel selection problem for interference alignment in a
multi-user wireless network as a learning problem. More specifically, we
propose that by using sequential learning, an effective channel or combination
of channels can be selected in order to enhance interference alignment using
reconfigurable antennas. We first formulate the channel selection as a
multi-armed bandit problem that aims to optimize the sum rate of the network. We show
that by using an adaptive sequential learning policy, each node in the network
can learn to select optimal channels without requiring full and instantaneous
CSI for all the available antenna states. We conduct a performance analysis of
our technique for a MIMO interference channel using a conventional IA scheme
and quantify the benefits of pattern diversity and of learning-based channel selection.
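A minimal sketch of the sequential channel/antenna-state selection loop, with a standard UCB learner and a noisy sum-rate measurement standing in for the interference-alignment pipeline; the sum_rate function and its Gaussian noise model are illustrative assumptions.

    import math
    import random

    def ucb_channel_selection(sum_rate, n_states, horizon=2000):
        """Sequentially select an antenna state / channel to maximize the
        observed network sum rate; sum_rate(state) is a noisy measurement
        taken after running interference alignment on that state."""
        means = [0.0] * n_states
        counts = [0] * n_states
        for t in range(1, horizon + 1):
            if t <= n_states:
                k = t - 1  # try every state once
            else:
                k = max(range(n_states), key=lambda s: means[s]
                        + math.sqrt(2 * math.log(t) / counts[s]))
            r = sum_rate(k)
            counts[k] += 1
            means[k] += (r - means[k]) / counts[k]
        return max(range(n_states), key=lambda s: means[s])

    # Toy use: three antenna states with different average sum rates.
    print(ucb_channel_selection(lambda s: random.gauss([2.0, 3.5, 3.0][s], 0.5), 3))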
UCB Algorithm for Exponential Distributions
We introduce in this paper a new algorithm for Multi-Armed Bandit (MAB)
problems, a machine learning paradigm popular within Cognitive Network related
topics (e.g., Spectrum Sensing and Allocation). We focus on the case where the
rewards are exponentially distributed, which is common when dealing with
Rayleigh fading channels. This strategy, named Multiplicative Upper Confidence
Bound (MUCB), associates a utility index with every available arm, and then
selects the arm with the highest index. For every arm, the associated index is
equal to the product of a multiplicative factor and the sample mean of the
rewards collected from this arm. We show that the MUCB policy has low
complexity and is order-optimal.
Comment: 10 pages. Introduces Multiplicative Upper Confidence Bound (MUCB) algorithms for Multi-Armed Bandit problems.
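A minimal sketch of a multiplicative index of the kind described: the sample mean scaled by a factor greater than one that shrinks as the arm is played. The exact factor in the paper is tuned to exponentially distributed rewards; the form below is an illustrative assumption.

    import math
    import random

    def mucb_index(mean_k, n_k, t, alpha=4.0):
        # Multiplicative index: sample mean times a shrinking factor > 1.
        # This particular factor is an assumed form, not the paper's.
        return mean_k * (1.0 + math.sqrt(alpha * math.log(t) / (2 * n_k)))

    def mucb_step(means, counts, t):
        return max(range(len(means)),
                   key=lambda k: mucb_index(means[k], counts[k], t))

    # Toy use with exponential rewards (e.g., Rayleigh-fading channel gains):
    rates = [1.0, 1.5]  # per-arm means of the exponential reward distributions
    means, counts = [random.expovariate(1 / r) for r in rates], [1, 1]
    for t in range(3, 1000):
        k = mucb_step(means, counts, t)
        r = random.expovariate(1 / rates[k])
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]
    print(counts)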