Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems
Restless bandit problems are instances of non-stationary multi-armed bandits.
These problems have been studied well from the optimization perspective, where
the goal is to efficiently find a near-optimal policy when system parameters
are known. However, very few papers adopt a learning perspective, where the
parameters are unknown. In this paper, we analyze the performance of Thompson
sampling in episodic restless bandits with unknown parameters. We consider a
general policy map to define our competitor and prove an $\tilde{O}(\sqrt{T})$
Bayesian regret bound. Our competitor is
flexible enough to represent various benchmarks including the best fixed action
policy, the optimal policy, the Whittle index policy, or the myopic policy. We
also present empirical results that support our theoretical findings.
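As a concrete illustration of the scheme this abstract describes, here is a minimal sketch of episodic posterior sampling for restless bandits, assuming two-state arms with unknown transition probabilities, independent Beta priors, and a caller-supplied policy_map that turns sampled parameters into one of the benchmark policies; the environment interface and all names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_episodic(env, policy_map, n_episodes, episode_len, n_arms):
    """Posterior sampling for episodic restless bandits (sketch).

    Each arm is a 2-state Markov chain; we keep Beta(a, b) posteriors
    over the probability of moving to state 1 from each state.
    """
    # post[arm, state] = [alpha, beta] for P(next state = 1 | state)
    post = np.ones((n_arms, 2, 2))
    for _ in range(n_episodes):
        # 1. Sample one parameter vector from the current posterior.
        theta = rng.beta(post[..., 0], post[..., 1])   # shape (n_arms, 2)
        # 2. Plan: map the sample to a policy (e.g. Whittle, myopic, ...).
        policy = policy_map(theta)
        # 3. Act for one episode, recording the observed transitions.
        states = env.reset()
        for _ in range(episode_len):
            arm = policy(states)
            s = states[arm]
            states = env.step(arm)
            s_next = states[arm]               # only the pulled arm is observed
            post[arm, s, 0] += (s_next == 1)   # conjugate Beta update
            post[arm, s, 1] += (s_next == 0)
    return post
```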
Optimal Recommendation to Users that React: Online Learning for a Class of POMDPs
We describe and study a model for an Automated Online Recommendation System
(AORS) in which a user's preferences can be time-dependent and can also depend
on the history of past recommendations and play-outs. The three key features of
the model that make it more realistic compared to existing models for
recommendation systems are (1) user preference is inherently latent, (2)
current recommendations can affect future preferences, and (3) it allows for
the development of learning algorithms with provable performance guarantees.
The problem is cast as an average-cost restless multi-armed bandit for a given
user, with an independent partially observable Markov decision process (POMDP)
for each item of content. We analyze the POMDP for a single arm, describe its
structural properties, and characterize its optimal policy. We then develop a
Thompson sampling-based online reinforcement learning algorithm to learn the
parameters of the model and optimize utility from the binary responses of the
users to continuous recommendations. We then analyze the performance of the
learning algorithm and characterize the regret. Illustrative numerical results
and directions for extension to the restless hidden Markov multi-armed bandit
problem are also presented.
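The single-arm POMDP analysis rests on a standard two-state belief update: Bayes' rule conditions the belief on the binary response, and the known transition kernel then propagates it forward. A minimal sketch under that reading (parameter names are illustrative, not the paper's):

```python
def belief_update(b, y, p_like, P):
    """One Bayes step for a two-state POMDP arm (sketch).

    b      : current belief P(state = 1)
    y      : observed binary response (0 or 1)
    p_like : p_like[s] = P(y = 1 | state = s), the known response model
    P      : 2x2 transition matrix, P[s][s2] = P(next state = s2 | s)
    """
    # Correction: condition the belief on the observed response y.
    l1 = p_like[1] if y == 1 else 1.0 - p_like[1]
    l0 = p_like[0] if y == 1 else 1.0 - p_like[0]
    b_post = b * l1 / (b * l1 + (1.0 - b) * l0)
    # Prediction: push the corrected belief through the transition kernel.
    return b_post * P[1][1] + (1.0 - b_post) * P[0][1]

# e.g. belief_update(0.5, 1, p_like=[0.2, 0.9], P=[[0.9, 0.1], [0.3, 0.7]])
```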
Sequential Monte Carlo Bandits
In this paper we propose a flexible and efficient framework for handling
multi-armed bandits, combining sequential Monte Carlo algorithms with
hierarchical Bayesian modeling techniques. The framework naturally encompasses
restless bandits, contextual bandits, and other bandit variants under a single
inferential model. Despite the model's generality, we propose efficient Monte
Carlo algorithms to make inference scalable, based on recent developments in
sequential Monte Carlo methods. Through two simulation studies, the framework
is shown to outperform other empirical methods, while also naturally scaling to
more complex problems with which existing approaches cannot cope. Additionally,
we successfully apply our framework to online video-based advertising
recommendation, and show its increased efficacy compared to state-of-the-art
bandit algorithms.
On the Whittle Index for Restless Multi-armed Hidden Markov Bandits
We consider a restless multi-armed bandit in which each arm can be in one of
two states. When an arm is sampled, the state of the arm is not available to
the sampler. Instead, a binary signal with a known randomness that depends on
the state of the arm is available. No signal is available if the arm is not
sampled. An arm-dependent reward is accrued from each sampling. In each time
step, each arm changes state according to known transition probabilities which
in turn depend on whether the arm is sampled or not sampled. Since the state of
the arm is never visible and has to be inferred from the current belief and a
possible binary signal, we call this the hidden Markov bandit. Our interest is
in a policy to select the arm(s) in each time step that maximizes the infinite
horizon discounted reward. Specifically, we seek the use of Whittle's index in
selecting the arms. We first analyze the single-armed bandit and show that in
general, it admits an approximate threshold-type optimal policy when there is a
positive reward for the 'no-sample' action. We also identify several special
cases for which the threshold policy is indeed the optimal policy. Next, we
show that such a single-armed bandit also satisfies an approximate-indexability
property. For the case when the single-armed bandit admits a threshold-type
optimal policy, we perform the calculation of the Whittle index for each arm.
Numerical examples illustrate the analytical results.
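When the threshold policy is optimal, the index computation reduces to a one-dimensional search: the Whittle index at a belief is the passivity subsidy at which sampling and not sampling are equally attractive. Below is a minimal sketch via belief-grid value iteration and bisection. It simplifies the paper's model (one transition kernel for both actions, reward earned only in state 1) and all names are ours:

```python
import numpy as np

def whittle_index(b, reward, p_like, P, beta=0.9, n_grid=101, iters=150):
    """Whittle index at belief b via bisection on the passivity subsidy (sketch)."""
    grid = np.linspace(0.0, 1.0, n_grid)

    def predict(bel):                        # belief drift when not sampled
        return bel * P[1][1] + (1.0 - bel) * P[0][1]

    def correct(bel, y):                     # Bayes step after binary signal y
        l1 = p_like[1] if y else 1.0 - p_like[1]
        l0 = p_like[0] if y else 1.0 - p_like[0]
        post = bel * l1 / (bel * l1 + (1.0 - bel) * l0)
        return predict(post)

    def q_values(w, bel, V):
        passive = w + beta * np.interp(predict(bel), grid, V)
        py1 = bel * p_like[1] + (1.0 - bel) * p_like[0]   # P(signal = 1)
        active = bel * reward + beta * (
            py1 * np.interp(correct(bel, 1), grid, V)
            + (1.0 - py1) * np.interp(correct(bel, 0), grid, V))
        return passive, active

    def passive_minus_active(w):             # sign drives the bisection
        V = np.zeros(n_grid)
        for _ in range(iters):               # value iteration at subsidy w
            V = np.array([max(q_values(w, x, V)) for x in grid])
        passive, active = q_values(w, b, V)
        return passive - active

    lo, hi = -reward / (1.0 - beta), reward / (1.0 - beta)
    for _ in range(25):
        mid = 0.5 * (lo + hi)
        if passive_minus_active(mid) > 0:
            hi = mid                         # subsidy too generous: lower it
        else:
            lo = mid
    return 0.5 * (lo + hi)
```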
Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits
We study the online restless bandit problem, where the state of each arm
evolves according to a Markov chain, and the reward of pulling an arm depends
on both the pulled arm and the current state of the corresponding Markov chain.
In this paper, we propose Restless-UCB, a learning policy that follows the
explore-then-commit framework. In Restless-UCB, we present a novel method to
construct offline instances, which only requires $O(N)$ time-complexity ($N$ is
the number of arms) and is exponentially better than the complexity of existing
learning policies. We also prove that Restless-UCB achieves a regret upper bound
of $\tilde{O}((N+M^{3})T^{\frac{2}{3}})$, where $M$ is the Markov chain state space
size and $T$ is the time horizon. Compared to existing algorithms, our result
eliminates the exponential factor (in $M$) in the regret upper bound, due to
a novel exploitation of the sparsity in transitions in general restless bandit
problems. As a result, our analysis technique can also be adopted to tighten
the regret bounds of existing algorithms. Finally, we conduct experiments based
on a real-world dataset to compare the Restless-UCB policy with state-of-the-art
benchmarks. Our results show that Restless-UCB outperforms existing algorithms
in regret, and significantly reduces the running time.
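The explore-then-commit structure the abstract refers to is easy to state in outline: pull each arm in one long consecutive block so its chain is observed along a single trajectory, estimate the transition matrices from the counts, then commit to the policy of an offline instance built from the estimates. A skeleton under those assumptions, where offline_solver stands in for the paper's offline-instance construction (not reproduced here):

```python
import numpy as np

def explore_then_commit(env, n_arms, n_states, horizon, offline_solver):
    """Explore-then-commit skeleton in the spirit of Restless-UCB (sketch)."""
    T_explore = int(horizon ** (2.0 / 3.0))    # matches the T^{2/3} regret split
    counts = np.zeros((n_arms, n_states, n_states))
    state = env.reset()
    # Exploration: one contiguous block of pulls per arm.
    for t in range(T_explore):
        arm = (t * n_arms) // T_explore
        s = state[arm]
        state = env.step(arm)
        counts[arm, s, state[arm]] += 1        # observed transition of that arm
    # Estimate transition matrices; unvisited rows fall back to uniform.
    P_hat = counts + 1e-9
    P_hat /= P_hat.sum(axis=2, keepdims=True)
    # Commit: play the policy computed from the estimated instance.
    policy = offline_solver(P_hat)
    for _ in range(horizon - T_explore):
        state = env.step(policy(state))
    return P_hat
```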
Thompson Sampling in Non-Episodic Restless Bandits
Restless bandit problems assume time-varying reward distributions of the
arms, which adds flexibility to the model but makes the analysis more
challenging. We study learning algorithms over the unknown reward distributions
and prove a sub-linear, $O(\sqrt{T}\log T)$, regret bound for a variant of
Thompson sampling. Our analysis applies in the infinite time horizon setting,
resolving the open question raised by Jung and Tewari (2019) whose analysis is
limited to the episodic case. We adopt their policy mapping framework, which
allows our algorithm to be efficient and simultaneously keeps the regret
meaningful. Our algorithm adapts the TSDE algorithm of Ouyang et al. (2017) in
a non-trivial manner to account for the special structure of restless bandits.
We test our algorithm on a simulated dynamic channel access problem with
several policy mappings, and the empirical regrets agree with the theoretical
bound regardless of the choice of the policy mapping.
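The TSDE mechanism being adapted is a dynamic-episode rule: a fresh parameter is sampled whenever the current episode outgrows the previous one, or the visit count of some state-action pair doubles. The paper's adaptation to restless bandits is more delicate; this sketch only illustrates the episode-triggering mechanism, with all interfaces assumed:

```python
def tsde_restless(env, policy_map, sample_posterior, update_posterior, horizon):
    """Thompson sampling with dynamically determined episodes (sketch)."""
    visits, counts = {}, {}        # counts at last episode start / running counts
    t_start, prev_len = 0, 0
    policy = policy_map(sample_posterior())    # initial parameter draw
    state = env.reset()
    for t in range(horizon):
        key = (tuple(state), policy(state))
        counts[key] = counts.get(key, 0) + 1
        # TSDE stopping criteria: episode too long, or some count doubled.
        too_long = t - t_start > prev_len + 1
        doubled = counts[key] > 2 * visits.get(key, 0)
        if too_long or doubled:
            prev_len, t_start = t - t_start, t
            visits = dict(counts)              # freeze counts at the boundary
            policy = policy_map(sample_posterior())  # fresh posterior sample
        arm = policy(state)
        state = env.step(arm)
        update_posterior(arm, state)           # fold the observation back in
    return policy
```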
Policy Gradients for Contextual Recommendations
Decision making is a challenging task in online recommender systems. The
decision maker often needs to choose a contextual item at each step from a set
of candidates. Contextual bandit algorithms have been successfully deployed in
such applications, owing to their handling of the exploration-exploitation
trade-off and their state-of-the-art performance on minimizing online costs.
However, the
applicability of existing contextual bandit methods is limited by the
over-simplified assumptions of the problem, such as assuming a simple form of
the reward function or assuming a static environment where the states are not
affected by previous actions. In this work, we put forward Policy Gradients for
Contextual Recommendations (PGCR) to solve the problem without those
unrealistic assumptions. It optimizes over a restricted class of policies where
the marginal probability of choosing an item (in expectation of other items)
has a simple closed form, and the gradient of the expected return over the
policy in this class is in a succinct form. Moreover, PGCR leverages two useful
heuristic techniques called Time-Dependent Greed and Actor-Dropout. The former
ensures that PGCR is empirically greedy in the limit, and the latter addresses
the trade-off between exploration and exploitation by using the policy network
with Dropout as a Bayesian approximation. PGCR can solve the standard
contextual bandits as well as its Markov Decision Process generalization.
Therefore it can be applied to a wide range of realistic settings of
recommendations, such as personalized advertising. We evaluate PGCR on toy
datasets as well as a real-world dataset of personalized music recommendations.
Experiments show that PGCR enables fast convergence and low regret, and
outperforms both classic contextual-bandits and vanilla policy gradient
methods.
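To make the two heuristics concrete, here is one REINFORCE-style step for a softmax policy over linear item scores, with dropout applied to the scorer's weights at action time in the spirit of Actor-Dropout; annealing drop_p over time would play the Time-Dependent Greed role. This is our reading, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pgcr_step(w, contexts, reward_fn, lr=0.05, drop_p=0.2):
    """One policy-gradient step with dropout-based exploration (sketch).

    w        : weights of a linear scorer (one score per item context)
    contexts : (n_items, d) array of item feature vectors
    reward_fn: environment callback returning the chosen item's reward
    """
    mask = (rng.random(w.shape) > drop_p) / (1.0 - drop_p)  # inverted dropout
    scores = contexts @ (w * mask)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax policy over items
    item = rng.choice(len(contexts), p=probs)
    reward = reward_fn(item)
    # Score-function (REINFORCE) gradient of log pi(item) for softmax-linear.
    grad_log_pi = contexts[item] - probs @ contexts
    return w + lr * reward * grad_log_pi
```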
Screening for an Infectious Disease as a Problem in Stochastic Control
There has been much recent interest in screening populations for an
infectious disease. Here, we present a stochastic-control model, wherein the
optimum screening policy is provably difficult to find, but wherein Thompson
sampling has provably optimal performance guarantees in the form of Bayesian
regret. Thompson sampling seems especially applicable to diseases whose
dynamics are not yet well understood, such as the super-spreading COVID-19.
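For reference, the Bayesian regret criterion invoked here is the standard one (notation ours, not the paper's): $\mathrm{BR}(T) = \mathbb{E}_{\theta \sim \Pi}\,\mathbb{E}_{\theta}\bigl[\sum_{t=1}^{T} \bigl(r_\theta(a_t^{*}) - r_\theta(a_t)\bigr)\bigr]$, the expected reward shortfall relative to the optimal policy for the drawn parameter $\theta$, averaged over the prior $\Pi$.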
Learning Unknown Service Rates in Queues: A Multi-Armed Bandit Approach
Consider a queueing system consisting of multiple servers. Jobs arrive over
time and enter a queue for service; the goal is to minimize the size of this
queue. At each opportunity for service, at most one server can be chosen, and
at most one job can be served. Service is successful with a probability (the
service probability) that is a priori unknown for each server. An algorithm
that knows the service probabilities (the "genie") can always choose the server
of highest service probability. We study algorithms that learn the unknown
service probabilities. Our goal is to minimize queue-regret: the (expected)
difference between the queue-lengths obtained by the algorithm, and those
obtained by the "genie."
Since queue-regret cannot be larger than classical regret, results for the
standard multi-armed bandit problem give algorithms for which queue-regret
increases no more than logarithmically in time. Our paper shows surprisingly
more complex behavior. In particular, as long as the bandit algorithm's queues
have relatively long regenerative cycles, queue-regret is similar to cumulative
regret, and scales (essentially) logarithmically. However, we show that this
"early stage" of the queueing bandit eventually gives way to a "late stage",
where the optimal queue-regret scaling is $O(1/t)$. We demonstrate an algorithm
that (order-wise) achieves this asymptotic queue-regret in the late stage. Our
results are developed in a more general model that allows for multiple job
classes as well.
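Queue-regret is easy to estimate by coupled simulation: run the learner's queue and the genie's queue on the same arrival stream and track the difference in their lengths. A Monte-Carlo sketch with an epsilon-greedy learner standing in for the paper's algorithms (all parameters ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def queue_regret(mu, horizon, arrival_p=0.3, eps=0.05):
    """One coupled sample path of Q_alg(t) - Q_genie(t) (sketch).

    mu: true (unknown) service probabilities; the genie always uses argmax(mu).
    Average many runs to estimate the queue-regret trajectory.
    """
    n = len(mu)
    q_alg = q_genie = 0
    succ, pulls = np.zeros(n), np.zeros(n)
    diff = np.zeros(horizon)
    for t in range(horizon):
        if rng.random() < arrival_p:           # coupled arrival to both queues
            q_alg += 1
            q_genie += 1
        # epsilon-greedy server choice from empirical service rates
        if rng.random() < eps:
            k = rng.integers(n)
        else:
            k = int(np.argmax(succ / np.maximum(pulls, 1)))
        served = rng.random() < mu[k]
        pulls[k] += 1
        succ[k] += served
        if served and q_alg > 0:
            q_alg -= 1
        if rng.random() < mu.max() and q_genie > 0:
            q_genie -= 1
        diff[t] = q_alg - q_genie
    return diff
```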
Value Directed Exploration in Multi-Armed Bandits with Structured Priors
Multi-armed bandits are a quintessential machine learning problem requiring
the balancing of exploration and exploitation. While there has been progress in
developing algorithms with strong theoretical guarantees, there has been less
focus on practical near-optimal finite-time performance. In this paper, we
propose an algorithm for Bayesian multi-armed bandits that utilizes
value-function-driven online planning techniques. Building on previous work on
UCB and Gittins index, we introduce linearly-separable value functions that
take both the expected return and the benefit of exploration into consideration
to perform n-step lookahead. The algorithm enjoys a sub-linear performance
guarantee and we present simulation results that confirm its strength in
problems with structured priors. The simplicity and generality of our approach
make it a strong candidate for analyzing more complex multi-armed bandit
problems.
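The underlying idea of valuing exploration through lookahead can be shown in a few lines: expand the Beta posterior of a Bernoulli arm over the outcomes of a few hypothetical pulls, so an arm's value combines its expected return with the value of the information a pull would reveal. This is a generic finite-depth Bayesian lookahead, not the paper's linearly-separable construction:

```python
import numpy as np

def lookahead_value(a, b, depth=2, gamma=0.95):
    """Finite-depth lookahead value of a Bernoulli arm with a Beta(a, b)
    posterior: immediate expected return plus the discounted value of the
    information one more pull would reveal (sketch)."""
    mean = a / (a + b)
    if depth == 0:
        return mean / (1.0 - gamma)            # commit to the current estimate
    v_succ = lookahead_value(a + 1, b, depth - 1, gamma)   # posterior if reward
    v_fail = lookahead_value(a, b + 1, depth - 1, gamma)   # posterior if not
    return mean + gamma * (mean * v_succ + (1.0 - mean) * v_fail)

def choose_arm(posteriors, depth=2):
    """Pull the arm whose lookahead value (return plus exploration benefit)
    is largest."""
    return int(np.argmax([lookahead_value(a, b, depth) for a, b in posteriors]))

# e.g. choose_arm([(1, 1), (5, 3), (20, 30)])
```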