39 research outputs found
Switching Latent Bandits
We consider a latent bandit problem in which the latent state evolves over time according to an underlying Markov chain, and each state is represented by a specific bandit instance. At each step, the agent chooses an arm and observes a random reward, but is unaware of which bandit instance it is currently facing. As is typical in latent bandits, we assume the reward distributions of the arms of all bandit instances are known. Within this setting, our goal is to learn the transition matrix of the Markov process so as to minimize the cumulative regret. We propose a technique for this estimation problem that exploits the properties of Markov chains and reduces to solving a system of linear equations. We present an offline method that chooses the best subset of arms to use for matrix estimation, and we finally introduce the SL-EC learning algorithm, based on an Explore-Then-Commit strategy, which builds a belief representation of the current state and optimizes the instantaneous regret at each step. This algorithm achieves a regret of order O(T^(2/3)),
with T being the interaction horizon. Finally, we illustrate the effectiveness of the approach and compare it with state-of-the-art algorithms for non-stationary bandits.
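The belief-update-and-commit loop described above can be sketched in a few lines. Everything below is a hypothetical illustration: the instance sizes, the Gaussian reward model, and the names `mu`, `P`, and `update_belief` are assumptions, and for brevity the commit phase uses the true transition matrix rather than the estimate SL-EC would compute from its linear system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: S latent states, K arms.
# mu[s, k] = known mean reward of arm k under bandit instance s.
S, K = 3, 4
mu = rng.uniform(0.0, 1.0, size=(S, K))
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])  # transition matrix (unknown to the learner)

def greedy_arm(belief):
    """Pick the arm maximizing expected reward under the current belief."""
    return int(np.argmax(belief @ mu))

def update_belief(belief, arm, reward, P_hat, sigma=0.1):
    """Propagate the belief through the estimated chain, then reweight it
    by the Gaussian likelihood of the observed reward."""
    predicted = P_hat.T @ belief
    lik = np.exp(-0.5 * ((reward - mu[:, arm]) / sigma) ** 2)
    posterior = predicted * lik
    return posterior / posterior.sum()

# Commit phase: play greedily with respect to the maintained belief
# (using the true P here in place of the learned estimate, for brevity).
belief = np.full(S, 1.0 / S)
state = 0
for t in range(100):
    arm = greedy_arm(belief)
    reward = rng.normal(mu[state, arm], 0.1)
    belief = update_belief(belief, arm, reward, P)
    state = rng.choice(S, p=P[state])
```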
Optimal Algorithms for Latent Bandits with Cluster Structure
We consider the problem of latent bandits with cluster structure where there
are multiple users, each with an associated multi-armed bandit problem. These
users are grouped into \emph{latent} clusters such that the mean reward vectors
of users within the same cluster are identical. At each round, a user, selected
uniformly at random, pulls an arm and observes a corresponding noisy reward.
The goal of the users is to maximize their cumulative rewards. This problem is
central to practical recommendation systems and has received wide attention of
late \cite{gentile2014online, maillard2014latent}. Now, if each user acts
independently, then they would have to explore each arm independently, and a
regret of $\Omega(\sqrt{\mathsf{M}\mathsf{N}\mathsf{T}})$ is unavoidable, where $\mathsf{M}$, $\mathsf{N}$ are the number of arms and users, respectively. Instead, we propose
LATTICE (Latent bAndiTs via maTrIx ComplEtion), which allows exploitation of the
latent cluster structure to provide the minimax optimal regret of
$\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$, when the number of
clusters is $\widetilde{O}(1)$. This is the first algorithm to guarantee such a
strong regret bound. LATTICE is based on a careful exploitation of arm
information within a cluster while simultaneously clustering users.
Furthermore, it is computationally efficient and requires only
$O(\log \mathsf{T})$ calls to an offline matrix completion oracle across all
$\mathsf{T}$ rounds. Comment: 48 pages. Accepted to AISTATS 2023. Added experiments.
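As a rough illustration of the two ingredients the abstract names, low-rank completion of the user-by-arm reward matrix followed by clustering of the users, here is a toy NumPy sketch. The SVD-based stand-in for the completion oracle, the two-seed clustering heuristic, and all names and sizes are assumptions for illustration, not the LATTICE algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setting: M users in C latent clusters, N arms.
M, N, C = 30, 8, 2
labels = rng.integers(0, C, size=M)           # true latent clusters (unknown)
centers = rng.uniform(0.0, 1.0, size=(C, N))  # per-cluster mean reward vectors
R = centers[labels]                           # user-by-arm matrix, rank <= C

# Noisy, partially revealed rewards, as gathered during exploration.
mask = rng.random((M, N)) < 0.6
obs = np.where(mask, R + rng.normal(0.0, 0.05, (M, N)), 0.0)

# Crude stand-in for the offline matrix completion oracle: rescale by the
# sampling rate and keep the top-C singular directions.
U, s, Vt = np.linalg.svd(obs / mask.mean(), full_matrices=False)
R_hat = (U[:, :C] * s[:C]) @ Vt[:C]

# Cluster users by their estimated reward rows: seed with the two rows
# farthest apart, then assign every user to the nearer seed.
d0 = np.linalg.norm(R_hat - R_hat[0], axis=1)
seeds = np.stack([R_hat[0], R_hat[np.argmax(d0)]])
assign = np.argmin(
    np.linalg.norm(R_hat[:, None, :] - seeds[None, :, :], axis=2), axis=1)
```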
Improved sequential decision-making with structural priors: Enhanced treatment personalization with historical data
Personalizing treatments for patients involves a period during which different treatments from a set of available treatments are tried until an optimal treatment is found for the particular patient's characteristics. To minimize suffering and other costs, it is critical to keep this search short. When treatments have primarily short-term effects, the search can be performed with multi-armed bandit algorithms (MABs). However, these typically require long exploration periods to guarantee optimality. With historical data, it is possible to recover a structure incorporating prior knowledge of the types of patients that can be encountered, together with the conditional reward models for those patient types. Such structural priors can be used to shorten the treatment exploration period, enhancing applicability in the real world. This thesis presents work on designing MAB algorithms that find optimal treatments quickly by incorporating a structural prior over patient types in the form of a latent variable model. Theoretical guarantees for the algorithms, including a lower bound and a matching upper bound, are provided, together with an empirical study showing that incorporating latent structural priors is beneficial. Another line of work in this thesis is the design of simulators for evaluating treatment policies and comparing algorithms. A new simulator for benchmarking estimators of causal effects, the Alzheimer’s Disease Causal estimation Benchmark (ADCB), is presented. ADCB combines data-driven simulation with subject-matter knowledge for high realism and causal verifiability. The design of the simulator is discussed, and to demonstrate its utility, the results of a usage scenario for evaluating estimators of causal effects are outlined.
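One way to picture a latent structural prior of this kind is a finite mixture of patient types, each with known per-treatment mean outcomes recovered from historical data. The sketch below is a hypothetical illustration only; the type count, the matrix `theta`, and the greedy posterior-based policy are assumptions, not the thesis's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical structural prior: Z patient types with known per-treatment
# mean outcomes (from historical data) and a prior over the types.
theta = np.array([[0.9, 0.2, 0.2, 0.2],
                  [0.2, 0.9, 0.2, 0.2],
                  [0.2, 0.2, 0.9, 0.2]])  # type-by-treatment mean outcomes
prior = np.array([0.5, 0.3, 0.2])
Z, K = theta.shape

def posterior_over_types(prior, history, theta, sigma=0.1):
    """Bayes update of the patient-type posterior from (treatment, outcome) pairs."""
    log_post = np.log(prior)
    for a, r in history:
        log_post += -0.5 * ((r - theta[:, a]) / sigma) ** 2
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Greedy policy: at each visit, give the treatment that is best in
# expectation under the current posterior over patient types.
true_type, history = 1, []
for t in range(30):
    post = posterior_over_types(prior, history, theta)
    a = int(np.argmax(post @ theta))
    history.append((a, rng.normal(theta[true_type, a], 0.1)))
```

The structural prior pays off because a handful of outcomes usually suffices to identify the patient's type, after which the known per-type reward model pinpoints the best treatment without further exploration.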
High Accuracy and Low Regret for User-Cold-Start Using Latent Bandits
We develop a novel latent-bandit algorithm for tackling the cold-start
problem for new users joining a recommender system. This new algorithm
significantly outperforms the state of the art, simultaneously achieving both
higher accuracy and lower regret. Comment: 7 pages, 7 figures. ESANN 2022 conference.
Stochastic Contextual Bandits with Graph-based Contexts
We naturally generalize the on-line graph prediction problem to a version of
stochastic contextual bandit problems where contexts are vertices in a graph
and the structure of the graph provides information on the similarity of
contexts. More specifically, we are given a graph $G=(V,E)$ whose vertex set
represents contexts, with {\em unknown} vertex labels. In our stochastic
contextual bandit setting, vertices with the same label share the same reward
distribution. The standard notion of instance difficulty in graph label
prediction is the cutsize, defined to be the number of edges whose
endpoints have different labels. For line graphs and trees we present an
algorithm with a regret bound expressed in terms of the cutsize and the
number of arms. Our algorithm relies on the optimal stochastic bandit
algorithm by Zimmert and Seldin~[AISTATS'19, JMLR'21]. When the best arm
outperforms the other arms, the regret bound improves further. The regret bound in the latter case is comparable to other optimal
contextual bandit results in more general cases, but our algorithm is easy to
analyze, runs very efficiently, and does not require an i.i.d. assumption on
the input context sequence. The algorithm also works with general graphs using
a standard random spanning tree reduction.
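The cutsize difficulty measure mentioned above is easy to make concrete; a minimal sketch on a line graph (the edge list and labels are illustrative):

```python
# Cutsize of a labeled graph: the number of edges whose endpoints carry
# different labels -- the difficulty measure used in the abstract above.
def cutsize(edges, label):
    return sum(1 for u, v in edges if label[u] != label[v])

# A line graph on six vertices split into two label segments has one cut edge.
line_edges = [(i, i + 1) for i in range(5)]
label = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(cutsize(line_edges, label))  # -> 1
```

A small cutsize means same-label contexts form long contiguous stretches, so observations can be shared across neighboring vertices cheaply.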
Diffusion Models Meet Contextual Bandits with Large Action Spaces
Efficient exploration is a key challenge in contextual bandits due to the
large size of their action space, where uninformed exploration can result in
computational and statistical inefficiencies. Fortunately, the rewards of
actions are often correlated and this can be leveraged to explore them
efficiently. In this work, we capture such correlations using pre-trained
diffusion models; upon which we design diffusion Thompson sampling (dTS). Both
theoretical and algorithmic foundations are developed for dTS, and empirical
evaluation also shows its favorable performance. Comment: 26 pages, 5 figures.
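dTS itself samples from a pre-trained diffusion prior; as a much-simplified stand-in, the sketch below captures only the underlying idea of Thompson sampling over correlated arm rewards driven by a shared latent variable, using a linear-Gaussian model. All names, sizes, and the linear structure are assumptions, not the dTS construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simplification: a shared latent vector psi induces all K arm
# means through known features W, standing in for the richer structure a
# pre-trained diffusion model provides in dTS.
K, d = 20, 3
W = rng.normal(size=(K, d))      # arm features (assumed known here)
psi_true = rng.normal(size=d)
means = W @ psi_true             # correlated arm means

# Thompson sampling with a Bayesian linear-Gaussian posterior over psi.
sigma2, prior_prec = 0.25, 1.0
A = prior_prec * np.eye(d)       # posterior precision
b = np.zeros(d)                  # precision-weighted posterior mean
for t in range(200):
    cov = np.linalg.inv(A)
    psi = rng.multivariate_normal(cov @ b, cov)  # posterior sample of latent
    k = int(np.argmax(W @ psi))                  # act greedily on the sample
    r = rng.normal(means[k], np.sqrt(sigma2))
    A += np.outer(W[k], W[k]) / sigma2
    b += W[k] * r / sigma2
```

Because every observed reward updates the posterior over the shared latent, information gained from one arm transfers to all correlated arms, which is the efficiency the abstract attributes to exploiting reward correlations.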