Stochastic Contextual Bandits with Graph-based Contexts
We naturally generalize the on-line graph prediction problem to a version of
stochastic contextual bandit problems where contexts are vertices in a graph
and the structure of the graph provides information on the similarity of
contexts. More specifically, we are given a graph $G=(V,E)$ whose vertex set $V$ represents contexts with {\em unknown} vertex labels. In our stochastic
contextual bandit setting, vertices with the same label share the same reward
distribution. The standard notion of instance difficulty in graph label prediction is the cutsize, defined to be the number of edges whose endpoints have different labels. For line graphs and trees we present an algorithm whose regret bound depends on the time horizon, the number of arms, and the cutsize. Our algorithm relies on the optimal stochastic bandit algorithm by Zimmert and Seldin~[AISTATS'19, JMLR'21]. When the best arm outperforms the other arms, the regret improves; the regret bound in the latter case is comparable to other optimal
contextual bandit results in more general cases, but our algorithm is easy to
analyze, runs very efficiently, and does not require an i.i.d. assumption on
the input context sequence. The algorithm also works with general graphs using
a standard random spanning tree reduction.
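Since the cutsize is the only instance-dependent quantity in the regret bound, it helps to see it concretely. Below is a minimal sketch of the definition on a labeled line graph; networkx is used only as a convenient graph container, and in the bandit setting the labels would of course be unknown to the learner.

```python
import networkx as nx

def cutsize(graph: nx.Graph, labels: dict) -> int:
    """Count edges whose two endpoints carry different labels."""
    return sum(1 for u, v in graph.edges() if labels[u] != labels[v])

# Toy line graph 0-1-2-3-4 with one label change, so the cutsize is 1.
line = nx.path_graph(5)
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
print(cutsize(line, labels))  # -> 1
```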
Reward Imputation with Sketching for Contextual Batched Bandits
Contextual batched bandit (CBB) is a setting where a batch of rewards is
observed from the environment at the end of each episode, but the rewards of
the non-executed actions are unobserved, resulting in partial-information
feedback. Existing approaches for CBB often ignore the rewards of the
non-executed actions, leading to underutilization of feedback information. In
this paper, we propose an efficient approach called Sketched Policy Updating
with Imputed Rewards (SPUIR) that completes the unobserved rewards using
sketching, which approximates the full-information feedback. We formulate
reward imputation as an imputation regularized ridge regression problem that
captures the feedback mechanisms of both executed and non-executed actions. To
reduce time complexity, we solve the regression problem using randomized
sketching. We prove that our approach achieves an instantaneous regret with
controllable bias and smaller variance than approaches without reward
imputation. Furthermore, our approach enjoys a sublinear regret bound against
the optimal policy. We also present two extensions, a rate-scheduled version
and a version for nonlinear rewards, making our approach more practical.
Experimental results show that SPUIR outperforms state-of-the-art baselines on
synthetic, public benchmark, and real-world datasets.
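To illustrate the central computational idea, here is a generic sketched ridge regression step in the spirit of the abstract: fit a reward model on compressed (sketched) data for the executed actions, then impute rewards for the non-executed ones. This is only a hedged sketch under assumed linear rewards and a Gaussian sketching matrix; `sketched_ridge` and all dimensions are illustrative, not the authors' SPUIR implementation.

```python
import numpy as np

def sketched_ridge(X, y, lam=1.0, sketch_rows=64, seed=0):
    """Ridge regression on sketched data: compress the n rows of (X, y)
    to sketch_rows rows with a Gaussian sketch, then solve the much
    smaller normal equations (illustrative, not the paper's exact sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    Xs, ys = S @ X, S @ y
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

# Fit on executed actions, then impute rewards for non-executed ones.
rng = np.random.default_rng(1)
X_exec = rng.standard_normal((500, 10))   # contexts of executed actions
w_true = rng.standard_normal(10)          # assumed linear reward model
r_exec = X_exec @ w_true + 0.1 * rng.standard_normal(500)
w_hat = sketched_ridge(X_exec, r_exec)
X_miss = rng.standard_normal((5, 10))     # contexts of non-executed actions
r_imputed = X_miss @ w_hat                # approximate full-information feedback
print(r_imputed)
```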
Clustered Multi-Agent Linear Bandits
In this paper we address a particular instance of the multi-agent linear stochastic bandit problem, called clustered multi-agent linear bandits. For this setting, we propose a novel algorithm that leverages efficient collaboration between the agents to accelerate the overall optimization. A network controller is responsible for estimating the underlying cluster structure of the network and for optimizing experience sharing among agents within the same group. We provide a theoretical analysis
for both the regret minimization problem and the clustering quality. Through
empirical evaluation against state-of-the-art algorithms on both synthetic and
real data, we demonstrate the effectiveness of our approach: our algorithm
significantly improves regret minimization while managing to recover the true
underlying cluster partitioning.
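A minimal sketch of the two roles the abstract assigns to the network controller: grouping agents by the distance between their current parameter estimates, and pooling ridge regression statistics within a group. Both functions are illustrative assumptions (a simple threshold rule and plain statistic summing), not the paper's exact procedure.

```python
import numpy as np

def cluster_agents(theta_hats, threshold):
    """Greedy grouping: agent i joins the cluster of the first earlier
    agent whose estimate is within `threshold` (illustrative rule)."""
    cluster_of, next_id = [], 0
    for i, th in enumerate(theta_hats):
        match = next((cluster_of[j] for j in range(i)
                      if np.linalg.norm(th - theta_hats[j]) <= threshold), None)
        if match is None:
            match, next_id = next_id, next_id + 1
        cluster_of.append(match)
    return cluster_of

def pooled_estimate(cluster_data, lam=1.0):
    """Experience sharing: sum each member's Gram matrix X^T X and moment
    vector X^T r, then solve a single ridge problem for the whole cluster."""
    d = cluster_data[0][0].shape[1]
    A, b = lam * np.eye(d), np.zeros(d)
    for X, r in cluster_data:
        A += X.T @ X
        b += X.T @ r
    return np.linalg.solve(A, b)
```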
Local Clustering in Contextual Multi-Armed Bandits
We study the identification of user clusters in contextual multi-armed bandits (MAB).
Contextual MAB is an effective tool for many real applications, such as content
recommendation and online advertisement. In practice, user dependency plays an
essential role in users' actions, and thus in the rewards. Clustering similar
users can improve the quality of reward estimation, which in turn leads to more
effective content recommendation and targeted advertising. Different from
traditional clustering settings, we cluster users based on the unknown bandit
parameters, which will be estimated incrementally. In particular, we define the
problem of cluster detection in contextual MAB, and propose a bandit algorithm,
LOCB, embedded with a local clustering procedure. We provide a theoretical analysis of LOCB in terms of the correctness and efficiency of its clustering and its regret bound. Finally, we evaluate the proposed algorithm from various aspects, showing that it outperforms state-of-the-art baselines.
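To make the setting concrete, here is a hedged sketch of clustering on incrementally estimated bandit parameters: each user carries a ridge estimate refined online, and the local cluster of a seed user collects those whose estimates fall within a confidence radius. The `UserEstimate` class and the radius schedule are illustrative assumptions, not LOCB's actual rule.

```python
import numpy as np

class UserEstimate:
    """Ridge estimate of one user's unknown bandit parameter, refined
    incrementally as (context, reward) pairs arrive."""
    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)
        self.t = 0                 # observations so far

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
        self.t += 1

    @property
    def theta(self):
        return np.linalg.solve(self.A, self.b)

def local_cluster(seed, users, radius=lambda t: 1.0 / np.sqrt(max(t, 1))):
    """Detect the seed user's cluster: all users whose current estimates
    lie within the combined confidence radii (illustrative schedule)."""
    center = users[seed].theta
    return [i for i, u in enumerate(users)
            if np.linalg.norm(u.theta - center) <= radius(users[seed].t) + radius(u.t)]
```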
Optimal Algorithms for Latent Bandits with Cluster Structure
We consider the problem of latent bandits with cluster structure where there
are multiple users, each with an associated multi-armed bandit problem. These
users are grouped into \emph{latent} clusters such that the mean reward vectors
of users within the same cluster are identical. At each round, a user, selected
uniformly at random, pulls an arm and observes a corresponding noisy reward.
The goal of the users is to maximize their cumulative rewards. This problem is
central to practical recommendation systems and has received wide attention of
late \cite{gentile2014online, maillard2014latent}. Now, if each user acts independently, then they would have to explore each arm independently, and a regret of $\Omega(\sqrt{MNT})$ is unavoidable, where $M$ and $N$ are the number of arms and users, respectively. Instead, we propose
LATTICE (Latent bAndiTs via maTrIx ComplEtion) which allows exploitation of the
latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(M+N)T})$ when the number of clusters is $\widetilde{O}(1)$. This is the first algorithm to guarantee such a strong regret bound. LATTICE is based on a careful exploitation of arm
information within a cluster while simultaneously clustering users.
Furthermore, it is computationally efficient and requires only $O(\log T)$ calls to an offline matrix completion oracle across all $T$ rounds.
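The structural fact behind LATTICE is that the $N \times M$ matrix of mean rewards has rank at most the number of clusters, so a noisy estimate of it can be denoised by low-rank projection. Below is a toy illustration with a truncated-SVD stand-in for the offline matrix completion oracle; `lowrank_complete`, the fully observed entries, and the noise level are simplifying assumptions, not the paper's algorithm.

```python
import numpy as np

def lowrank_complete(R_noisy, rank):
    """Toy matrix-completion oracle: project a noisy (here fully observed,
    for simplicity) reward estimate onto the set of rank-`rank` matrices."""
    U, s, Vt = np.linalg.svd(R_noisy, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
n_users, n_arms, n_clusters = 30, 12, 2
means = rng.uniform(0, 1, (n_clusters, n_arms))   # one reward vector per cluster
assign = rng.integers(0, n_clusters, n_users)     # latent cluster of each user
R = means[assign]                                 # rank <= n_clusters by construction
R_hat = lowrank_complete(R + 0.2 * rng.standard_normal(R.shape), n_clusters)
print(np.abs(R_hat - R).max())                    # entrywise error after denoising
```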