Incentivized Exploration for Multi-Armed Bandits under Reward Drift
We study incentivized exploration for the multi-armed bandit (MAB) problem
where the players receive compensation for exploring arms other than the greedy
choice and may provide biased feedback on reward. We seek to understand the
impact of this drifted reward feedback by analyzing the performance of three
instantiations of the incentivized MAB algorithm: UCB, ε-Greedy,
and Thompson Sampling. Our results show that they all achieve O(log T) regret
and compensation under the drifted reward, and are therefore
effective in incentivizing exploration. Numerical examples are provided to
complement the theoretical analysis.
Comment: 10 pages, 2 figures, AAAI 2020
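As a rough illustration of the setup above, here is a minimal sketch of an incentivized ε-Greedy loop under reward drift, assuming Bernoulli arms, compensation equal to the empirical gap, and a linear drift model; the names (e.g. drift_scale) and the drift model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 5, 10_000
true_means = rng.uniform(0.2, 0.8, size=K)    # hypothetical Bernoulli arm means
drift_scale = 0.1                             # assumed bias per unit of compensation

counts = np.ones(K)                           # one initial pull per arm
sums = rng.binomial(1, true_means).astype(float)

total_comp = total_regret = 0.0
for t in range(1, T + 1):
    means = sums / counts
    greedy = int(np.argmax(means))            # the player's myopic choice
    explore = rng.random() < min(1.0, K / t)  # decaying exploration rate
    arm = int(rng.integers(K)) if explore else greedy
    # compensation: pay the empirical gap so the player accepts the recommendation
    comp = max(0.0, means[greedy] - means[arm])
    total_comp += comp
    reward = rng.binomial(1, true_means[arm])
    # drifted feedback: the player reports a reward inflated by the compensation
    counts[arm] += 1
    sums[arm] += reward + drift_scale * comp
    total_regret += true_means.max() - true_means[arm]

print(f"regret ~ {total_regret:.1f}, compensation ~ {total_comp:.1f}")
```

In this toy model the bias a player adds to its feedback grows with the compensation it receives, which is exactly the coupling between regret and compensation that the paper's analysis has to control.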
Incentivizing Exploration with Linear Contexts and Combinatorial Actions
We advance the study of incentivized bandit exploration, in which arm choices
are viewed as recommendations and are required to be Bayesian incentive
compatible. Recent work has shown under certain independence assumptions that
after collecting enough initial samples, the popular Thompson sampling
algorithm becomes incentive compatible. We give an analog of this result for
linear bandits, where the independence of the prior is replaced by a natural
convexity condition. This opens up the possibility of efficient and
regret-optimal incentivized exploration in high-dimensional action spaces. In
the semibandit model, we also improve the sample complexity for the
pre-Thompson sampling phase of initial data collection.
Comment: International Conference on Machine Learning (ICML) 2023
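For context, the following is a minimal sketch of the linear Thompson sampling loop this result builds on, i.e. standard Bayesian linear regression with a Gaussian prior; the paper's incentive-compatibility machinery, convexity condition, and initial data-collection phase are all omitted, and the finite action set is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

d, T, noise = 3, 2_000, 0.1
theta_star = rng.normal(size=d)             # unknown parameter (simulated)
actions = rng.normal(size=(20, d))          # assumed fixed, finite action set

# Gaussian posterior over theta from standard Bayesian linear regression,
# starting from an (assumed) N(0, I) prior.
precision = np.eye(d)
b = np.zeros(d)

for t in range(T):
    cov = np.linalg.inv(precision)
    mu = cov @ b
    theta = rng.multivariate_normal(mu, cov)  # posterior sample
    x = actions[np.argmax(actions @ theta)]   # recommend the arm greedy w.r.t. the sample
    r = x @ theta_star + noise * rng.normal()
    precision += np.outer(x, x) / noise**2    # posterior update
    b += x * r / noise**2

print("estimate:", np.linalg.inv(precision) @ b)
print("truth:   ", theta_star)
```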
Reward Teaching for Federated Multi-armed Bandits
Most of the existing federated multi-armed bandits (FMAB) designs are based
on the presumption that clients will implement the specified design to
collaborate with the server. In reality, however, it may not be possible to
modify the clients' existing protocols. To address this challenge, this work
focuses on clients who always maximize their individual cumulative rewards, and
introduces a novel idea of "reward teaching", where the server guides the
clients towards global optimality through implicit local reward adjustments.
Under this framework, the server faces two tightly coupled tasks of bandit
learning and target teaching, whose combination is non-trivial and challenging.
A phased approach, called Teaching-After-Learning (TAL), is first designed to
encourage and discourage clients' explorations separately. General performance
analyses of TAL are established when the clients' strategies satisfy certain
mild requirements. With novel technical approaches developed to analyze the
warm-start behaviors of bandit algorithms, particularized guarantees of TAL
with clients running UCB or epsilon-greedy strategies are then obtained. These
results demonstrate that TAL achieves logarithmic regrets while only incurring
logarithmic adjustment costs, which is order-optimal w.r.t. a natural lower
bound. As a further extension, the Teaching-While-Learning (TWL) algorithm is
developed with the idea of successive arm elimination to break the non-adaptive
phase separation in TAL. Rigorous analyses demonstrate that when facing clients
running UCB1, TWL outperforms TAL in terms of its dependence on the sub-optimality
gaps thanks to its adaptive design. Experimental results demonstrate the
effectiveness and generality of the proposed algorithms.
Comment: Accepted to IEEE Transactions on Signal Processing
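To make the reward-teaching idea concrete, here is a toy sketch of a server steering a UCB1 client toward a target arm by penalizing the rewards of all other arms; the fixed penalty and the arm means are hypothetical, and this is not TAL or TWL (which phase and adapt the adjustments to keep costs logarithmic), only an illustration of implicit local reward adjustment.

```python
import numpy as np

rng = np.random.default_rng(2)

K, T = 4, 5_000
local_means = np.array([0.9, 0.6, 0.5, 0.3])  # client's local arm means (hypothetical)
target = 2                                    # arm the server considers globally optimal
penalty = 0.5                                 # illustrative fixed adjustment, not TAL's schedule

counts = np.zeros(K)
sums = np.zeros(K)
teaching_cost = 0.0

for t in range(1, T + 1):
    if t <= K:                                # client pulls each arm once, then runs UCB1
        arm = t - 1
    else:
        arm = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
    reward = rng.binomial(1, local_means[arm])
    # reward teaching: the server implicitly discourages every non-target arm
    adjustment = 0.0 if arm == target else -penalty
    teaching_cost += abs(adjustment)
    counts[arm] += 1
    sums[arm] += reward + adjustment          # the client only ever sees adjusted rewards

print("pull fractions:", counts / T)          # mass should concentrate on the target arm
print("teaching cost: ", teaching_cost)
```

Because UCB1 pulls each arm that looks suboptimal under the adjusted rewards only logarithmically often, the accumulated adjustment cost in this toy run also grows logarithmically, mirroring the order-optimality claim in the abstract.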