The contextual linear bandit is an important online learning problem where
given arm features, a learning agent selects an arm at each round to maximize
the cumulative rewards in the long run. A line of works, called the clustering
of bandits (CB), utilize the collaborative effect over user preferences and
have shown significant improvements over classic linear bandit algorithms.
However, existing CB algorithms require well-specified linear user models and
can fail when this critical assumption does not hold. Whether robust CB
algorithms can be designed for more practical scenarios with misspecified user
models remains an open problem. In this paper, we are the first to present the
important problem of clustering of bandits with misspecified user models
(CBMUM), where the expected rewards in user models can be perturbed away from
perfect linear models. We devise two robust CB algorithms, RCLUMB and RSCLUMB
(representing the learned clustering structure with dynamic graph and sets,
respectively), that can accommodate the inaccurate user preference estimations
and erroneous clustering caused by model misspecifications. We prove regret
upper bounds of O(ϵ∗TmdlogT+dmTlogT) for our
algorithms under milder assumptions than previous CB works (notably, we move
past a restrictive technical assumption on the distribution of the arms), which
match the lower bound asymptotically in T up to logarithmic factors, and also
match the state-of-the-art results in several degenerate cases. The techniques
in proving the regret caused by misclustering users are quite general and may
be of independent interest. Experiments on both synthetic and real-world data
show our outperformance over previous algorithms