In the classical contextual bandits problem, in each round t, a learner
observes some context c, chooses some action a to perform, and receives
some reward r_{a,t}(c). We consider the variant of this problem where, in
addition to receiving the reward r_{a,t}(c), the learner also learns the
values of r_{a,t}(c') for all other contexts c'; i.e., the rewards that
would have been achieved by performing that action under different contexts.
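The cross-learning feedback structure can be sketched as follows; this is a minimal illustration under assumed names (K, C, means, pull are all hypothetical), not an interface from the paper. The key point is that one pull of an action returns a reward for every context, not just the realized one.

```python
import random

# Hypothetical sketch of the cross-learning feedback model: after
# choosing action `a`, the learner observes r_{a,t}(c') for EVERY
# context c', not only the context c realized this round.

K, C = 3, 4  # number of actions, number of contexts (illustrative)

# Stochastic rewards: an assumed mean for each (action, context) pair.
means = [[random.random() for _ in range(C)] for _ in range(K)]

def pull(a):
    """One round of cross-learning feedback for action a:
    returns the full vector (r_{a,t}(c'))_{c' = 0..C-1}."""
    return [means[a][c] + random.gauss(0, 0.1) for c in range(C)]

feedback = pull(0)
assert len(feedback) == C  # one observed reward per context
```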
This variant arises in several strategic settings, such as learning how to bid
in non-truthful repeated auctions (in this setting the context is the decision
maker's private valuation for each auction). We call this problem the
contextual bandits problem with cross-learning. The best algorithms for the
classical contextual bandits problem achieve Õ(√(CKT)) regret
against all stationary policies, where C is the number of contexts, K the
number of actions, and T the number of rounds. We demonstrate algorithms for
the contextual bandits problem with cross-learning that remove the dependence
on C and achieve regret O(√(KT)) (when contexts are stochastic with
known distribution), Õ(K^{1/3} T^{2/3}) (when contexts are stochastic
with unknown distribution), and Õ(√(KT)) (when contexts are
adversarial but rewards are stochastic).

Comment: 48 pages, 5 figures
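To give intuition for why cross-learning removes the dependence on C, here is a minimal sketch (not the paper's exact algorithm) for the stochastic-contexts, known-distribution case: since every pull of an action reveals its reward under all contexts, the learner can collapse each feedback vector to its expectation under the known context distribution and run ordinary K-armed UCB1 on those averages, so the number of contexts never enters the sample counts. All names (`ucb_cross_learning`, `pull`, `context_dist`) are assumptions for illustration.

```python
import math

def ucb_cross_learning(T, K, context_dist, pull):
    """UCB1-style sketch exploiting cross-learning feedback.

    pull(a) -> list of C rewards, one per context (cross-learning).
    context_dist -> known probability of each context.
    Returns the per-action pull counts after T rounds.
    """
    counts = [0] * K
    sums = [0.0] * K  # running sums of distribution-averaged rewards

    def choose(t):
        for a in range(K):           # play each arm once first
            if counts[a] == 0:
                return a
        # Standard UCB1 index on the D-averaged reward estimates.
        return max(range(K), key=lambda a: sums[a] / counts[a]
                   + math.sqrt(2 * math.log(t + 1) / counts[a]))

    for t in range(T):
        a = choose(t)
        rewards = pull(a)            # one observed reward per context
        # Known distribution: collapse the vector to its expectation.
        avg = sum(p * r for p, r in zip(context_dist, rewards))
        counts[a] += 1
        sums[a] += avg
    return counts
```

In a quick simulation with two actions whose distribution-averaged rewards differ, the better action accumulates almost all of the pulls, as expected of a UCB rule with fully shared (context-independent) counters.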