Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization
In many applications, e.g. in healthcare and e-commerce, the goal of a
contextual bandit may be to learn an optimal treatment assignment policy at the
end of the experiment, that is, to minimize simple regret. However, this
objective remains understudied. We propose a new family of computationally
efficient bandit algorithms for the stochastic contextual bandit setting, where
a tuning parameter determines the weight placed on cumulative regret
minimization (where we establish near-optimal minimax guarantees) versus simple
regret minimization (where we establish state-of-the-art guarantees). Our
algorithms work with any function class, are robust to model misspecification,
and can be used in continuous arm settings. This flexibility comes from
constructing and relying on "conformal arm sets" (CASs). CASs provide a set of
arms for every context, encompassing the context-specific optimal arm with a
certain probability across the context distribution. Our positive results on
simple and cumulative regret guarantees are contrasted with a negative result,
which shows that no algorithm can achieve instance-dependent simple regret
guarantees while simultaneously achieving minimax optimal cumulative regret
guarantees.
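
For readers unfamiliar with the two objectives, the following is a sketch of the standard textbook definitions, not notation taken from the paper. All symbols are assumptions introduced here for illustration: contexts x_t are drawn from a distribution D, A is the arm set, f*(x, a) is the expected reward of arm a in context x, pi* is the optimal policy, pi-hat is the policy returned after T rounds, C(x) is a conformal arm set, and zeta is the allowed miscoverage level.

```latex
% Assumed notation (illustrative, not from the source abstract).
% Optimal policy: the best arm for each context.
\pi^*(x) \in \arg\max_{a \in \mathcal{A}} f^*(x, a)

% Cumulative regret: reward lost during the experiment itself.
\mathrm{Reg}_T = \sum_{t=1}^{T} \bigl[ f^*(x_t, \pi^*(x_t)) - f^*(x_t, a_t) \bigr]

% Simple regret: reward lost by the policy learned at the end of the experiment.
\mathrm{SReg}(\hat{\pi}) = \mathbb{E}_{x \sim D} \bigl[ f^*(x, \pi^*(x)) - f^*(x, \hat{\pi}(x)) \bigr]

% Coverage property of a conformal arm set, as described in the abstract:
% the set contains the optimal arm with probability at least 1 - \zeta over contexts.
\Pr_{x \sim D} \bigl[ \pi^*(x) \in C(x) \bigr] \ge 1 - \zeta
```

Under these definitions, cumulative regret penalizes every suboptimal pull during the experiment, whereas simple regret only evaluates the final policy. This difference is why the two objectives can call for different amounts of exploration.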