A Contextual Bandit Bake-off
Contextual bandit algorithms are essential for solving many real-world
interactive machine learning problems. Despite multiple recent successes on
statistically and computationally efficient methods, the practical behavior of
these algorithms is still poorly understood. We leverage the availability of
large numbers of supervised learning datasets to empirically evaluate
contextual bandit algorithms, focusing on practical methods that learn by
relying on optimization oracles from supervised learning. We find that a recent
method (Foster et al., 2018) using optimism under uncertainty works the best
overall. A surprisingly close second is a simple greedy baseline that only
explores implicitly through the diversity of contexts, followed by a variant of
Online Cover (Agarwal et al., 2014) which tends to be more conservative but
robust to problem specification by design. Along the way, we also evaluate
various components of contextual bandit algorithm design such as loss
estimators. Overall, this is a thorough study and review of contextual bandit
methodology.
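The greedy baseline singled out above can be sketched concretely: fit a regression oracle on observed (context, action, loss) triples and always play the action with the lowest predicted loss, so that exploration happens only through context diversity. The sketch below is illustrative, not the paper's implementation; the per-action ridge oracle and all names are assumptions.

```python
# Hypothetical sketch of the greedy contextual-bandit baseline: one ridge
# regression oracle per action, always play the predicted-loss minimizer.
import numpy as np

class GreedyBandit:
    def __init__(self, n_actions, dim, reg=1.0):
        # Per-action ridge regression in normal-equation form: A w = b.
        self.A = [reg * np.eye(dim) for _ in range(n_actions)]
        self.b = [np.zeros(dim) for _ in range(n_actions)]

    def act(self, x):
        # Predict a loss for each action; pick the smallest (pure greed).
        preds = [x @ np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
        return int(np.argmin(preds))

    def update(self, x, action, loss):
        # Online ridge update for the chosen action only.
        self.A[action] += np.outer(x, x)
        self.b[action] += loss * x

rng = np.random.default_rng(0)
bandit = GreedyBandit(n_actions=3, dim=5)
w_true = rng.normal(size=(3, 5))          # hidden per-action loss models
for _ in range(2000):
    x = rng.normal(size=5)
    a = bandit.act(x)
    bandit.update(x, a, w_true[a] @ x + 0.1 * rng.normal())
```

With diverse Gaussian contexts, the greedy rule still ends up sampling every action, which is exactly the implicit exploration the study observes.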
Stochastic Graph Bandit Learning with Side-Observations
In this paper, we investigate the stochastic contextual bandit with general
function space and graph feedback. We propose an algorithm that addresses this
problem by adapting to both the underlying graph structures and reward gaps. To
the best of our knowledge, our algorithm is the first to provide a
gap-dependent upper bound in this stochastic setting, bridging the research gap
left by the work in [35]. In comparison to [31,33,35], our method offers
improved regret upper bounds and does not require knowledge of graphical
quantities. We conduct numerical experiments to demonstrate the computational
efficiency and effectiveness of our approach in terms of regret upper bounds.
These findings highlight the significance of our algorithm in advancing the
field of stochastic contextual bandits with graph feedback, opening up avenues
for practical applications in various domains.
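The side-observation mechanism central to this setting can be illustrated with a minimal, non-contextual sketch (all names and the UCB-style rule are assumptions, not the paper's algorithm): pulling an arm also reveals the losses of its out-neighbors in a feedback graph, so every observed arm's estimate is updated, not just the pulled one.

```python
# Toy sketch of graph feedback: each pull updates the whole observed
# out-neighborhood, accelerating learning relative to bandit feedback.
import numpy as np

rng = np.random.default_rng(1)
n_arms = 4
means = np.array([0.9, 0.6, 0.5, 0.2])                       # hidden mean losses
graph = {0: {0, 1}, 1: {1, 0, 2}, 2: {2, 1, 3}, 3: {3, 2}}   # out-neighborhoods

counts = np.zeros(n_arms)      # observations per arm (pulls + side-observations)
loss_sums = np.zeros(n_arms)
pulls = np.zeros(n_arms)       # actual pulls per arm

def pull(arm):
    # Observe a noisy loss for the pulled arm and every graph neighbor.
    for j in graph[arm]:
        counts[j] += 1
        loss_sums[j] += means[j] + 0.05 * rng.normal()

for t in range(500):
    # Lower-confidence-bound choice on estimated losses (we minimize loss);
    # never-observed arms get -inf and are forced first.
    with np.errstate(divide="ignore", invalid="ignore"):
        est = np.where(counts > 0, loss_sums / counts, -np.inf)
        bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1e-9))
    arm = int(np.argmin(est - bonus))
    pulls[arm] += 1
    pull(arm)
```

Because pulling the best arm (arm 3 here) also keeps refreshing its neighbor's estimate, suboptimal arms need far fewer direct pulls than under ordinary bandit feedback.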
Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression
We consider contextual bandits with linear constraints (CBwLC), a variant of
contextual bandits in which the algorithm consumes multiple resources subject
to linear constraints on total consumption. This problem generalizes contextual
bandits with knapsacks (CBwK), allowing for packing and covering constraints,
as well as positive and negative resource consumption. We provide the first
algorithm for CBwLC (or CBwK) that is based on regression oracles. The
algorithm is simple, computationally efficient, and admits vanishing regret. It
is statistically optimal for the variant of CBwK in which the algorithm must
stop once some constraint is violated. Further, we provide the first
vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the
stochastic environment. We side-step strong impossibility results from prior
work by identifying a weaker (and, arguably, fairer) benchmark to compare
against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a
Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML
2020), a regression-based technique for contextual bandits. Our analysis
leverages the inherent modularity of both techniques.
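The SquareCB building block named above is simple enough to sketch. Its core is inverse-gap weighting: given any regression oracle's loss predictions, each suboptimal action is played with probability inversely proportional to its predicted gap, with the remaining mass on the greedy action. This is a minimal sketch of that weighting scheme, not of the full CBwLC algorithm.

```python
# Inverse-gap weighting from SquareCB (Foster and Rakhlin, ICML 2020):
# map oracle loss predictions to an exploration distribution.
import numpy as np

def squarecb_probs(pred_losses, gamma):
    """Larger gamma concentrates more mass on the greedy action."""
    pred_losses = np.asarray(pred_losses, dtype=float)
    k = len(pred_losses)
    best = int(np.argmin(pred_losses))
    gaps = pred_losses - pred_losses[best]
    probs = 1.0 / (k + gamma * gaps)       # small probability for large gaps
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()        # leftover mass on the greedy action
    return probs

p = squarecb_probs([0.2, 0.5, 0.9], gamma=10.0)
```

The Lagrangian layer can then reweight the oracle's predicted losses by resource-consumption terms before calling this routine, which is the modularity the analysis exploits.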
Infinite Action Contextual Bandits with Reusable Data Exhaust
For infinite action contextual bandits, smoothed regret and reduction to
regression results in state-of-the-art online performance with computational
cost independent of the action set: unfortunately, the resulting data exhaust
does not have well-defined importance-weights. This frustrates the execution of
downstream data science processes such as offline model selection. In this
paper we describe an online algorithm with an equivalent smoothed regret
guarantee, but which generates well-defined importance weights: in exchange,
the online computational cost increases, but only to order smoothness (i.e.,
still independent of the action set). This removes a key obstacle to adoption
of smoothed regret in production scenarios.
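The "data exhaust" issue above is about importance weights. A minimal sketch (names illustrative) of why they matter downstream: when each logged interaction records the propensity with which its action was chosen, inverse-propensity scoring (IPS) gives an unbiased offline estimate of a new policy's loss, which is exactly what offline model selection needs.

```python
# Inverse-propensity scoring over logged bandit data: reweight each logged
# loss by (target probability) / (logging probability).
import numpy as np

def ips_estimate(logged, target_policy):
    """logged: iterable of (context, action, loss, propensity) tuples."""
    total = 0.0
    for x, a, loss, p in logged:
        total += target_policy(x, a) / p * loss
    return total / len(logged)

rng = np.random.default_rng(2)
# Logging policy: uniform over 2 actions; loss depends only on the action.
logged = []
for _ in range(20000):
    x = rng.normal()
    a = int(rng.integers(2))
    logged.append((x, a, 0.2 + 0.8 * a, 0.5))

# Target policy: always plays action 0, whose true loss is 0.2.
est = ips_estimate(logged, lambda x, a: 1.0 if a == 0 else 0.0)
```

Without well-defined propensities `p`, this estimator (and its doubly-robust refinements) cannot be formed, which is the obstacle the paper removes.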
Counterfactual Optimism: Rate Optimal Regret for Stochastic Contextual MDPs
We present the UCRL algorithm for regret minimization in Stochastic
Contextual MDPs (CMDPs). The algorithm operates under the minimal assumptions
of realizable function class, and access to offline least squares and log loss
regression oracles. Our algorithm is efficient (assuming efficient offline
regression oracles) and enjoys a rate-optimal regret guarantee, with T
being the number of episodes, S the state space, A the action space, H the
horizon, and P and F finite function classes used to approximate the
context-dependent dynamics and rewards,
respectively. To the best of our knowledge, our algorithm is the first
efficient and rate-optimal regret minimization algorithm for CMDPs, which
operates under the general offline function approximation setting.
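The algorithm above assumes access to an offline log-loss regression oracle for the dynamics. As a toy, tabular stand-in (illustrative only, not the paper's general function-approximation oracle): over the tabular class, minimizing log loss reduces to maximum likelihood, i.e. normalized transition counts.

```python
# Tabular log-loss "oracle": the MLE dynamics model is just normalized counts.
import numpy as np

def fit_dynamics_mle(transitions, n_states, n_actions):
    """transitions: list of (s, a, s') tuples; returns P_hat[s, a, s']."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s2 in transitions:
        counts[s, a, s2] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Uniform fallback for never-visited (s, a) pairs.
    return np.divide(counts, totals,
                     out=np.full_like(counts, 1.0 / n_states),
                     where=totals > 0)

rng = np.random.default_rng(4)
P_true = rng.dirichlet(np.ones(3), size=(3, 2))      # hidden dynamics
data = []
for _ in range(30000):
    s, a = int(rng.integers(3)), int(rng.integers(2))
    s2 = int(rng.choice(3, p=P_true[s, a]))
    data.append((s, a, s2))
P_hat = fit_dynamics_mle(data, n_states=3, n_actions=2)
```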
Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits
We consider offline policy optimization (OPO) in contextual bandits, where
one is given a fixed dataset of logged interactions. While pessimistic
regularizers are typically used to mitigate distribution shift, prior
implementations thereof are either specialized or computationally inefficient.
We present the first general oracle-efficient algorithm for pessimistic OPO: it
reduces to supervised learning, leading to broad applicability. We obtain
statistical guarantees analogous to those for prior pessimistic approaches. We
instantiate our approach for both discrete and continuous actions and perform
experiments in both settings, showing advantage over unregularized OPO across a
wide range of configurations.
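The pessimism principle above can be illustrated in a few lines: score each candidate policy by its estimated loss *plus* an uncertainty penalty, so a policy that is poorly covered by the logged data is not selected on the strength of a few lucky samples. The penalty form and all names here are illustrative assumptions, not the paper's construction.

```python
# Pessimistic offline policy selection: IPS loss estimate plus a penalty
# that shrinks as the policy's effective sample size in the log grows.
import numpy as np

def pessimistic_score(logged, policy, alpha=1.0):
    weights = np.array([policy(x, a) / p for x, a, _, p in logged])
    losses = np.array([l for _, _, l, _ in logged])
    est = float(weights @ losses) / len(logged)          # IPS loss estimate
    eff = weights.sum() ** 2 / max((weights ** 2).sum(), 1e-12)  # effective n
    return est + alpha / np.sqrt(max(eff, 1.0))          # pessimism penalty

rng = np.random.default_rng(3)
logged = []
for _ in range(5000):
    x = rng.normal()
    a = int(rng.integers(3))                 # uniform logging over 3 actions
    loss = [0.3, 0.6, 0.1][a] + 0.05 * rng.normal()
    logged.append((x, a, loss, 1.0 / 3.0))

# Three deterministic candidate policies, one per action.
policies = {k: (lambda x, a, k=k: float(a == k)) for k in range(3)}
scores = {k: pessimistic_score(logged, pi) for k, pi in policies.items()}
best = min(scores, key=scores.get)
```

Here every candidate is well covered, so pessimism agrees with the plain IPS ranking; under distribution shift, the penalty is what steers selection away from under-covered policies.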