44 research outputs found
Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression
We consider contextual bandits with linear constraints (CBwLC), a variant of
contextual bandits in which the algorithm consumes multiple resources subject
to linear constraints on total consumption. This problem generalizes contextual
bandits with knapsacks (CBwK), allowing for packing and covering constraints,
as well as positive and negative resource consumption. We provide the first
algorithm for CBwLC (or CBwK) that is based on regression oracles. The
algorithm is simple, computationally efficient, and admits vanishing regret. It
is statistically optimal for the variant of CBwK in which the algorithm must
stop once some constraint is violated. Further, we provide the first
vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the
stochastic environment. We side-step strong impossibility results from prior
work by identifying a weaker (and, arguably, fairer) benchmark to compare
against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a
Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML
2020), a regression-based technique for contextual bandits. Our analysis
leverages the inherent modularity of both techniques
Adversarial Bandits with Knapsacks
We consider Bandits with Knapsacks (henceforth, BwK), a general model for
multi-armed bandits under supply/budget constraints. In particular, a bandit
algorithm needs to solve a well-known knapsack problem: find an optimal packing
of items into a limited-size knapsack. The BwK problem is a common
generalization of numerous motivating examples, which range from dynamic
pricing to repeated auctions to dynamic ad allocation to network routing and
scheduling. While the prior work on BwK focused on the stochastic version, we
pioneer the other extreme in which the outcomes can be chosen adversarially.
This is a considerably harder problem, compared to both the stochastic version
and the "classic" adversarial bandits, in that regret minimization is no longer
feasible. Instead, the objective is to minimize the competitive ratio: the
ratio of the benchmark reward to the algorithm's reward.
We design an algorithm with competitive ratio O(log T) relative to the best
fixed distribution over actions, where T is the time horizon; we also prove a
matching lower bound. The key conceptual contribution is a new perspective on
the stochastic version of the problem. We suggest a new algorithm for the
stochastic version, which builds on the framework of regret minimization in
repeated games and admits a substantially simpler analysis compared to prior
work. We then analyze this algorithm for the adversarial version and use it as
a subroutine to solve the latter.Comment: Extended abstract appeared in FOCS 201
Clustered Linear Contextual Bandits with Knapsacks
In this work, we study clustered contextual bandits where rewards and
resource consumption are the outcomes of cluster-specific linear models. The
arms are divided in clusters, with the cluster memberships being unknown to an
algorithm. Pulling an arm in a time period results in a reward and in
consumption for each one of multiple resources, and with the total consumption
of any resource exceeding a constraint implying the termination of the
algorithm. Thus, maximizing the total reward requires learning not only models
about the reward and the resource consumption, but also cluster memberships. We
provide an algorithm that achieves regret sublinear in the number of time
periods, without requiring access to all of the arms. In particular, we show
that it suffices to perform clustering only once to a randomly selected subset
of the arms. To achieve this result, we provide a sophisticated combination of
techniques from the literature of econometrics and of bandits with constraints