Adversarial Bandits with Knapsacks
We consider Bandits with Knapsacks (henceforth, BwK), a general model for
multi-armed bandits under supply/budget constraints. In particular, a bandit
algorithm needs to solve a well-known knapsack problem: find an optimal packing
of items into a limited-size knapsack. The BwK problem is a common
generalization of numerous motivating examples, which range from dynamic
pricing to repeated auctions to dynamic ad allocation to network routing and
scheduling. While the prior work on BwK focused on the stochastic version, we
pioneer the other extreme in which the outcomes can be chosen adversarially.
This is a considerably harder problem, compared to both the stochastic version
and the "classic" adversarial bandits, in that regret minimization is no longer
feasible. Instead, the objective is to minimize the competitive ratio: the
ratio of the benchmark reward to the algorithm's reward.
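In symbols, the objective can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the abstract: OPT denotes the reward of the benchmark, REW the total reward collected by the algorithm, and alpha the guaranteed ratio (smaller is better).

```latex
\[
  \frac{\mathrm{OPT}}{\mathbb{E}[\mathrm{REW}]} \;\le\; \alpha
\]
```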
We design an algorithm with competitive ratio O(log T) relative to the best
fixed distribution over actions, where T is the time horizon; we also prove a
matching lower bound. The key conceptual contribution is a new perspective on
the stochastic version of the problem. We suggest a new algorithm for the
stochastic version, which builds on the framework of regret minimization in
repeated games and admits a substantially simpler analysis compared to prior
work. We then analyze this algorithm for the adversarial version and use it as
a subroutine to solve the latter.
Comment: Extended abstract appeared in FOCS 201
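The repeated-game view mentioned in this abstract can be illustrated with a toy primal-dual loop. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a Lagrangian payoff of the form r + 1 - (T/B) * c_i, a single budgeted resource plus a dummy "time" resource consumed at rate B/T, Bernoulli rewards and costs, an EXP3-style learner over arms, and a Hedge learner over resources. The function names, learning rates, and stopping rule are all hypothetical choices.

```python
import math
import random


def softmax_weights(scores, eta):
    """Exponential-weights distribution over cumulative scores (numerically stable)."""
    top = max(scores)
    w = [math.exp(eta * (s - top)) for s in scores]
    total = sum(w)
    return [x / total for x in w]


def primal_dual_bwk(mean_reward, mean_cost, budget, horizon, seed=0):
    """Toy primal-dual loop; assumes budget <= horizon.

    Resource 0 is the budgeted resource. Resource 1 is a dummy "time"
    resource that every arm consumes at rate budget / horizon
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_arms = len(mean_reward)
    scale = horizon / budget
    lo, hi = 1.0 - scale, 2.0  # range of the assumed Lagrangian payoff
    eta_arm = math.sqrt(math.log(n_arms) / (n_arms * horizon))
    eta_res = math.sqrt(math.log(2) / horizon)
    arm_scores = [0.0] * n_arms
    res_scores = [0.0, 0.0]
    reward, spent, rounds = 0.0, 0.0, 0

    for _ in range(horizon):
        p = softmax_weights(arm_scores, eta_arm)  # primal player: arms
        q = softmax_weights(res_scores, eta_res)  # dual player: resources
        arm = rng.choices(range(n_arms), weights=p)[0]
        res = rng.choices(range(2), weights=q)[0]

        r = 1.0 if rng.random() < mean_reward[arm] else 0.0
        c = 1.0 if rng.random() < mean_cost[arm] else 0.0
        if spent + c > budget:
            break  # stop before the budget would be exceeded
        reward, spent, rounds = reward + r, spent + c, rounds + 1

        costs = [c, budget / horizon]
        # Lagrangian payoff per resource, rescaled to [0, 1].
        payoff = [(r + 1.0 - scale * ci - lo) / (hi - lo) for ci in costs]

        # Dual player sees every resource's payoff and tries to minimize it.
        for i in range(2):
            res_scores[i] -= payoff[i]
        # Primal player sees only the chosen arm: importance-weighted gain.
        arm_scores[arm] += payoff[res] / p[arm]

    return reward, spent, rounds
```

For example, `primal_dual_bwk([0.9, 0.5], [0.8, 0.2], budget=50, horizon=200)` returns the collected reward, the budget spent, and the number of rounds played. The design point is that the arm learner maximizes the Lagrangian payoff while the resource learner minimizes it, so neither needs to know the outcome distribution in advance.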
Approximately Stationary Bandits with Knapsacks
Bandits with Knapsacks (BwK), the generalization of the Bandits problem under
global budget constraints, has received a lot of attention in recent years.
Previous work has focused on one of the two extremes: Stochastic BwK where the
rewards and consumptions of the resources of each round are sampled from an
i.i.d. distribution, and Adversarial BwK where these parameters are picked by
an adversary. Achievable guarantees in the two cases exhibit a massive gap:
No-regret learning is achievable in the stochastic case, but in the adversarial
case only competitive ratio style guarantees are achievable, where the
competitive ratio depends either on the budget or on both the time and the
number of resources. What makes this gap so vast is that in Adversarial BwK the
guarantees get worse in the typical case when the budget is more binding. While
"best-of-both-worlds" type algorithms are known (single algorithms that
provide the best achievable guarantee in each extreme case), their bounds
degrade to the adversarial case as soon as the environment is not fully
stochastic.
Our work aims to bridge this gap, offering guarantees for a workload that is
not exactly stochastic but is also not worst-case. We define a condition,
Approximately Stationary BwK, that parameterizes how close to stochastic or
adversarial an instance is. Based on these parameters, we explore the best
competitive ratio attainable in BwK. We explore two algorithms that are
oblivious to the values of the parameters but guarantee competitive ratios that
smoothly transition between the best possible guarantees in the two extreme
cases, depending on the values of the parameters. Our guarantees offer great
improvement over the adversarial guarantee, especially when the available
budget is small. We also prove bounds on the achievable guarantee, showing that
our results are approximately tight when the budget is small.
Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression
We consider contextual bandits with linear constraints (CBwLC), a variant of
contextual bandits in which the algorithm consumes multiple resources subject
to linear constraints on total consumption. This problem generalizes contextual
bandits with knapsacks (CBwK), allowing for packing and covering constraints,
as well as positive and negative resource consumption. We provide the first
algorithm for CBwLC (or CBwK) that is based on regression oracles. The
algorithm is simple, computationally efficient, and admits vanishing regret. It
is statistically optimal for the variant of CBwK in which the algorithm must
stop once some constraint is violated. Further, we provide the first
vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the
stochastic environment. We side-step strong impossibility results from prior
work by identifying a weaker (and, arguably, fairer) benchmark to compare
against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a
Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML
2020), a regression-based technique for contextual bandits. Our analysis
leverages the inherent modularity of both techniques.
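How a Lagrangian adjustment might plug into a regression-based action rule can be sketched as follows. This is an illustrative sketch, not the paper's algorithm: it assumes that predicted rewards and predicted resource use per arm come from some regression oracle, that the dual multipliers are given, and that the Lagrangian score is "predicted reward minus dual-weighted predicted consumption". The inverse-gap-weighting rule is written in the form commonly associated with SquareCB; the function names and the parameter `gamma` are hypothetical.

```python
def lagrangian_scores(pred_reward, pred_use, duals):
    """Per-arm score: predicted reward minus dual-weighted predicted consumption.

    pred_use[a][j] is the predicted use of resource j by arm a;
    duals[j] is the current multiplier for resource j (assumed given).
    """
    return [
        r - sum(lam * u for lam, u in zip(duals, use))
        for r, use in zip(pred_reward, pred_use)
    ]


def inverse_gap_weighting(scores, gamma):
    """Turn per-arm scores into a sampling distribution.

    Arms with a larger gap to the best score get less probability;
    the best arm receives the remaining mass.
    """
    k = len(scores)
    best = max(range(k), key=lambda a: scores[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            probs[a] = 1.0 / (k + gamma * (scores[best] - scores[a]))
    probs[best] = 1.0 - sum(probs)
    return probs


# Three arms, one resource, dual multiplier 0.5 (all values are made up).
scores = lagrangian_scores([0.9, 0.6, 0.5], [[0.8], [0.1], [0.3]], [0.5])
probs = inverse_gap_weighting(scores, gamma=10.0)
```

In this made-up example the arm with the highest predicted reward is also the most resource-hungry, so after the Lagrangian adjustment the second arm has the best score and receives the largest probability. Larger values of `gamma` concentrate more probability on the best-scoring arm.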
Clustered Linear Contextual Bandits with Knapsacks
In this work, we study clustered contextual bandits where rewards and
resource consumption are the outcomes of cluster-specific linear models. The
arms are divided into clusters, and the cluster memberships are unknown to the
algorithm. Pulling an arm in a time period yields a reward and consumes some
amount of each of multiple resources; the algorithm terminates once the total
consumption of any resource exceeds its constraint. Thus, maximizing the total
reward requires learning not only models of the reward and the resource
consumption, but also the cluster memberships. We
provide an algorithm that achieves regret sublinear in the number of time
periods, without requiring access to all of the arms. In particular, we show
that it suffices to perform clustering only once, on a randomly selected subset
of the arms. To achieve this result, we provide a sophisticated combination of
techniques from the literature of econometrics and of bandits with constraints
- …