Adversarial Bandits with Knapsacks
We consider Bandits with Knapsacks (henceforth, BwK), a general model for
multi-armed bandits under supply/budget constraints. In particular, a bandit
algorithm needs to solve a well-known knapsack problem: find an optimal packing
of items into a limited-size knapsack. The BwK problem is a common
generalization of numerous motivating examples, which range from dynamic
pricing to repeated auctions to dynamic ad allocation to network routing and
scheduling. While the prior work on BwK focused on the stochastic version, we
pioneer the other extreme in which the outcomes can be chosen adversarially.
This is a considerably harder problem, compared to both the stochastic version
and the "classic" adversarial bandits, in that regret minimization is no longer
feasible. Instead, the objective is to minimize the competitive ratio: the
ratio of the benchmark reward to the algorithm's reward.
We design an algorithm with competitive ratio O(log T) relative to the best
fixed distribution over actions, where T is the time horizon; we also prove a
matching lower bound. The key conceptual contribution is a new perspective on
the stochastic version of the problem. We suggest a new algorithm for the
stochastic version, which builds on the framework of regret minimization in
repeated games and admits a substantially simpler analysis compared to prior
work. We then analyze this algorithm for the adversarial version and use it as
a subroutine to solve the latter.
Comment: Extended abstract appeared in FOCS 201
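As a rough restatement of the objective and guarantee described in this abstract: the symbols OPT_FD (total reward of the best fixed distribution over actions) and REW(ALG) (total reward collected by the algorithm) are notation assumed here for illustration, and constants and lower-order terms are omitted, so the paper's precise statement may differ.

```latex
\[
  \frac{\mathrm{OPT}_{\mathrm{FD}}}{\mathbb{E}\left[\mathrm{REW}(\mathrm{ALG})\right]} \;\le\; O(\log T)
\]
```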
Approximately Stationary Bandits with Knapsacks
Bandits with Knapsacks (BwK), the generalization of the Bandits problem under
global budget constraints, has received a lot of attention in recent years.
Previous work has focused on one of the two extremes: Stochastic BwK where the
rewards and consumptions of the resources of each round are sampled from an
i.i.d. distribution, and Adversarial BwK where these parameters are picked by
an adversary. Achievable guarantees in the two cases exhibit a massive gap:
No-regret learning is achievable in the stochastic case, but in the adversarial
case only competitive ratio style guarantees are achievable, where the
competitive ratio depends either on the budget or on both the time and the
number of resources. What makes this gap so vast is that in Adversarial BwK the
guarantees get worse in the typical case when the budget is more binding. While
"best-of-both-worlds" type algorithms are known (single algorithms that
provide the best achievable guarantee in each extreme case), their bounds
degrade to the adversarial case as soon as the environment is not fully
stochastic.
Our work aims to bridge this gap, offering guarantees for a workload that is
not exactly stochastic but is also not worst-case. We define a condition,
Approximately Stationary BwK, that parameterizes how close to stochastic or
adversarial an instance is. Based on these parameters, we explore what is the
best competitive ratio attainable in BwK. We explore two algorithms that are
oblivious to the values of the parameters but guarantee competitive ratios that
smoothly transition between the best possible guarantees in the two extreme
cases, depending on the values of the parameters. Our guarantees offer a
substantial improvement over the adversarial guarantee, especially when the
available budget is small. We also prove bounds on the achievable guarantee,
showing that our results are approximately tight when the budget is small.
Clustered Linear Contextual Bandits with Knapsacks
In this work, we study clustered contextual bandits where rewards and
resource consumption are the outcomes of cluster-specific linear models. The
arms are divided into clusters, with the cluster memberships unknown to the
algorithm. Pulling an arm in a time period yields a reward and consumes some
amount of each of multiple resources; the algorithm terminates once the total
consumption of any resource exceeds its constraint. Thus, maximizing the total
reward requires learning not only models of the reward and the resource
consumption, but also cluster memberships. We
provide an algorithm that achieves regret sublinear in the number of time
periods, without requiring access to all of the arms. In particular, we show
that it suffices to perform clustering only once, on a randomly selected subset
of the arms. To achieve this result, we provide a sophisticated combination of
techniques from the literature on econometrics and on bandits with constraints.
High-dimensional Linear Bandits with Knapsacks
We study the contextual bandits with knapsacks (CBwK) problem in the
high-dimensional setting, where the feature dimension is large. The reward of
pulling each arm equals the inner product of a sparse high-dimensional weight
vector and the feature vector of the current arrival, plus random noise. In
this paper, we investigate how to exploit this
sparsity structure to achieve improved regret for the CBwK problem. To this
end, we first develop an online variant of the hard thresholding algorithm that
performs the sparse estimation in an online manner. We further combine our
online estimator with a primal-dual framework, where we assign a dual variable
to each knapsack constraint and utilize an online learning algorithm to update
the dual variable, thereby controlling the consumption of the knapsack
capacity. We show that this integrated approach allows us to achieve a
sublinear regret that depends logarithmically on the feature dimension, thus
improving the polynomial dependency established in the previous literature. We
also apply our framework to the high-dimensional contextual bandit problem
without the knapsack constraint and achieve optimal regret in both the
data-poor regime and the data-rich regime. Finally, we conduct numerical
experiments to demonstrate the empirical performance of our algorithms in
the high-dimensional setting.
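The two ingredients named in this abstract, sparse estimation by hard thresholding and a dual variable per knapsack constraint, can be illustrated with a minimal sketch. This is a generic textbook-style illustration, not the paper's algorithm: the function names, the squared-loss gradient step, the step sizes `lr` and `eta`, and the cap `lam_max` are all assumptions introduced here.

```python
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w and zero out the rest."""
    w = np.asarray(w, dtype=float)
    if s >= w.size:
        return w.copy()
    out = np.zeros_like(w)
    if s <= 0:
        return out
    keep = np.argpartition(np.abs(w), -s)[-s:]  # indices of the s largest |w_i|
    out[keep] = w[keep]
    return out

def online_iht_step(w, x, y, s, lr):
    """One gradient step on the loss 0.5 * (x @ w - y)**2, then hard thresholding."""
    grad = (x @ w - y) * x
    return hard_threshold(w - lr * grad, s)

def dual_update(lam, consumption, budget_rate, eta, lam_max):
    """Projected gradient step: raise a resource's price when it is overspent."""
    lam = lam + eta * (consumption - budget_rate)
    return np.clip(lam, 0.0, lam_max)

def choose_arm(est_rewards, est_consumptions, lam):
    """Pick the arm maximizing estimated reward minus dual-priced consumption.

    est_rewards has shape (n_arms,), est_consumptions has shape
    (n_arms, n_resources), and lam has shape (n_resources,).
    """
    scores = est_rewards - est_consumptions @ lam
    return int(np.argmax(scores))
```

In a full loop under these assumptions, each round would call `choose_arm` with the current estimates, observe the reward and consumption, refine the sparse estimates with `online_iht_step`, and adjust the prices with `dual_update`, where `budget_rate` stands for the per-round budget (total budget divided by the horizon).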