Clustered Linear Contextual Bandits with Knapsacks
In this work, we study clustered contextual bandits where rewards and
resource consumption are the outcomes of cluster-specific linear models. The
arms are divided into clusters, with the cluster memberships unknown to the
algorithm. Pulling an arm in a time period yields a reward and consumes some
amount of each of multiple resources; the algorithm terminates once the total
consumption of any resource exceeds its constraint. Thus, maximizing the total
reward requires learning not only models of the reward and the resource
consumption, but also the cluster memberships. We
provide an algorithm that achieves regret sublinear in the number of time
periods, without requiring access to all of the arms. In particular, we show
that it suffices to perform clustering only once, on a randomly selected subset
of the arms. To achieve this result, we provide a sophisticated combination of
techniques from the literatures on econometrics and on bandits with constraints.
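The stopping rule described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it has no clustering and no learning, and the arm is chosen uniformly at random as a placeholder policy. The function name `run_bwk_episode`, the deterministic toy rewards and costs, and the convention that a pull which would exceed a budget is not credited are all assumptions made for illustration.

```python
import random

def run_bwk_episode(rewards, costs, budgets, horizon, seed=0):
    """Simulate a knapsack-style stopping rule with a placeholder random policy.

    rewards[a]  : reward of arm a (hypothetical, deterministic toy values)
    costs[a][j] : consumption of resource j when arm a is pulled
    budgets[j]  : total budget of resource j
    Returns (total_reward, number_of_completed_pulls).
    """
    rng = random.Random(seed)
    spent = [0.0] * len(budgets)
    total_reward = 0.0
    for t in range(horizon):
        # Placeholder policy: uniform random arm (an assumption, not the paper's rule).
        arm = rng.randrange(len(rewards))
        new_spent = [s + c for s, c in zip(spent, costs[arm])]
        # Terminate as soon as any resource would exceed its budget.
        if any(s > b for s, b in zip(new_spent, budgets)):
            return total_reward, t
        spent = new_spent
        total_reward += rewards[arm]
    return total_reward, horizon

if __name__ == "__main__":
    # One arm, unit reward and unit cost, budget 3: the run stops after 3 pulls.
    print(run_bwk_episode([1.0], [[1.0]], [3.0], horizon=10))
```

A learning algorithm would replace the random arm choice with a rule that trades off estimated reward against estimated consumption; the termination check stays the same.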
Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression
We consider contextual bandits with linear constraints (CBwLC), a variant of
contextual bandits in which the algorithm consumes multiple resources subject
to linear constraints on total consumption. This problem generalizes contextual
bandits with knapsacks (CBwK), allowing for packing and covering constraints,
as well as positive and negative resource consumption. We provide the first
algorithm for CBwLC (or CBwK) that is based on regression oracles. The
algorithm is simple, computationally efficient, and admits vanishing regret. It
is statistically optimal for the variant of CBwK in which the algorithm must
stop once some constraint is violated. Further, we provide the first
vanishing-regret guarantees for CBwLC (or CBwK) that extend beyond the
stochastic environment. We side-step strong impossibility results from prior
work by identifying a weaker (and, arguably, fairer) benchmark to compare
against. Our algorithm builds on LagrangeBwK (Immorlica et al., FOCS 2019), a
Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, ICML
2020), a regression-based technique for contextual bandits. Our analysis
leverages the inherent modularity of both techniques.
Incorporating Behavioral Constraints in Online AI Systems
AI systems that learn through reward feedback about the actions they take are
increasingly deployed in domains that have significant impact on our daily
life. However, in many cases the online rewards should not be the only guiding
criteria, as there are additional constraints and/or priorities imposed by
regulations, values, preferences, or ethical principles. We detail a novel
online agent that learns a set of behavioral constraints by observation and
uses these learned constraints as a guide when making decisions in an online
setting while still being reactive to reward feedback. To define this agent, we
propose to adopt a novel extension to the classical contextual multi-armed
bandit setting and we provide a new algorithm called Behavior Constrained
Thompson Sampling (BCTS) that allows for online learning while obeying
exogenous constraints. Our agent learns a constrained policy that implements
the observed behavioral constraints demonstrated by a teacher agent, and then
uses this constrained policy to guide the reward-based online exploration and
exploitation. We characterize the upper bound on the expected regret of the
contextual bandit algorithm that underlies our agent and provide a case study
with real world data in two application domains. Our experiments show that the
designed agent is able to act within the set of behavior constraints without
significantly degrading its overall reward performance.
Comment: 9 pages, 6 figures
Adversarial Bandits with Knapsacks
We consider Bandits with Knapsacks (henceforth, BwK), a general model for
multi-armed bandits under supply/budget constraints. In particular, a bandit
algorithm needs to solve a well-known knapsack problem: find an optimal packing
of items into a limited-size knapsack. The BwK problem is a common
generalization of numerous motivating examples, which range from dynamic
pricing to repeated auctions to dynamic ad allocation to network routing and
scheduling. While the prior work on BwK focused on the stochastic version, we
pioneer the other extreme in which the outcomes can be chosen adversarially.
This is a considerably harder problem, compared to both the stochastic version
and the "classic" adversarial bandits, in that regret minimization is no longer
feasible. Instead, the objective is to minimize the competitive ratio: the
ratio of the benchmark reward to the algorithm's reward.
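The ratio described in the preceding sentence can be written out as follows. The symbols are assumptions introduced here for illustration, not notation taken from the paper: OPT denotes the total reward of the benchmark and REW the (random) total reward collected by the algorithm.

```latex
% Competitive ratio: benchmark reward divided by the algorithm's expected reward.
% OPT and REW are illustrative symbols, not the paper's notation.
\[
  \mathrm{CR} \;=\; \frac{\mathrm{OPT}}{\mathbb{E}[\mathrm{REW}]}
\]
```

A ratio of 1 would mean the algorithm matches the benchmark; larger values mean the algorithm collects a smaller fraction of the benchmark reward.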
We design an algorithm with competitive ratio O(log T) relative to the best
fixed distribution over actions, where T is the time horizon; we also prove a
matching lower bound. The key conceptual contribution is a new perspective on
the stochastic version of the problem. We suggest a new algorithm for the
stochastic version, which builds on the framework of regret minimization in
repeated games and admits a substantially simpler analysis compared to prior
work. We then analyze this algorithm for the adversarial version and use it as
a subroutine to solve the latter.
Comment: Extended abstract appeared in FOCS 201