Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms
In this paper, we address the stochastic contextual linear bandit problem,
where a decision maker is provided a context (a random set of actions drawn
from a distribution). The expected reward of each action is specified by the
inner product of the action and an unknown parameter. The goal is to design an
algorithm that learns to play as close as possible to the unknown optimal
policy after a number of action plays. This problem is considered more
challenging than the linear bandit problem, which can be viewed as a contextual
bandit problem with a \emph{fixed} context. Surprisingly, in this paper, we
show that the stochastic contextual problem can be solved as if it is a linear
bandit problem. In particular, we establish a novel reduction framework that
converts every stochastic contextual linear bandit instance to a linear bandit
instance, when the context distribution is known. When the context distribution
is unknown, we establish an algorithm that reduces the stochastic contextual
instance to a sequence of linear bandit instances with small misspecifications
and achieves nearly the same worst-case regret bound as the algorithm that
solves the misspecified linear bandit instances.
As a consequence, our results imply a high-probability
regret bound for contextual linear bandits, making progress toward resolving an
open problem posed in Li et al. (2019) and Li et al. (2021).
Our reduction framework opens up a new way to approach stochastic contextual
linear bandit problems, and enables improved regret bounds in a number of
instances including the batch setting, contextual bandits with
misspecifications, contextual bandits with sparse unknown parameters, and
contextual bandits with adversarial corruption.
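To make the reduction target concrete, the sketch below shows a generic optimistic linear bandit algorithm (in the style of LinUCB/OFUL) for the fixed-context setting described above, where each action's expected reward is the inner product of the action vector and an unknown parameter. This is a standard illustration of the linear bandit primitive, not the paper's reduction; the action set, parameter, and confidence width `beta` are hypothetical choices.

```python
import numpy as np

def linucb(actions, theta_star, T, lam=1.0, beta=2.0, noise=0.1, seed=0):
    """Generic optimistic linear bandit sketch: each round, play the action
    maximizing <a, theta_hat> + beta * ||a||_{V^{-1}} (estimate plus an
    exploration bonus from the regularized least-squares confidence set)."""
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    V = lam * np.eye(d)            # regularized Gram matrix
    b = np.zeros(d)                # running sum of a_t * r_t
    best = np.max(actions @ theta_star)
    regret = 0.0
    for _ in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b      # ridge-regression estimate of theta_star
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
        a = actions[int(np.argmax(actions @ theta_hat + beta * bonus))]
        r = a @ theta_star + noise * rng.standard_normal()  # noisy linear reward
        V += np.outer(a, a)
        b += a * r
        regret += best - a @ theta_star
    return regret
```

The reduction framework above lets an algorithm of this fixed-action-set form be applied to stochastic contextual instances once the context distribution is known (or approximated).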
Optimal No-regret Learning in Repeated First-price Auctions
We study online learning in repeated first-price auctions with censored
feedback, where a bidder, only observing the winning bid at the end of each
auction, learns to adaptively bid in order to maximize her cumulative payoff.
To achieve this goal, the bidder faces a challenging dilemma: if she wins the
bid--the only way to achieve positive payoffs--then she is not able to observe
the highest bid of the other bidders, which we assume is drawn i.i.d. from an
unknown distribution. This dilemma, despite being reminiscent of the
exploration-exploitation trade-off in contextual bandits, cannot be directly
addressed by the existing UCB or Thompson sampling algorithms in that
literature, mainly because, contrary to the standard bandit setting, nothing
about the environment can be learned here when a positive reward is obtained.
In this paper, by exploiting the structural properties of first-price
auctions, we develop the first learning algorithm that achieves the optimal
regret bound when the bidder's private values are
stochastically generated. We do so by providing an algorithm on a general class
of problems, which we call monotone group contextual bandits, where the same
regret bound is established under stochastically generated contexts. Further,
by a novel lower bound argument, we characterize a lower
bound for the case where the contexts are adversarially generated, thus
highlighting the impact of the contexts generation mechanism on the fundamental
learning limit. Despite this, we further exploit the structure of first-price
auctions and develop a learning algorithm that operates sample-efficiently (and
computationally efficiently) in the presence of adversarially generated private
values. We establish a matching regret bound for this algorithm,
hence providing a complete characterization of optimal learning guarantees for
this problem.
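The setup above can be illustrated with a naive baseline: treat each bid on a discretized grid as an arm and run vanilla UCB, with payoff (v - b) when the bid b exceeds the highest competing bid and 0 otherwise. This is only a generic multi-armed-bandit baseline, not the paper's algorithm (which exploits monotone structure across bids); the uniform competing-bid distribution, grid, and horizon are hypothetical stand-ins for simulation.

```python
import numpy as np

def ucb_bidding(v, bid_grid, T, seed=0):
    """Naive UCB baseline for repeated first-price auctions with a fixed
    private value v: each grid bid is an arm; realized payoff is
    (v - b) on a win, 0 on a loss.  A Uniform(0, 1) draw stands in for
    the unknown distribution of the highest competing bid."""
    rng = np.random.default_rng(seed)
    K = len(bid_grid)
    counts = np.zeros(K)
    means = np.zeros(K)
    total = 0.0
    for t in range(1, T + 1):
        m = rng.uniform(0.0, 1.0)        # highest competing bid (i.i.d.)
        if t <= K:
            k = t - 1                    # play each arm once to initialize
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            k = int(np.argmax(means + bonus))
        b = bid_grid[k]
        r = (v - b) if b >= m else 0.0   # win iff we outbid m
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]  # incremental mean update
        total += r
    return total
```

Note that this baseline ignores the monotone coupling between bids (a win at bid b implies a win at any higher bid), which is precisely the structure the monotone group contextual bandit formulation above is designed to exploit.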