Misspecified Linear Bandits
We consider the problem of online learning in misspecified linear stochastic
multi-armed bandit problems. Regret guarantees for state-of-the-art linear
bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit
(OFUL) hold under the assumption that the arms' expected rewards are perfectly
linear in their features. It is, however, of interest to investigate the impact
of potential misspecification in linear bandit models, where the expected
rewards are perturbed away from the linear subspace determined by the arms'
features. Although OFUL has recently been shown to be robust to relatively
small deviations from linearity, we show that any linear bandit algorithm that
enjoys optimal regret performance in the perfectly linear setting (e.g., OFUL)
must suffer linear regret under a sparse additive perturbation of the linear
model. In an attempt to overcome this negative result, we define a natural
class of bandit models characterized by a non-sparse deviation from linearity.
We argue that the OFUL algorithm can fail to achieve sublinear regret even
under models that have a non-sparse deviation. Finally, we develop a novel bandit
algorithm, comprising a hypothesis test for linearity followed by a decision to
use either the OFUL or Upper Confidence Bound (UCB) algorithm. For perfectly
linear bandit models, the algorithm provably exhibits OFUL's favorable regret
performance, while for misspecified models satisfying the non-sparse deviation
property, the algorithm avoids the linear regret phenomenon and falls back on
UCB's sublinear regret scaling. Numerical experiments on synthetic data and on
recommendation data from the public Yahoo! Learning to Rank Challenge dataset,
empirically support our findings.
Comment: Thirty-First AAAI Conference on Artificial Intelligence, 2017
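
To make the two-phase idea concrete, here is a minimal sketch, assuming Gaussian rewards and illustrative confidence radii; the test statistic, thresholds, and phase lengths below are ours, not the paper's. A uniform exploration phase feeds a residual-based linearity check, after which the learner commits to an OFUL-style linear index or falls back on the classical per-arm UCB index.

```python
import numpy as np

rng = np.random.default_rng(0)

def linearity_test(features, means, counts, sigma=1.0, delta=0.05):
    """Heuristic check: do the empirical arm means admit a linear fit?"""
    # Least-squares fit of a parameter vector to the per-arm empirical means.
    theta, *_ = np.linalg.lstsq(features, means, rcond=None)
    residuals = np.abs(features @ theta - means)
    # Sub-Gaussian confidence width for each empirical mean (illustrative).
    tol = sigma * np.sqrt(2.0 * np.log(2.0 * len(counts) / delta) / counts)
    return bool(np.all(residuals <= tol))

def run(features, true_means, horizon=5000, explore_per_arm=50):
    K, d = features.shape
    # Phase 1: uniform exploration to feed the linearity test.
    counts = np.full(K, explore_per_arm, dtype=float)
    sums = np.array([rng.normal(true_means[a], 1.0, explore_per_arm).sum()
                     for a in range(K)])
    means = sums / counts
    use_linear = linearity_test(features, means, counts)
    # Phase 2: commit to a linear (OFUL-style) index if the test passed,
    # otherwise fall back on the standard per-arm UCB index.
    V, b, total = np.eye(d), np.zeros(d), 0.0
    for t in range(int(counts.sum()), horizon):
        if use_linear:
            theta_hat = np.linalg.solve(V, b)
            width = np.sqrt(np.einsum('ad,de,ae->a', features,
                                      np.linalg.inv(V), features))
            index = features @ theta_hat + 2.0 * width  # 2.0: illustrative radius
        else:
            index = means + np.sqrt(2.0 * np.log(t + 1) / counts)
        a = int(np.argmax(index))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1.0; sums[a] += r; means = sums / counts
        V += np.outer(features[a], features[a]); b += r * features[a]
        total += r
    return use_linear, total
```

If the model is perfectly linear, the test passes and the linear index shares information across arms; under a non-sparse deviation it fails and the per-arm index retains its sublinear regret scaling.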
Linear Bandits with Memory: from Rotting to Rising
Nonstationary phenomena, such as satiation effects in recommendation, are a
common feature of sequential decision-making problems. While these phenomena
have been mostly studied in the framework of bandits with finitely many arms,
in many practically relevant cases linear bandits provide a more effective
modeling choice. In this work, we introduce a general framework for the study
of nonstationary linear bandits, where current rewards are influenced by the
learner's past actions in a fixed-size window. In particular, our model
includes stationary linear bandits as a special case. After showing that the
best sequence of actions is NP-hard to compute in our model, we focus on cyclic
policies and prove a regret bound for a variant of the OFUL algorithm that
balances approximation and estimation errors. Our theoretical findings are
supported by experiments (which also include misspecified settings) where our
algorithm is seen to perform well against natural baselines.
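
One concrete instantiation of such a reward model (our illustration; the paper's framework is more general) is an environment in which the effective parameter is shifted along the average of the last m played feature vectors, so a negative modulation coefficient yields rotting rewards and a positive one yields rising rewards:

```python
import numpy as np

class LinearBanditWithMemory:
    """Toy nonstationary linear bandit: rewards depend on a memory window."""

    def __init__(self, theta, window=5, gamma=-0.1, noise=0.1, seed=0):
        self.theta = np.asarray(theta, float)  # unknown base parameter
        self.window = window                   # size m of the memory window
        self.gamma = gamma                     # <0: rotting, >0: rising (assumed form)
        self.noise = noise
        self.history = []                      # feature vectors of the last m plays
        self.rng = np.random.default_rng(seed)

    def pull(self, x):
        x = np.asarray(x, float)
        # Past plays shift the effective parameter along the average of the
        # recent actions; with an empty window this is a stationary linear bandit.
        drift = (np.mean(self.history, axis=0) if self.history
                 else np.zeros_like(self.theta))
        reward = x @ (self.theta + self.gamma * drift)
        self.history.append(x)
        self.history = self.history[-self.window:]
        return reward + self.rng.normal(0.0, self.noise)

env = LinearBanditWithMemory(theta=[1.0, 0.5], window=3, gamma=-0.2, noise=0.0)
print([round(env.pull([1.0, 0.0]), 3) for _ in range(4)])  # 1.0, 0.8, 0.8, 0.8
```

Repeatedly playing the same arm makes its reward decay toward a satiated level, which illustrates why the best sequence of actions (and hence the cyclic policies the paper analyzes) differs from always playing one arm.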
Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms
In this paper, we address the stochastic contextual linear bandit problem,
where a decision maker is provided a context (a random set of actions drawn
from a distribution). The expected reward of each action is specified by the
inner product of the action and an unknown parameter. The goal is to design an
algorithm that learns to play as close as possible to the unknown optimal
policy after a number of action plays. This problem is considered more
challenging than the linear bandit problem, which can be viewed as a contextual
bandit problem with a fixed context. Surprisingly, in this paper, we
show that the stochastic contextual problem can be solved as if it is a linear
bandit problem. In particular, we establish a novel reduction framework that
converts every stochastic contextual linear bandit instance to a linear bandit
instance, when the context distribution is known. When the context distribution
is unknown, we establish an algorithm that reduces the stochastic contextual
instance to a sequence of linear bandit instances with small misspecifications
and achieves nearly the same worst-case regret bound as the algorithm that
solves the misspecified linear bandit instances.
As a consequence, our results imply a high-probability regret bound for
contextual linear bandits, making progress toward resolving an open problem
posed in Li et al. (2019) and Li et al. (2021).
Our reduction framework opens up a new way to approach stochastic contextual
linear bandit problems, and enables improved regret bounds in a number of
instances including the batch setting, contextual bandits with
misspecifications, contextual bandits with sparse unknown parameters, and
contextual bandits with adversarial corruption.
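
The known-distribution direction of the reduction can be sketched in a few lines (a toy enumeration under our own naming, viable only for tiny instances): every deterministic policy picks one action per context, and its expected feature vector under the context distribution becomes a single arm of an ordinary linear bandit.

```python
import itertools
import numpy as np

def reduced_arm_set(contexts, probs):
    """Map deterministic policies to expected-feature arms.

    contexts: list of (K_i, d) arrays, the action features of each context;
    probs:    probability of each context (finite support assumed).
    """
    arms, policies = [], []
    for choice in itertools.product(*(range(len(c)) for c in contexts)):
        # One action index per context defines a deterministic policy; its
        # expected feature vector is the policy's "arm" in the reduced game.
        g = sum(p * ctx[a] for p, ctx, a in zip(probs, contexts, choice))
        arms.append(g)
        policies.append(choice)
    return np.array(arms), policies

# Toy usage: two contexts with two actions each; any linear bandit
# algorithm (e.g., OFUL) can now be run directly on `arms`.
ctxs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
        np.array([[1.0, 1.0], [0.5, 0.0]])]
arms, policies = reduced_arm_set(ctxs, probs=[0.7, 0.3])
```

When the distribution is unknown, the abstract's second construction amounts to replacing `probs` with empirical estimates, which is precisely what introduces the small misspecifications the reduced linear bandit instances must tolerate.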
Optimal Model Selection in Contextual Bandits with Many Classes via Offline Oracles
We study the problem of model selection for contextual bandits, in which the
algorithm must balance the bias-variance trade-off for model estimation while
also balancing the exploration-exploitation trade-off. In this paper, we
propose the first reduction of model selection in contextual bandits to offline
model selection oracles, allowing for flexible, general-purpose algorithms with
computational requirements no worse than those for model selection for
regression. Our main result is a new model selection guarantee for stochastic
contextual bandits. When one of the classes in our set is realizable, up to a
logarithmic dependency on the number of classes, our algorithm attains optimal
realizability-based regret bounds for that class under one of two conditions:
if the time horizon is large enough, or if an assumption that helps with
detecting misspecification holds. Hence our algorithm adapts to the complexity
of this unknown class. Even when this realizable class is known, we prove
improved regret guarantees in early rounds by relying on simpler model classes
for those rounds and hence further establish the importance of model selection
in contextual bandits.
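
The oracle-based pattern can be sketched as follows; this is a minimal sketch under assumed names (the paper's statistical test and slack terms are more refined). Each candidate class is fit by an offline regression oracle, and the learner adopts the simplest class whose loss is within a confidence slack of the best, promoting to richer classes only once misspecification shows.

```python
import numpy as np

def select_class(oracles, X, y, slack_fn):
    """Pick the simplest model class consistent with the logged data.

    oracles:  fit(X, y) -> predict callables, ordered simple -> complex;
    slack_fn: slack_fn(i, n) is the confidence slack granted to class i
              with n samples (a hypothetical knob, not the paper's term).
    """
    losses = []
    for fit in oracles:
        predict = fit(X, y)                        # one offline oracle call
        losses.append(np.mean((predict(X) - y) ** 2))
    best = min(losses)
    for i, loss in enumerate(losses):
        # Excess loss within slack: no evidence of misspecification yet,
        # so prefer the simpler (lower-variance) class.
        if loss - best <= slack_fn(i, len(y)):
            return i
    return len(oracles) - 1
```

In a bandit loop this check would be rerun as data accrues, matching the abstract's point that simpler classes can safely drive early rounds even when a richer realizable class is known.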