Stochastic Contextual Bandits with Known Reward Functions
Many sequential decision-making problems in communication networks can be
modeled as contextual bandit problems, which are natural extensions of the
well-known multi-armed bandit problem. In contextual bandit problems, at each
time, an agent observes some side information or context, pulls one arm and
receives the reward for that arm. We consider a stochastic formulation where
the context-reward tuples are independently drawn from an unknown distribution
in each trial. Motivated by networking applications, we analyze a setting where
the reward is a known non-linear function of the context and the chosen arm's
current state. We first consider the case of discrete and finite context-spaces
and propose DCB($\epsilon$), an algorithm that we prove, through a careful
analysis, yields regret (cumulative reward gap compared to a distribution-aware
genie) scaling logarithmically in time and linearly in the number of arms that
are not optimal for any context, improving over existing algorithms where the
regret scales linearly in the total number of arms. We then study continuous
context-spaces with Lipschitz reward functions and propose CCB($\epsilon, \delta$),
an algorithm that uses DCB($\epsilon$) as a subroutine.
CCB($\epsilon, \delta$) reveals a novel regret-storage trade-off that is
parametrized by $\delta$. Tuning $\delta$ to the time horizon allows us to
obtain sub-linear regret bounds, while requiring sub-linear storage. By
exploiting joint learning for all contexts we get regret bounds for
CCB($\epsilon, \delta$) that are unachievable by any existing contextual bandit
algorithm for continuous context-spaces. We also show similar performance
bounds for the unknown-horizon case.
Comment: A version of this technical report is under submission to IEEE/ACM Transactions on Networking.
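To make the joint-learning idea concrete, here is a minimal Python sketch under my own assumptions (the names, the index construction, and the toy reward function are illustrative, not the authors' implementation): because the reward function is known, a single running estimate of each arm's state serves every context, and each (context, arm) pair gets a UCB-style optimistic index.

```python
import math
import random

def dcb_style_choice(context, reward_fn, state_mean, pulls, t, lipschitz=1.0):
    """One round of a DCB-style index policy: since reward_fn is known,
    one per-arm state estimate serves every context, and each
    (context, arm) pair gets an optimistic index."""
    best_arm, best_index = None, -math.inf
    for arm, mean in state_mean.items():
        bonus = lipschitz * math.sqrt(2.0 * math.log(t + 2) / max(pulls[arm], 1))
        index = reward_fn(context, mean) + bonus
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm

# Toy run: 3 arms, binary contexts, a known non-linear reward function.
reward_fn = lambda c, s: math.tanh((c + 1) * s)       # known to the learner
state_mean = {a: 0.0 for a in range(3)}               # per-arm state estimates
pulls = {a: 0 for a in range(3)}
true_state = {0: 0.2, 1: 0.8, 2: 0.5}                 # unknown to the learner
for t in range(1000):
    c = random.randint(0, 1)
    a = dcb_style_choice(c, reward_fn, state_mean, pulls, t)
    s = true_state[a] + random.gauss(0.0, 0.1)        # observe the arm's state
    pulls[a] += 1
    state_mean[a] += (s - state_mean[a]) / pulls[a]   # running-mean update
```

Note how one pull of an arm improves that arm's index for every context at once, which is the source of the per-context regret savings the abstract describes.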
Policy Gradients for Contextual Recommendations
Decision making is a challenging task in online recommender systems. The
decision maker often needs to choose a contextual item at each step from a set
of candidates. Contextual bandit algorithms have been successfully deployed in
such applications because they trade off exploration against exploitation and
achieve state-of-the-art performance on minimizing online costs. However, the
applicability of existing contextual bandit methods is limited by the
over-simplified assumptions of the problem, such as assuming a simple form of
the reward function or assuming a static environment where the states are not
affected by previous actions. In this work, we put forward Policy Gradients for
Contextual Recommendations (PGCR) to solve the problem without those
unrealistic assumptions. It optimizes over a restricted class of policies where
the marginal probability of choosing an item (in expectation over the other items)
has a simple closed form, and the gradient of the expected return over the
policy in this class is in a succinct form. Moreover, PGCR leverages two useful
heuristic techniques called Time-Dependent Greed and Actor-Dropout. The former
empirically ensures that PGCR becomes greedy in the limit, and the latter addresses
the trade-off between exploration and exploitation by using the policy network
with Dropout as a Bayesian approximation. PGCR can solve the standard
contextual bandit problem as well as its Markov Decision Process generalization.
It can therefore be applied to a wide range of realistic recommendation
settings, such as personalized advertising. We evaluate PGCR on toy
datasets as well as a real-world dataset of personalized music recommendations.
Experiments show that PGCR enables fast convergence and low regret, and
outperforms both classic contextual-bandit and vanilla policy gradient
methods.
Comment: Accepted at WWW-2019.
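The two heuristics can be illustrated with a small numpy sketch. This is my simplification, not the paper's method: the dropout rate, the temperature schedule, and the plain REINFORCE update below are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose(theta, items, t, drop_p=0.2):
    """Score candidates with a dropped-out copy of the policy weights
    (Actor-Dropout-style exploration) and sample from a softmax whose
    temperature decays over time (Time-Dependent-Greed-style schedule)."""
    mask = rng.random(theta.shape) > drop_p           # random dropout mask
    scores = items @ (theta * mask) / (1.0 - drop_p)  # rescaled dropped scores
    temp = max(0.05, 1.0 / np.sqrt(t + 1))            # greedier as t grows
    z = scores / temp
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(items), p=p)), p

def reinforce_update(theta, items, action, p, reward, lr=0.05):
    """One REINFORCE step with the plain softmax score function
    grad log pi(a|x) = x_a - sum_b p_b x_b, used here as an approximation
    (it ignores the dropout mask and the temperature for simplicity)."""
    return theta + lr * reward * (items[action] - p @ items)

theta = np.zeros(4)
for t in range(2000):
    items = rng.normal(size=(10, 4))                  # candidate item features
    a, p = choose(theta, items, t)
    r = float(items[a] @ np.array([1.0, -0.5, 0.3, 0.0])) + rng.normal(0, 0.1)
    theta = reinforce_update(theta, items, a, p, r)
```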
Multi-Objective Generalized Linear Bandits
In this paper, we study the multi-objective bandits (MOB) problem, where a
learner repeatedly selects one arm to play and then receives a reward vector
consisting of multiple objectives. MOB has found many real-world applications
as varied as online recommendation and network routing. On the other hand,
these applications typically contain contextual information that can guide the
learning process, which, however, is ignored by most existing work. To
utilize this information, we associate each arm with a context vector and
assume the reward follows the generalized linear model (GLM). We adopt the
notion of Pareto regret to evaluate the learner's performance and develop a
novel algorithm for minimizing it. The essential idea is to apply a variant of
the online Newton step to estimate model parameters, based on which we utilize
the upper confidence bound (UCB) policy to construct an approximation of the
Pareto front, and then uniformly at random choose one arm from the approximate
Pareto front. Theoretical analysis shows that the proposed algorithm achieves
an $\tilde{O}(d\sqrt{T})$ Pareto regret, where $T$ is the time horizon and $d$
is the dimension of contexts, which matches the optimal result for the
single-objective contextual bandit problem. Numerical experiments demonstrate
the effectiveness of our method.
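As an illustration of the final selection step described above, here is a sketch of choosing uniformly at random from an approximate Pareto front built from optimistic estimates; the GLM fitting via online Newton steps is replaced by stand-in UCB values.

```python
import numpy as np

def approximate_pareto_front(ucb):
    """Indices of arms whose optimistic (UCB) reward vectors are not
    dominated: arm i is dropped only if some arm j is at least as good
    on every objective and strictly better on at least one."""
    K = ucb.shape[0]
    return [i for i in range(K)
            if not any(np.all(ucb[j] >= ucb[i]) and np.any(ucb[j] > ucb[i])
                       for j in range(K) if j != i)]

rng = np.random.default_rng(1)
ucb = rng.random((8, 3))      # stand-in for the GLM-based optimistic estimates
arm = int(rng.choice(approximate_pareto_front(ucb)))  # uniform over the front
```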
A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit
Adaptive and sequential experiment design is a well-studied area in numerous
domains. We survey and synthesize the work on the online statistical learning
paradigm referred to as multi-armed bandits, integrating the existing research
as a resource for a certain class of online experiments. We first explore the
traditional stochastic model of a multi-armed bandit, then explore a taxonomic
scheme of complications to that model, relating each complication to a
specific requirement or consideration of the experiment-design context.
Finally, we present a table of known upper bounds on regret for all studied
algorithms, providing both perspective for future theoretical work and a
decision-making tool for practitioners looking for theoretical guarantees.
Comment: 49 pages, 1 figure.
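For reference, the traditional stochastic model the survey starts from is usually attacked with the classic UCB1 index policy, shown here as a textbook Python implementation (not tied to any particular paper in this list).

```python
import math
import random

# UCB1 for the traditional stochastic multi-armed bandit: pull each arm
# once, then always pull the arm maximizing mean_k + sqrt(2 ln t / n_k).
K = 5
true_means = [0.2, 0.4, 0.5, 0.7, 0.6]            # unknown to the learner
counts, means = [0] * K, [0.0] * K
for t in range(1, 5001):
    if t <= K:
        arm = t - 1                                # initialization pass
    else:
        arm = max(range(K), key=lambda k: means[k]
                  + math.sqrt(2.0 * math.log(t) / counts[k]))
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]   # running mean
```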
Adapting multi-armed bandits policies to contextual bandits scenarios
This work explores adaptations of successful multi-armed bandits policies to
the online contextual bandits scenario with binary rewards using binary
classification algorithms such as logistic regression as black-box oracles.
Some of these adaptations are achieved through bootstrapping or approximate
bootstrapping, while others rely on other forms of randomness, resulting in
more scalable approaches than previous works, and the ability to work with any
type of classification algorithm. In particular, the Adaptive-Greedy algorithm
shows a lot of promise, in many cases achieving better performance than upper
confidence bound and Thompson sampling strategies, at the expense of more
hyperparameters to tune.
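A sketch of the bootstrapping idea with scikit-learn's LogisticRegression as the black-box oracle; the function name and the one-resample-per-decision shortcut are my own simplifications.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def bootstrapped_score(X_hist, y_hist, x_new):
    """Approximate-bootstrap score for one arm: refit the black-box
    classification oracle on a resample of the arm's own (context, reward)
    history, giving Thompson-sampling-like randomness with any classifier."""
    if len(set(y_hist)) < 2:
        return rng.random()                       # oracle needs both classes
    idx = rng.integers(0, len(y_hist), size=len(y_hist))  # bootstrap resample
    clf = LogisticRegression(max_iter=200)
    clf.fit(np.asarray(X_hist)[idx], np.asarray(y_hist)[idx])
    return float(clf.predict_proba(np.asarray(x_new).reshape(1, -1))[0, 1])

# Per round: score every arm on the current context and play the argmax.
```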
Semiparametric Contextual Bandits
This paper studies semiparametric contextual bandits, a generalization of the
linear stochastic bandit problem where the reward for an action is modeled as a
linear function of known action features confounded by a non-linear
action-independent term. We design new algorithms that achieve
$\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is
$d$-dimensional, which matches the best known bounds for the simpler
unconfounded case and improves on a recent result of Greenewald et al. (2017).
Via an empirical evaluation, we show that our algorithms outperform prior
approaches when there are non-linear confounding effects on the rewards.
Technically, our algorithms use a new reward estimator inspired by
doubly-robust approaches and our proofs require new concentration inequalities
for self-normalized martingales.
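Spelled out in symbols (notation mine, inferred from the description above): at round $t$ the learner observes action features $x_{t,a}$, picks action $a_t$, and receives

```latex
r_t \;=\; \langle x_{t,a_t},\, \theta^{\ast} \rangle \;+\; f_t \;+\; \eta_t
```

where $f_t$ is the arbitrary non-linear, action-independent confounder and $\eta_t$ is conditionally mean-zero noise. Because $f_t$ does not depend on $a_t$, it cancels in comparisons between actions, which is what makes the $d$-dimensional linear part learnable despite the confounding.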
Estimation Considerations in Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We study a
consideration for the exploration-exploitation framework that does not
arise in multi-armed bandits but is crucial in contextual bandits: the way
exploration and exploitation are conducted in the present affects the bias and
variance in the potential outcome model estimation in subsequent stages of
learning. We develop parametric and non-parametric contextual bandits that
integrate balancing methods from the causal inference literature into their
estimation to make it less prone to estimation bias. We provide the
first regret-bound analyses for contextual bandits with balancing in the domain
of linear contextual bandits, matching the state-of-the-art regret bounds. We
demonstrate the strong practical advantage of balanced contextual bandits on a
large number of supervised learning datasets and on a synthetic example that
simulates model mis-specification and prejudice in the initial training data.
Additionally, we develop contextual bandits with simpler assignment policies by
leveraging sparse model estimation methods from the econometrics literature and
demonstrate empirically that in the early stages they can improve the rate of
learning and decrease regret.
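One simple way to realize the balancing idea, sketched under my own assumptions (the paper's estimators are more refined than plain inverse-propensity weighting): re-weight each arm's observations by the inverse of the probability with which the past policy assigned them to that arm.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_balanced_arm_model(X, y, propensities, clip=0.05):
    """Fit one arm's outcome model with inverse-propensity weights:
    contexts the past policy rarely assigned to this arm are up-weighted,
    counteracting the bias that adaptive data collection introduces."""
    w = 1.0 / np.clip(propensities, clip, 1.0)   # clip to bound the variance
    return Ridge(alpha=1.0).fit(X, y, sample_weight=w)
```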
Nonparametric Stochastic Contextual Bandits
We analyze the $K$-armed bandit problem where the reward for each arm is a
noisy realization based on an observed context under mild nonparametric
assumptions. We attain tight results for top-arm identification and a
sublinear regret of $\tilde{O}(T^{\frac{1+D}{2+D}})$, where $D$ is the
context dimension, for a modified UCB algorithm that is simple to implement
($k$NN-UCB). We then give global intrinsic-dimension-dependent and
ambient-dimension-independent regret bounds. We also discuss recovering topological
structures within the context space based on expected bandit performance and
provide an extension to infinite-armed contextual bandits. Finally, we
experimentally show the improvement of our algorithm over existing multi-armed
bandit approaches for both simulated tasks and MNIST image classification.
Comment: AAAI 2018.
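A sketch of a $k$NN-UCB-style index for a single arm; the particular bonus term below is an illustrative choice of mine, not the paper's exact index.

```python
import numpy as np

def knn_ucb_index(X_hist, r_hist, x, k=10, beta=1.0):
    """Nonparametric UCB-style index for one arm at context x: a
    k-nearest-neighbor regression estimate plus an uncertainty bonus
    that grows with the distance to the neighbors used."""
    if len(r_hist) == 0:
        return np.inf                              # force initial exploration
    X, r = np.asarray(X_hist), np.asarray(r_hist)
    d = np.linalg.norm(X - x, axis=1)              # distances to past contexts
    nn = np.argsort(d)[:k]                         # k nearest observations
    return float(r[nn].mean() + beta * (d[nn].max() + 1.0 / np.sqrt(len(nn))))

# Per round: compute the index for every arm from its own history and
# pull the argmax.
```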
Contextual Bandits with Latent Confounders: An NMF Approach
Motivated by online recommendation and advertising systems, we consider a
causal model for stochastic contextual bandits with a latent low-dimensional
confounder. In our model, there are $L$ observed contexts and $K$ arms of the
bandit. The observed context influences the reward obtained through a latent
confounder variable with cardinality $m$ ($m \ll L, K$). The arm choice and the
latent confounder causally determine the reward, while the observed context is
correlated with the confounder. Under this model, the $L \times K$ mean reward
matrix $\mathbf{A}$ (for each context in $[L]$ and each arm in $[K]$)
factorizes into non-negative factors $\mathbf{W}$ ($L \times m$) and
$\mathbf{V}$ ($m \times K$). This insight enables us to propose an
$\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions
(selecting specific arms) that achieves a balance between learning this
low-dimensional structure and selecting the best arm to minimize regret. Our
algorithm achieves a regret of $\mathcal{O}(L\,\mathrm{poly}(m, \log K)\log T)$
at time $T$, as compared to $\mathcal{O}(LK\log T)$ for conventional
contextual bandits, assuming a constant gap between the best arm and the rest
for each context. These guarantees are obtained under mild sufficiency
conditions on the factors that are weaker versions of the well-known
Statistical RIP condition. We further propose a class of generative models that
satisfy our sufficient conditions, and derive a lower bound of
$\Omega(Km\log T)$. These are the first regret guarantees for
online matrix completion with bandit feedback, when the rank is greater than
one. We further compare the performance of our algorithm with the state of the
art, on synthetic and real-world data-sets.
Comment: 37 pages, 2 figures.
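A rough sketch of the exploit step via scikit-learn's NMF. This is illustrative only: exploration here is uniform, whereas the paper designs a specific sequence of arm interventions, and refitting the factorization every round is done purely for clarity.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_bandit_choose(R_hat, context, t, rank, eps0=1.0, rng=None):
    """Epsilon-greedy step over a low-rank completion of the empirical
    context-by-arm reward matrix R_hat (~ W @ V, non-negative factors)."""
    rng = rng or np.random.default_rng(t)
    if rng.random() < min(1.0, eps0 / np.sqrt(t + 1)):
        return int(rng.integers(R_hat.shape[1]))    # exploration round
    model = NMF(n_components=rank, init="random", random_state=0, max_iter=500)
    W = model.fit_transform(np.maximum(R_hat, 0.0))  # non-negative input
    V = model.components_
    return int(np.argmax((W @ V)[context]))          # best arm under completion
```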
Fairness in Learning: Classic and Contextual Bandits
We introduce the study of fairness in multi-armed bandit problems. Our
fairness definition can be interpreted as demanding that given a pool of
applicants (say, for college admission or mortgages), a worse applicant is
never favored over a better one, despite a learning algorithm's uncertainty
over the true payoffs. We prove results of two types.
First, in the important special case of the classic stochastic bandits
problem (i.e., in which there are no contexts), we provide a provably fair
algorithm based on "chained" confidence intervals, and give a cumulative
regret bound with a cubic dependence on the number of arms. We further show
that any fair algorithm must have such a dependence. When combined with regret
bounds for standard non-fair algorithms such as UCB, this proves a strong
separation between fair and unfair learning, which extends to the general
contextual case.
In the general contextual case, we prove a tight connection between fairness
and the KWIK (Knows What It Knows) learning model: a KWIK algorithm for a class
of functions can be transformed into a provably fair contextual bandit
algorithm, and conversely any fair contextual bandit algorithm can be
transformed into a KWIK learning algorithm. This tight connection allows us to
provide a provably fair algorithm for the linear contextual bandit problem with
a polynomial dependence on the dimension, and to show (for a different class of
functions) a worst-case exponential gap in regret between fair and non-fair
learning algorithms.
Comment: A condensed version of this work appears in the 30th Annual
Conference on Neural Information Processing Systems (NIPS), 2016.
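One plausible reading of the chained-confidence-interval rule in the classic (context-free) case, sketched in Python; the linking rule and names are my reconstruction, not the paper's pseudocode.

```python
import numpy as np

def chained_fair_choice(lcb, ucb, rng):
    """Pick an arm uniformly from the set 'chained' to the top arm: walk
    arms in decreasing UCB order and keep linking while an arm's interval
    overlaps the chain built so far, so no arm is ever favored over one
    that could still plausibly be better."""
    order = np.argsort(-ucb)                   # arms by optimism
    chain = [int(order[0])]
    low = lcb[order[0]]                        # lowest LCB in the chain
    for j in order[1:]:
        if ucb[j] >= low:                      # interval overlaps the chain
            chain.append(int(j))
            low = min(low, lcb[j])
        else:
            break
    return int(rng.choice(chain))

rng = np.random.default_rng(0)
arm = chained_fair_choice(np.array([0.1, 0.3, 0.2]),
                          np.array([0.5, 0.6, 0.4]), rng)
```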