Efficient Optimal Learning for Contextual Bandits

Authors: Dudik, Miroslav; Hsu, Daniel; Kale, Satyen; Karampatziakis, Nikos; Langford, John; Reyzin, Lev; Zhang, Tong
Publication date: 01/01/2011

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.
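The interaction protocol described in this abstract (observe features, choose an action, see the reward of that action only) can be illustrated with a minimal sketch. This is not the paper's algorithm: it is plain epsilon-greedy over a finite policy class with inverse-propensity-scored reward estimates, and it replaces the cost-sensitive classification oracle with a brute-force scan that costs time linear in the number of policies per round. The names `run_epsilon_greedy`, `policies`, `reward_fn`, and `epsilon` are assumptions made for illustration.

```python
import random

def run_epsilon_greedy(policies, contexts, reward_fn, num_actions, epsilon=0.1, seed=0):
    """Simulate the contextual bandit protocol with a finite policy class.

    policies:  list of functions context -> action (the N classification rules).
    reward_fn: hypothetical environment, (context, action) -> reward in [0, 1].
    Returns (total_reward, ips), where ips[i] is the inverse-propensity-scored
    cumulative reward estimate for policies[i].
    """
    rng = random.Random(seed)
    ips = [0.0] * len(policies)
    total = 0.0
    for x in contexts:
        # Brute-force stand-in for the oracle: best policy under current estimates.
        best = max(range(len(policies)), key=lambda i: ips[i])
        greedy = policies[best](x)
        # Explore uniformly with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            a = rng.randrange(num_actions)
        else:
            a = greedy
        # Probability with which the chosen action was selected this round.
        p = epsilon / num_actions + (1.0 - epsilon) * (1.0 if a == greedy else 0.0)
        r = reward_fn(x, a)  # bandit feedback: only this action's reward is seen
        total += r
        # Credit every policy that would have chosen the same action.
        for i, pi in enumerate(policies):
            if pi(x) == a:
                ips[i] += r / p
    return total, ips
```

The brute-force `max` over policies is the step an oracle-based method would avoid; the sketch only shows what information the learner has available in each round.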

arXiv.org e-Print Archive

CiteSeerX

Contextual Bandit Learning with Predictable Rewards

Authors: Agarwal, Alekh; Dudík, Miroslav; Kale, Satyen; Langford, John; Schapire, Robert E.
Publication date: 01/01/2012

Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action, and receives a reward based on the action and context. We consider this problem under a realizability assumption: there exists a function in a (known) function class that is always capable of predicting the expected reward, given the action and context. Under this assumption, we show three things. We present a new algorithm, Regressor Elimination, with a regret similar to that of the agnostic setting (i.e., in the absence of the realizability assumption). We prove a new lower bound showing that no algorithm can achieve superior performance in the worst case, even with the realizability assumption. However, we do show that for any set of policies (mapping contexts to actions), there is a distribution over rewards (given context) such that our new algorithm has constant regret, unlike the previous approaches.
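To make the realizability assumption concrete, here is a rough sketch of learning with a finite class of reward predictors, one of which is assumed to give the true expected reward. It is a deliberately simplified elimination scheme and should not be read as the Regressor Elimination algorithm from the paper: it plays uniformly among the actions favoured by surviving predictors and discards any predictor whose cumulative squared error exceeds the best one by a fixed `slack`. The function name, the `slack` threshold, and the uniform action choice are all assumptions for illustration; with noisy rewards a fixed threshold like this could discard the true predictor.

```python
import random

def simplified_elimination(regressors, contexts, reward_fn, num_actions, slack=5.0, seed=0):
    """Toy elimination over a finite class of reward predictors.

    regressors: list of functions (context, action) -> predicted expected reward.
    reward_fn:  hypothetical environment, (context, action) -> observed reward.
    Returns (total_reward, alive), where alive is the set of surviving indices.
    """
    rng = random.Random(seed)
    alive = set(range(len(regressors)))
    loss = [0.0] * len(regressors)  # cumulative squared prediction error
    total = 0.0
    for x in contexts:
        # Actions that at least one surviving predictor considers best.
        candidates = sorted({
            max(range(num_actions), key=lambda a: regressors[i](x, a))
            for i in alive
        })
        a = rng.choice(candidates)
        r = reward_fn(x, a)
        total += r
        # Score every surviving predictor on the observed (context, action, reward).
        for i in alive:
            loss[i] += (regressors[i](x, a) - r) ** 2
        # Keep predictors whose error is within `slack` of the current best.
        best = min(loss[i] for i in alive)
        alive = {i for i in alive if loss[i] <= best + slack}
    return total, alive
```

In a noiseless toy environment where one predictor matches the reward exactly, that predictor has zero loss and is never discarded, while a predictor that is consistently wrong is removed after a handful of rounds.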

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository