Efficient Optimal Learning for Contextual Bandits
We address the problem of learning in an online setting where the learner
repeatedly observes features, selects among a set of actions, and receives
reward for the action taken. We provide the first efficient algorithm with an
optimal regret. Our algorithm uses a cost sensitive classification learner as
an oracle and has a running time of polylog(N), where N is the number
of classification rules among which the oracle might choose. This is
exponentially faster than all previous algorithms that achieve optimal regret
in this setting. Our formulation also enables us to create an algorithm with
regret that is additive in feedback delay, rather than multiplicative as in all
previous work.
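The observe-act-reward protocol described in this abstract can be sketched as a small simulation. This is only an illustration of the problem setting and of regret against the best rule in a finite class, not the paper's algorithm: the names `simulate`, `uniform_learner`, `oracle_learner`, and the toy reward are all hypothetical, and the learner shown is a uniform-random placeholder.

```python
import random

def simulate(policies, n_actions, rounds, reward, draw_context, learner, seed=0):
    """Run the contextual bandit loop and return regret against the best policy.

    All names here are hypothetical; this sketches the protocol only.
    """
    rng = random.Random(seed)
    learner_total = 0.0
    policy_totals = [0.0] * len(policies)
    for _ in range(rounds):
        x = draw_context(rng)            # learner observes features
        a = learner(x, n_actions, rng)   # learner selects an action
        learner_total += reward(x, a)    # only the chosen action's reward is revealed
        # Simulator-only bookkeeping: what each fixed policy would have earned.
        for i, pi in enumerate(policies):
            policy_totals[i] += reward(x, pi(x))
    return max(policy_totals) - learner_total

def toy_reward(x, a):
    # Assumed toy environment: reward 1 when the action matches the context.
    return 1.0 if a == x else 0.0

def draw_binary_context(rng):
    return rng.randrange(2)

def uniform_learner(x, n_actions, rng):
    # Placeholder learner: ignores the context and explores uniformly.
    return rng.randrange(n_actions)

def oracle_learner(x, n_actions, rng):
    # Plays the best rule in the class below, so its regret is zero.
    return x

# A tiny class of N = 3 classification rules (context -> action).
policies = [lambda x: 0, lambda x: 1, lambda x: x]

regret_uniform = simulate(policies, 2, 1000, toy_reward, draw_binary_context, uniform_learner)
regret_oracle = simulate(policies, 2, 1000, toy_reward, draw_binary_context, oracle_learner)
```

In this toy setup the rule that copies the context earns reward on every round, so the uniform learner's regret grows with the number of rounds while the oracle learner's stays at zero.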
Contextual Bandit Learning with Predictable Rewards
Contextual bandit learning is a reinforcement learning problem where the
learner repeatedly receives a set of features (context), takes an action and
receives a reward based on the action and context. We consider this problem
under a realizability assumption: there exists a function in a (known) function
class, always capable of predicting the expected reward, given the action and
context. Under this assumption, we show three things. We present a new
algorithm, Regressor Elimination, with regret similar to that of the agnostic
setting (i.e., in the absence of the realizability assumption). We prove a new lower
bound showing no algorithm can achieve superior performance in the worst case
even with the realizability assumption. However, we do show that for any set of
policies (mapping contexts to actions), there is a distribution over rewards
(given context) such that our new algorithm has constant regret, unlike the
previous approaches.
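The realizability assumption in this abstract can be written out as follows. The notation is an assumption made for illustration and may differ from the paper's: $\mathcal{F}$ is the known function class, $x$ a context, $a$ an action, $r$ the reward, and $\pi_f$ the policy that acts greedily with respect to $f$.

```latex
% Realizability: some f^* in the known class predicts the expected reward.
\exists f^* \in \mathcal{F} : \quad
  \mathbb{E}[\, r \mid x, a \,] = f^*(x, a) \quad \text{for all } x, a.

% Greedy policy induced by a regressor f.
\pi_f(x) = \arg\max_{a} f(x, a)

% Expected regret over T rounds, measured against the policy induced by f^*.
\mathrm{Regret}_T = \sum_{t=1}^{T}
  \Bigl( f^*\bigl(x_t, \pi_{f^*}(x_t)\bigr) - f^*(x_t, a_t) \Bigr)
```

Under this reading, the agnostic setting drops the first line and compares only against the best policy in a given class.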