Stationary Mixing Bandits
We study the bandit problem where arms are associated with stationary
phi-mixing processes and where rewards are therefore dependent: the question
that arises from this setting is how to recover some independence by ignoring
the values of some rewards. As we shall see, the bandit problem we tackle
requires us to address an exploration/exploitation/independence trade-off. To
do so, we provide a UCB strategy together with a general regret analysis for
the case where the size of the independence blocks (the ignored rewards) is
fixed, and we go a step further by providing an algorithm that is able to
compute the size of the independence blocks from the data. Finally, we give an
analysis of our bandit problem in the restless case, i.e., the situation where
the time counters of all mixing processes evolve simultaneously.
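To make the block device concrete, here is a minimal Python sketch (my
illustration, not the paper's algorithm): a UCB1-style index is maintained
only over rewards that are block_size pulls apart, and the intervening rewards
are drawn and discarded, which is what restores approximate independence for a
phi-mixing process. The fixed block size, the AR(1) demo arms, and all
constants are assumptions.

```python
import math
import random

def block_ucb(arms, horizon, block_size):
    """UCB1 with fixed-size independence blocks (a sketch).

    arms: list of callables; arms[i]() advances arm i's mixing process
    by one step and returns the next reward in [0, 1]. Within each block
    of block_size pulls only the last reward is retained, so retained
    rewards are block_size steps apart and approximately independent.
    """
    counts = [0] * len(arms)   # retained rewards per arm
    means = [0.0] * len(arms)  # empirical mean of retained rewards
    blocks = 0
    while (blocks + 1) * block_size <= horizon:
        if 0 in counts:
            i = counts.index(0)  # play each arm once first
        else:
            i = max(range(len(arms)),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(blocks) / counts[a]))
        for _ in range(block_size):   # burn block_size - 1 rewards,
            reward = arms[i]()        # keep only the last one
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]
        blocks += 1
    return means

def make_ar1_arm(mu, rho=0.9):
    """A toy mixing arm: mean mu plus AR(1) noise (hypothetical)."""
    state = [0.0]
    def pull():
        state[0] = rho * state[0] + random.gauss(0.0, 0.05)
        return min(max(mu + state[0], 0.0), 1.0)
    return pull

print(block_ucb([make_ar1_arm(0.4), make_ar1_arm(0.6)],
                horizon=20_000, block_size=10))
```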
Regret Bounds for Reinforcement Learning with Policy Advice
In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action spaces. Our empirical simulations support our
theoretical analysis. This suggests that RLPA may offer significant advantages
in large domains where good prior policies are provided.
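As a rough illustration of the idea, and not the RLPA algorithm itself, the
sketch below treats each input policy as an arm of a bandit: each candidate is
executed for a fixed-length trial via a hypothetical run_policy helper, and a
UCB index over average per-step reward decides which policy to run next.

```python
import math

def choose_among_policies(policies, run_policy, horizon, trial_len=100):
    """UCB-style selection among input policies (a sketch, not RLPA).

    run_policy(pi, steps) is an assumed helper: it executes policy pi in
    the environment for `steps` steps and returns the mean per-step
    reward in [0, 1]. trial_len is a hypothetical constant.
    """
    counts = [0] * len(policies)
    means = [0.0] * len(policies)
    for t in range(horizon // trial_len):
        if 0 in counts:
            i = counts.index(0)            # try each policy once
        else:
            i = max(range(len(policies)),  # UCB over per-policy estimates
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = run_policy(policies[i], trial_len)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return policies[max(range(len(policies)), key=lambda a: means[a])]
```

Note that the loop only ever indexes the policy set, never the states or
actions, which mirrors the abstract's claim that both regret and computational
complexity are independent of the size of the state and action spaces.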
Time-Varying Gaussian Process Bandit Optimization
We consider the sequential Bayesian optimization problem with bandit
feedback, adopting a formulation that allows for the reward function to vary
with time. We model the reward function using a Gaussian process whose
evolution obeys a simple Markov model. We introduce two natural extensions of
the classical Gaussian process upper confidence bound (GP-UCB) algorithm. The
first, R-GP-UCB, resets GP-UCB at regular intervals. The second, TV-GP-UCB,
instead forgets about old data in a smooth fashion. Our main contribution
comprises novel regret bounds for these algorithms, providing an explicit
characterization of the trade-off between the time horizon and the rate at
which the function varies. We illustrate the performance of the algorithms on
both synthetic and real data, and we find the gradual forgetting of TV-GP-UCB
to perform favorably compared to the sharp resetting of R-GP-UCB. Moreover,
both algorithms significantly outperform classical GP-UCB, which treats stale
and fresh data equally.
Comment: To appear in AISTATS 201
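One simple way to realize the smooth forgetting the abstract describes,
sketched below as my own illustration rather than the authors' implementation,
is to damp the kernel between two observations by a factor that decays with
their separation in time, so stale data contributes little to the GP posterior
instead of being discarded outright. The RBF kernel, the forgetting rate eps,
and the remaining hyperparameters are all assumed values.

```python
import numpy as np

def tv_gp_ucb_step(X, y, times, candidates, t_now, eps=0.01,
                   lengthscale=0.2, noise=0.1, beta=2.0):
    """One acquisition step of a TV-GP-UCB-style rule (a sketch).

    X: past query points, shape (n, d); y: past rewards, shape (n,);
    times: observation times, shape (n,). The kernel between two
    observations is a spatial RBF term damped by (1 - eps)**(dt / 2),
    so distant-in-time data is smoothly forgotten.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    times = np.asarray(times, dtype=float)

    def k_spatial(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    # Time-varying kernel: spatial similarity times a smooth forgetting factor
    K = k_spatial(X, X) * (1 - eps) ** (np.abs(times[:, None] - times[None, :]) / 2)
    K += noise ** 2 * np.eye(len(X))

    C = np.asarray(candidates, dtype=float)
    k_star = k_spatial(X, C) * (1 - eps) ** (np.abs(times[:, None] - t_now) / 2)

    # Standard GP posterior mean and variance at the candidate points
    alpha = np.linalg.solve(K, y)
    mu = k_star.T @ alpha
    v = np.linalg.solve(K, k_star)
    var = np.clip(1.0 - np.sum(k_star * v, axis=0), 1e-12, None)

    # UCB acquisition: pick the candidate maximizing mean + beta * stddev
    return C[np.argmax(mu + beta * np.sqrt(var))]
```

Under the same interface, a resetting rule in the spirit of R-GP-UCB would
simply empty X, y, and times every fixed number of steps instead of damping
the kernel.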