Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning
We consider an agent interacting with an environment in a single stream of
actions, observations, and rewards, with no reset. This process is not assumed
to be a Markov Decision Process (MDP). Rather, the agent has several
representations (mapping histories of past interactions to a discrete state
space) of the environment with unknown dynamics, only some of which result in
an MDP. The goal is to minimize the average regret criterion against an agent
who knows an MDP representation giving the highest optimal reward, and acts
optimally in it. Recent regret bounds for this setting are of order
O(T^{2/3}), with an additive term that is constant in T yet exponential in
some characteristics of the optimal MDP. We propose an algorithm whose regret
after T time steps is O(\sqrt{T}), with all constants reasonably small. This
is optimal in T, since O(\sqrt{T}) is the optimal regret for learning in a
(single discrete) MDP.
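As a rough illustration of this selection problem (a minimal sketch under a bandit-style simplification, not the paper's algorithm), each candidate representation can be treated as an arm whose upper confidence value determines which one to run next; all names below are assumptions made for illustration.

    import math
    import random

    class RepresentationSelector:
        """Optimistic selection among candidate state representations.

        Each candidate is treated as an arm: we track the empirical
        average reward collected while running it and prefer the one
        with the highest upper confidence value, so a representation
        inducing a good MDP ends up being played almost all the time.
        """

        def __init__(self, num_models, delta=0.05):
            self.num_models = num_models
            self.delta = delta
            self.reward_sum = [0.0] * num_models
            self.steps = [0] * num_models

        def optimistic_value(self, j, t):
            if self.steps[j] == 0:
                return float("inf")  # try every candidate at least once
            mean = self.reward_sum[j] / self.steps[j]
            bonus = math.sqrt(2.0 * math.log(max(t, 2) / self.delta)
                              / self.steps[j])
            return mean + bonus

        def select(self, t):
            return max(range(self.num_models),
                       key=lambda j: self.optimistic_value(j, t))

        def record(self, j, phase_reward, phase_len):
            self.reward_sum[j] += phase_reward
            self.steps[j] += phase_len

    # Toy usage: three candidates whose induced policies earn Bernoulli
    # rewards with unknown means; the selector concentrates on the best.
    means = [0.3, 0.5, 0.7]
    selector, t = RepresentationSelector(num_models=3), 0
    for _ in range(200):
        j = selector.select(t)
        phase_reward = sum(random.random() < means[j] for _ in range(50))
        selector.record(j, phase_reward, 50)
        t += 50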
Regret Bounds for Reinforcement Learning with Policy Advice
In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action space. Our empirical simulations support our
theoretical analysis. This suggests that RLPA may offer significant advantages
in large domains where good prior policies are provided.
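A minimal sketch of the policy-advice idea (under assumed interfaces, not the paper's exact procedure): maintain confidence intervals on each input policy's per-trial reward, always run the most optimistic active policy, and eliminate policies that are confidently suboptimal. The state space is never enumerated, matching the claim that the bookkeeping scales with the number of policies rather than with states or actions.

    import math

    def best_policy_by_advice(policies, run_trial, horizon, delta=0.05):
        # `run_trial(policy)` is an assumed callback returning the
        # average reward of one trial of that policy in the environment.
        sums = [0.0] * len(policies)
        counts = [0] * len(policies)
        active = set(range(len(policies)))

        def radius(n, t):
            if n == 0:
                return float("inf")
            return math.sqrt(math.log(2 * len(policies) * max(t, 2) / delta)
                             / (2 * n))

        def mean(j):
            return sums[j] / counts[j] if counts[j] else 0.0

        t = 0
        while t < horizon and len(active) > 1:
            # Run the active policy with the highest upper confidence bound.
            i = max(active, key=lambda j: mean(j) + radius(counts[j], t))
            sums[i] += run_trial(policies[i])
            counts[i] += 1
            t += 1
            # Drop policies whose upper bound falls below the best lower bound.
            best_lower = max(mean(j) - radius(counts[j], t)
                             for j in active if counts[j] > 0)
            active = {j for j in active
                      if mean(j) + radius(counts[j], t) >= best_lower}
        return max(active, key=mean)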
Selecting Near-Optimal Approximate State Representations in Reinforcement Learning
We consider a reinforcement learning setting introduced in (Maillard et al.,
NIPS 2011) where the learner does not have explicit access to the states of the
underlying Markov decision process (MDP). Instead, she has access to several
models that map histories of past interactions to states. Here we improve over
known regret bounds in this setting, and more importantly generalize to the
case where the models given to the learner do not contain a true model
resulting in an MDP representation, but only approximations of it. We also give
improved error bounds for state aggregation.
Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
We derive sublinear regret bounds for undiscounted reinforcement learning in
continuous state space. The proposed algorithm combines state aggregation with
the use of upper confidence bounds for implementing optimism in the face of
uncertainty. Besides the existence of an optimal policy which satisfies the
Poisson equation, the only assumptions made are Hölder continuity of rewards
and transition probabilities.
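The two ingredients named in the abstract admit a compact illustration (a sketch under assumed forms, not the paper's exact algorithm): a uniform discretization of a one-dimensional state space, and an optimistic reward estimate for each aggregated state.

    import math

    def aggregate(s, n_bins):
        """Map a continuous state s in [0, 1] to one of n_bins discrete
        states (uniform discretization, the simplest aggregation)."""
        return min(int(s * n_bins), n_bins - 1)

    def optimistic_mean_reward(reward_sum, visits, t, delta=0.05):
        """Empirical mean reward of an aggregated state plus an upper
        confidence bonus (optimism in the face of uncertainty). With
        Hölder-continuous rewards, the discretization adds a bias that
        shrinks as n_bins grows, which is what makes aggregation plus
        confidence bounds work. Illustrative confidence term only."""
        if visits == 0:
            return 1.0  # optimistic default: assume the maximal reward
        return (reward_sum / visits
                + math.sqrt(math.log(max(t, 2) / delta) / (2 * visits)))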
Extreme State Aggregation Beyond MDPs
We consider a Reinforcement Learning setup where an agent interacts with an
environment in observation-reward-action cycles without any (esp. MDP)
assumptions on the environment. State aggregation and more generally feature
reinforcement learning is concerned with mapping histories/raw-states to
reduced/aggregated states. The idea behind both is that the resulting reduced
process (approximately) forms a small stationary finite-state MDP, which can
then be efficiently solved or learnt. We considerably generalize existing
aggregation results by showing that even if the reduced process is not an MDP,
the (q-)value functions and (optimal) policies of an associated MDP with the same
state-space size solve the original problem, as long as the solution can
approximately be represented as a function of the reduced states. This implies
an upper bound on the required state space size that holds uniformly for all RL
problems. It may also explain why RL algorithms designed for MDPs sometimes
perform well beyond MDPs.
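One way to picture the result (an illustrative sketch; the function name and signature are assumptions): aggregate raw states, or histories, whose action-value vectors agree up to a tolerance eps, so that the (q-)values are approximately a function of the reduced state.

    def q_value_aggregation(q, states, actions, eps=0.1):
        """Merge states (or histories) whose action-value vectors agree
        up to eps. The aggregated process need not be an MDP, yet an
        associated MDP over these reduced states can still represent
        near-optimal behaviour whenever Q is approximately a function
        of the reduced state. `q` maps (state, action) pairs to values.
        """
        labels = {}   # eps-grid cell -> reduced-state index
        mapping = {}  # original state -> reduced state
        for s in states:
            key = tuple(round(q[(s, a)] / eps) for a in actions)
            if key not in labels:
                labels[key] = len(labels)
            mapping[s] = labels[key]
        return mapping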
Regret Bounds for Learning State Representations in Reinforcement Learning
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with O(\sqrt{T}) regret in any communicating MDP. The regret bound shows that UCB-MS automatically adapts to the Markov model and improves over the currently known best bound of order O(T^{2/3}).