Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning
We consider an agent interacting with an environment in a single stream of
actions, observations, and rewards, with no reset. This process is not assumed
to be a Markov Decision Process (MDP). Rather, the agent has several
representations (mapping histories of past interactions to a discrete state
space) of the environment with unknown dynamics, only some of which result in
an MDP. The goal is to minimize the average regret criterion against an agent
who knows an MDP representation giving the highest optimal reward, and acts
optimally in it. Recent regret bounds for this setting are of order
$O(T^{2/3})$ with an additive term constant yet exponential in some
characteristics of the optimal MDP. We propose an algorithm whose regret after
$T$ time steps is $O(\sqrt{T})$, with all constants reasonably small. This is
optimal in $T$ since $O(\sqrt{T})$ is the optimal regret in the setting of
learning in a (single discrete) MDP.
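As a rough illustration of the regret criterion mentioned in this abstract, the sketch below computes cumulative regret as the gap between what an agent earning the optimal average reward would collect over T steps and what the learner actually collected. The function name and the parameter `optimal_avg_reward` are hypothetical and chosen for illustration; the paper's exact definition may differ in details.

```python
from typing import Sequence


def cumulative_regret(rewards: Sequence[float], optimal_avg_reward: float) -> float:
    """Regret after T = len(rewards) steps.

    Assumption (not from the abstract): regret is T * rho_star minus the
    sum of collected rewards, where rho_star is the optimal average reward.
    """
    horizon = len(rewards)
    return horizon * optimal_avg_reward - sum(rewards)


def average_regret(rewards: Sequence[float], optimal_avg_reward: float) -> float:
    """Cumulative regret divided by the number of steps taken."""
    if not rewards:
        raise ValueError("need at least one reward")
    return cumulative_regret(rewards, optimal_avg_reward) / len(rewards)
```

Under this definition, a cumulative regret that grows like the square root of T means the average regret shrinks like one over the square root of T.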
Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
We derive sublinear regret bounds for undiscounted reinforcement learning in
continuous state space. The proposed algorithm combines state aggregation with
the use of upper confidence bounds for implementing optimism in the face of
uncertainty. Besides the existence of an optimal policy that satisfies the
Poisson equation, the only assumptions made are Hölder continuity of rewards
and transition probabilities.
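The two ingredients named in this abstract, state aggregation and upper confidence bounds, can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a one-dimensional state space [0, 1] split into equal-width intervals and a generic Hoeffding-style bonus, and it leaves out any term accounting for Hölder continuity. All names and the default `delta` are hypothetical.

```python
import math


def aggregate(state: float, n_intervals: int) -> int:
    """Map a state in [0, 1] to the index of its equal-width interval."""
    if not 0.0 <= state <= 1.0:
        raise ValueError("state must lie in [0, 1]")
    # state == 1.0 is assigned to the last interval
    return min(int(state * n_intervals), n_intervals - 1)


def optimistic_reward(reward_sum: float, visits: int, t: int,
                      delta: float = 0.05) -> float:
    """Empirical mean reward plus a confidence bonus, clipped to [0, 1].

    Assumption: rewards lie in [0, 1] and the bonus is a generic
    Hoeffding-style term, not the confidence interval from the paper.
    """
    if visits == 0:
        return 1.0  # unvisited aggregated states get the most optimistic value
    mean = reward_sum / visits
    bonus = math.sqrt(math.log(2 * t / delta) / (2 * visits))
    return min(1.0, mean + bonus)
```

The bonus shrinks as an aggregated state is visited more often, so rarely visited intervals look more attractive, which is the sense in which the estimate is optimistic in the face of uncertainty.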
Extreme State Aggregation Beyond MDPs
We consider a Reinforcement Learning setup where an agent interacts with an
environment in observation-reward-action cycles without any (esp. MDP)
assumptions on the environment. State aggregation and, more generally, feature
reinforcement learning are concerned with mapping histories/raw-states to
reduced/aggregated states. The idea behind both is that the resulting reduced
process (approximately) forms a small stationary finite-state MDP, which can
then be efficiently solved or learnt. We considerably generalize existing
aggregation results by showing that even if the reduced process is not an MDP,
the (q-)value functions and (optimal) policies of an associated MDP with the same
state-space size solve the original problem, as long as the solution can
approximately be represented as a function of the reduced states. This implies
an upper bound on the required state space size that holds uniformly for all RL
problems. It may also explain why RL algorithms designed for MDPs sometimes
perform well beyond MDPs.
Comment: 28 LaTeX pages. 8 Theorem
Regret Bounds for Learning State Representations in Reinforcement Learning
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with $O(\sqrt{T})$ regret in any communicating MDP. The regret bound shows that UCB-MS automatically adapts to the Markov model and improves over the currently known best bound of order $O(T^{2/3})$.
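To get a feel for the gap between the two rates quoted in this abstract, the sketch below compares T^(2/3) with the square root of T. It ignores all constants and logarithmic factors, which the abstract does not give, so the numbers indicate growth rates only and are not actual regret values. The function name is hypothetical.

```python
def rate_ratio(horizon: int) -> float:
    """Ratio T^(2/3) / T^(1/2) = T^(1/6), with constants and log factors ignored."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return horizon ** (2 / 3) / horizon ** 0.5


if __name__ == "__main__":
    for horizon in (10**3, 10**6, 10**9):
        print(horizon, rate_ratio(horizon))
```

The ratio grows like T^(1/6), so the advantage of the square-root rate widens slowly as the horizon increases.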