11 research outputs found
Selecting Near-Optimal Approximate State Representations in Reinforcement Learning
We consider a reinforcement learning setting introduced in (Maillard et al.,
NIPS 2011) where the learner does not have explicit access to the states of the
underlying Markov decision process (MDP). Instead, she has access to several
models that map histories of past interactions to states. Here we improve over
known regret bounds in this setting, and more importantly generalize to the
case where the models given to the learner do not contain a true model
resulting in an MDP representation but only approximations of it. We also give
improved error bounds for state aggregation.
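As a rough illustration of the selection idea (not the paper's algorithm; all names here are hypothetical), one can treat each candidate history-to-state model as an arm and pick the model with the highest optimistic score, so that under-tried models are explored and consistently weak models are played less:

```python
# Hypothetical sketch: each candidate model maps an interaction history
# to a discrete state. We keep per-model reward statistics and select
# the model with the highest optimistic (mean + bonus) score.

import math
from collections import defaultdict

class ModelSelector:
    def __init__(self, models):
        self.models = models                  # list of history -> state maps
        self.reward_sum = defaultdict(float)  # per-model cumulative reward
        self.count = defaultdict(int)         # per-model number of plays

    def choose(self):
        # Optimism in the face of uncertainty: untried models first,
        # then the highest upper-confidence score.
        def score(i):
            n = self.count[i]
            if n == 0:
                return float("inf")
            total = sum(self.count.values())
            return self.reward_sum[i] / n + math.sqrt(2 * math.log(total) / n)
        return max(range(len(self.models)), key=score)

    def update(self, i, reward):
        self.reward_sum[i] += reward
        self.count[i] += 1
```

The actual algorithm analyzed in the paper works with regret guarantees against the best (approximate) MDP representation; this bandit-style skeleton only conveys the optimistic-selection flavor.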
On overfitting and asymptotic bias in batch reinforcement learning with partial observability
This paper provides an analysis of the tradeoff between asymptotic bias
(suboptimality with unlimited data) and overfitting (additional suboptimality
due to limited data) in the context of reinforcement learning with partial
observability. Our theoretical analysis formally characterizes that while
potentially increasing the asymptotic bias, a smaller state representation
decreases the risk of overfitting. This analysis relies on expressing the
quality of a state representation by bounding L1 error terms of the associated
belief states. Theoretical results are empirically illustrated when the state
representation is a truncated history of observations, both on synthetic POMDPs
and on a large-scale POMDP in the context of smartgrids, with real-world data.
Finally, similarly to known results in the fully observable setting, we also
briefly discuss and empirically illustrate how using function approximators and
adapting the discount factor may enhance the tradeoff between asymptotic bias
and overfitting in the partially observable context.
Comment: Accepted at the Journal of Artificial Intelligence Research (JAIR), 31 pages.
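The truncated-history representation the paper experiments with can be sketched in a few lines; the history length `k` is exactly the knob in the bias-overfitting tradeoff (small `k`: fewer states, less overfitting risk, more asymptotic bias; large `k`: the reverse):

```python
# Illustrative sketch of a truncated-history state representation for a
# partially observable problem: the "state" is the tuple of the last k
# observations.

from collections import deque

def make_history_state(k):
    """Return a function mapping a stream of observations to
    truncated-history states (tuples of the last k observations)."""
    window = deque(maxlen=k)
    def step(observation):
        window.append(observation)
        return tuple(window)
    return step

state_fn = make_history_state(k=2)
assert state_fn("a") == ("a",)
assert state_fn("b") == ("a", "b")
assert state_fn("c") == ("b", "c")   # oldest observation dropped
```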
Model-Based Reinforcement Learning Exploiting State-Action Equivalence
Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, as a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of SA/C, in any communicating MDP with S states, A actions, and C classes, which corresponds to a massive improvement when C ≪ SA. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
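The core mechanism behind the known-structure case can be sketched as follows (a hedged illustration, not the paper's exact construction): transition counts for state-action pairs in the same equivalence class are pooled, so estimates are built from class-level sample sizes and the confidence widths shrink accordingly:

```python
# Sketch of equivalence-aware estimation with a KNOWN class structure:
# all (s, a) pairs in the same class share their transition samples.
# The width formula is a generic Hoeffding-style placeholder; the
# paper's confidence sets differ in detail.

import math
from collections import Counter, defaultdict

class ClassPooledModel:
    def __init__(self, class_of):
        self.class_of = class_of              # (s, a) -> class id (assumed known)
        self.counts = defaultdict(Counter)    # class -> next-state counts
        self.n = Counter()                    # class -> total samples

    def observe(self, s, a, s_next):
        c = self.class_of[(s, a)]
        self.counts[c][s_next] += 1
        self.n[c] += 1

    def estimate(self, s, a):
        c = self.class_of[(s, a)]
        total = self.n[c]
        return {s2: k / total for s2, k in self.counts[c].items()}

    def confidence_width(self, s, a, delta=0.05):
        c = self.class_of[(s, a)]
        return math.sqrt(2 * math.log(1 / delta) / max(self.n[c], 1))
```

Because a class of C-fold-equivalent pairs accumulates samples C times as fast as any single pair, its confidence set is correspondingly tighter, which is the source of the regret improvement claimed above.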
An Analysis of Model-Based Reinforcement Learning From Abstracted Observations
Many methods for model-based reinforcement learning (MBRL) in Markov decision
processes (MDPs) provide guarantees for both the accuracy of the model they can
deliver and the learning efficiency. At the same time, state abstraction
techniques allow for a reduction of the size of an MDP while maintaining a
bounded loss with respect to the original problem. Therefore, it may come as a
surprise that no such guarantees are available when combining both techniques,
i.e., where MBRL merely observes abstract states. Our theoretical analysis
shows that abstraction can introduce a dependence between samples collected
online (e.g., in the real world). That means that, without taking this
dependence into account, results for MBRL do not directly extend to this
setting. Our result shows that we can use concentration inequalities for
martingales to overcome this problem. This result makes it possible to extend
the guarantees of existing MBRL algorithms to the setting with abstraction. We
illustrate this by combining R-MAX, a prototypical MBRL algorithm, with
abstraction, thus producing the first performance guarantees for model-based
'RL from Abstracted Observations': model-based reinforcement learning with an
abstract model.
Comment: 36 pages, 2 figures, published in Transactions on Machine Learning Research (TMLR).
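The R-MAX-with-abstraction combination can be sketched minimally (an illustration of the idea, not the paper's analyzed algorithm; `m` and `r_max` are the usual R-MAX parameters): the learner only ever sees the abstract state `phi(s)`, keeps counts per abstract state-action pair, and stays optimistic until a pair is "known":

```python
# Minimal R-MAX-style sketch over abstracted observations.

from collections import Counter, defaultdict

class AbstractRMax:
    def __init__(self, phi, r_max=1.0, m=10):
        self.phi = phi          # state -> abstract state (the abstraction)
        self.r_max = r_max      # optimistic reward for unknown pairs
        self.m = m              # visits needed before a pair is "known"
        self.visits = Counter()             # (abs_state, action) -> count
        self.reward_sum = defaultdict(float)

    def observe(self, s, a, r):
        key = (self.phi(s), a)
        self.visits[key] += 1
        self.reward_sum[key] += r

    def reward_estimate(self, s, a):
        key = (self.phi(s), a)
        if self.visits[key] < self.m:
            return self.r_max   # optimistic until the pair is "known"
        return self.reward_sum[key] / self.visits[key]
```

Note that samples from different ground states map into the same abstract counter; this pooling is precisely what makes the collected samples dependent, which is the technical obstacle the paper resolves with martingale concentration inequalities.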
Upper Confidence Reinforcement Learning exploiting state-action equivalence
Leveraging an equivalence property on the set of states or state-action pairs in a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, when transition distributions are no longer assumed to be known, in a discrete MDP with average reward criterion and no reset. We study powerful similarities between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as Õ(D√(KCT)), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non-trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggest this is due to the ad-hoc criterion for stopping the episodes in UCRL2. We replace it with hypothesis testing, which in turn considerably improves all strategies. It is then empirically validated that learning the structure can be beneficial in a full-blown RL problem.
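The clustering step for the unknown-structure case can be sketched with a simple merge test (a hedged illustration of the principle, not the paper's exact procedure): two state-action pairs are declared equivalent when the L1 distance between their empirical transition estimates is small relative to their combined confidence widths:

```python
# Illustrative merge test for clustering state-action pairs by their
# empirical transition distributions.

import math

def l1_distance(p, q):
    """L1 distance between two distributions given as dicts."""
    support = set(p) | set(q)
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

def same_class(counts_p, n_p, counts_q, n_q, delta=0.05):
    """counts_* map next-states to counts; n_* are sample sizes.
    Merge when the empirical L1 gap is within the noise level."""
    p = {s: c / n_p for s, c in counts_p.items()}
    q = {s: c / n_q for s, c in counts_q.items()}
    width = math.sqrt(2 * math.log(2 / delta) / n_p) \
          + math.sqrt(2 * math.log(2 / delta) / n_q)
    return l1_distance(p, q) <= width
```

With few samples almost everything merges; as counts grow, only genuinely similar pairs survive, which is the "sound clustering" behavior the abstract refers to.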
Reinforcement Learning of POMDPs using Spectral Methods
We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm running through episodes; in each episode we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound w.r.t. the optimal memoryless policy and efficient scaling with respect to the dimensionality of the observation and action spaces.
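A very simplified taste of the spectral flavor (far short of the paper's method, which recovers POMDP parameters from higher-order moments): within an episode run under a fixed policy, consecutive-observation co-occurrence statistics form a low-rank matrix whose numerical rank reflects the number of hidden states:

```python
# Toy method-of-moments sketch: empirical co-occurrence matrix of
# consecutive observations and its numerical rank via SVD.

import numpy as np

def cooccurrence_matrix(observations, n_obs):
    """Empirical matrix M[i, j] ~ P(o_t = i, o_{t+1} = j)."""
    m = np.zeros((n_obs, n_obs))
    for o1, o2 in zip(observations, observations[1:]):
        m[o1, o2] += 1.0
    return m / max(len(observations) - 1, 1)

def numerical_rank(m, tol=1e-8):
    s = np.linalg.svd(m, compute_uv=False)
    return int(np.sum(s > tol * s[0]))
```

The full algorithm additionally uses third-order moments and an (assumed) optimization oracle for memoryless planning; this snippet only shows why second-order observation statistics carry latent-state information.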