71 research outputs found

    Abstraction in Reinforcement Learning

    Get PDF
    Abstraction is an important tool for an intelligent agent. It can help the agent act in complex environments by selecting which details are important and which to ignore. In my thesis, I describe a novel abstraction algorithm called Online Partition Iteration, which is based on the theory of Markov Decision Process homomorphisms. The algorithm can find abstractions from a stream of collected experience in environments with high-dimensional states and large action spaces. I also introduce a technique for transferring the found abstractions between tasks that outperforms a deep Q-network baseline in the majority of my experiments. Finally, I prove the correctness of my abstraction algorithm.
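
    For background, Online Partition Iteration rests on the theory of MDP homomorphisms. The standard definition below (following Ravindran and Barto, not necessarily the thesis's own notation) requires the abstraction to preserve rewards and lumped transition probabilities:

```latex
% An MDP homomorphism from M = (S, A, P, R) to an abstract MDP M' = (S', A', P', R')
% is a pair of maps h = (f, {g_s}) with f : S -> S' and g_s : A -> A' such that
R'\bigl(f(s),\, g_s(a)\bigr) = R(s, a),
\qquad
P'\bigl(f(s') \mid f(s),\, g_s(a)\bigr) \;=\; \sum_{s'' \in f^{-1}(f(s'))} P(s'' \mid s, a).
```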

    A Simple Approach for State-Action Abstraction using a Learned MDP Homomorphism

    Full text link
    Animals are able to rapidly infer from limited experience when sets of state-action pairs have equivalent reward and transition dynamics. Modern reinforcement learning systems, on the other hand, must painstakingly learn through trial and error that sets of state-action pairs are value equivalent, requiring an often prohibitively large number of samples from their environment. MDP homomorphisms have been proposed that reduce the observed MDP of an environment to an abstract MDP, which can enable more sample-efficient policy learning. Consequently, impressive improvements in sample efficiency have been achieved when a suitable MDP homomorphism can be constructed a priori, usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing a homomorphism in discrete action spaces, which uses a partial model of environment dynamics to infer which state-action pairs lead to the same state, reducing the size of the state-action space by a factor equal to the cardinality of the action space. We call this method equivalent effect abstraction. In a gridworld setting, we demonstrate empirically that equivalent effect abstraction can improve sample efficiency in a model-free setting and planning efficiency for model-based approaches. Furthermore, we show on CartPole that our approach outperforms an existing method for learning homomorphisms while using 33x less training data. Comment: Previously presented at the Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM) 202
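
    A minimal sketch of the grouping idea described in the abstract: state-action pairs that a partial dynamics model predicts will reach the same next state are collapsed onto one canonical representative, so they can share a single value-table entry. The `predict_next_state` model and the tabular layout are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def build_equivalence_classes(states, actions, predict_next_state):
    """Group (state, action) pairs by the next state a partial dynamics model
    predicts for them; every pair in a group is mapped to one canonical
    representative, so all of them can share a single value-table entry."""
    groups = defaultdict(list)
    for s in states:
        for a in actions:
            # predict_next_state is the learned/partial model of the dynamics;
            # states are assumed hashable so they can serve as dictionary keys.
            groups[predict_next_state(s, a)].append((s, a))

    canonical = {}
    for pairs in groups.values():
        representative = pairs[0]
        for pair in pairs:
            canonical[pair] = representative
    return canonical

# Usage idea: a tabular learner updates Q[canonical[(s, a)]] instead of Q[(s, a)],
# shrinking the effective state-action space roughly by the number of actions.
```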

    On overfitting and asymptotic bias in batch reinforcement learning with partial observability

    Full text link
    This paper provides an analysis of the tradeoff between asymptotic bias (suboptimality with unlimited data) and overfitting (additional suboptimality due to limited data) in the context of reinforcement learning with partial observability. Our theoretical analysis formally shows that, while a smaller state representation potentially increases the asymptotic bias, it decreases the risk of overfitting. The analysis relies on expressing the quality of a state representation by bounding L1 error terms of the associated belief states. The theoretical results are empirically illustrated when the state representation is a truncated history of observations, both on synthetic POMDPs and on a large-scale POMDP in the context of smartgrids, with real-world data. Finally, similarly to known results in the fully observable setting, we also briefly discuss and empirically illustrate how using function approximators and adapting the discount factor may enhance the tradeoff between asymptotic bias and overfitting in the partially observable context. Comment: Accepted at the Journal of Artificial Intelligence Research (JAIR) - 31 pages
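
    Schematically, the tradeoff analyzed here has the following shape; the notation is illustrative, not the paper's exact bound. The suboptimality of a policy learned from a dataset D with state representation phi splits into an asymptotic-bias term and an overfitting term:

```latex
V^{\pi^*}(s) - V^{\hat{\pi}_{\phi, D}}(s)
\;\lesssim\;
\underbrace{\epsilon_{\text{bias}}(\phi)}_{\text{asymptotic bias}}
\;+\;
\underbrace{\epsilon_{\text{overfit}}(\phi, |D|)}_{\text{overfitting}}
```

    Coarsening the representation phi typically increases the first term while decreasing the second, which is the tradeoff the paper characterizes.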

    MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning

    Get PDF
    This paper introduces MDP homomorphic networks for deep reinforcement learning. MDP homomorphic networks are neural networks that are equivariant under symmetries in the joint state-action space of an MDP. Current approaches to deep reinforcement learning do not usually exploit knowledge about such structure. By building this prior knowledge into policy and value networks using an equivariance constraint, we can reduce the size of the solution space. We specifically focus on group-structured symmetries (invertible transformations). Additionally, we introduce an easy method for constructing equivariant network layers numerically, so the system designer need not solve the constraints by hand, as is typically done. We construct MDP homomorphic MLPs and CNNs that are equivariant under either a group of reflections or rotations. We show that such networks converge faster than unstructured baselines on CartPole, a grid world, and Pong.
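
    One simple way to satisfy the equivariance constraint for a finite symmetry group is group averaging of an unconstrained weight matrix. The numpy sketch below only illustrates the constraint itself (rho_out(g) W = W rho_in(g)); it is not the paper's own numerical construction.

```python
import numpy as np

def symmetrize(W, reps_in, reps_out):
    """Project a weight matrix onto the space of equivariant linear maps.

    reps_in and reps_out are lists of invertible matrices representing each
    group element on the input and output spaces, respectively.
    """
    avg = np.zeros_like(W)
    for rho_in, rho_out in zip(reps_in, reps_out):
        avg += np.linalg.inv(rho_out) @ W @ rho_in
    return avg / len(reps_in)

# Example: a layer equivariant under horizontal reflection of a 2D input.
flip = np.array([[0.0, 1.0], [1.0, 0.0]])  # swaps the two coordinates
identity = np.eye(2)
W = np.random.randn(2, 2)
W_eq = symmetrize(W, reps_in=[identity, flip], reps_out=[identity, flip])
# Check the equivariance constraint: reflecting the input and then applying
# the layer equals applying the layer and then reflecting the output.
assert np.allclose(flip @ W_eq, W_eq @ flip)
```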

    A taxonomy for similarity metrics between Markov decision processes

    Get PDF
    Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer learning. Transfer is based on the idea of reusing the knowledge acquired while learning a set of source tasks in a new learning process on a target task, assuming that the target and source tasks are close enough. In recent years, transfer learning has succeeded in making reinforcement learning (RL) algorithms more efficient (e.g., by reducing the number of samples needed to achieve (near-)optimal performance). Transfer in RL is based on the core concept of similarity: whenever the tasks are similar, the transferred knowledge can be reused to solve the target task and significantly improve the learning performance. Therefore, the selection of good metrics to measure these similarities is a critical aspect when building transfer RL algorithms, especially when this knowledge is transferred from simulation to the real world. In the literature, there are many metrics to measure the similarity between MDPs, and hence many definitions of similarity, or of its complement, distance, have been considered. In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far, taking this categorization into account. We also follow this taxonomy to survey the existing literature and to suggest future directions for the construction of new metrics. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has also been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
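
    As one representative example of the kind of definition being categorized, the classic bisimulation metric of Ferns et al. measures the distance between two states of an MDP as a fixed point of

```latex
d(s, t) \;=\; \max_{a \in A} \Bigl( c_R \,\bigl| R(s, a) - R(t, a) \bigr|
\;+\; c_T \, W_d\bigl( P(\cdot \mid s, a),\, P(\cdot \mid t, a) \bigr) \Bigr)
```

    where W_d is the Wasserstein (Kantorovich) distance between next-state distributions computed under d itself, and c_R, c_T are constants weighting the reward and transition terms.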

    Offline Reinforcement Learning with Pseudometric Learning

    Full text link
    Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to constrain the policy to visit state-action pairs close to the support of the logged transitions. In this work, we propose an iterative procedure to learn a pseudometric (closely related to bisimulation metrics) from logged transitions, and use it to define this notion of closeness. We show its convergence and extend it to the function approximation setting. We then use this pseudometric to define a new lookup-based bonus in an actor-critic algorithm: PLOFF. This bonus encourages the actor to stay close, in terms of the defined pseudometric, to the support of the logged transitions. Finally, we evaluate the method on hand manipulation and locomotion tasks. Comment: ICML 202
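
    A minimal sketch of a lookup-style bonus of this flavor: penalize a candidate state-action pair by its pseudometric distance to the nearest logged transition. The `pseudometric` callable and the exact bonus shape are assumptions for illustration, not PLOFF's definition.

```python
def lookup_bonus(state, action, logged_pairs, pseudometric, scale=1.0):
    """Negative bonus proportional to the distance from the closest logged
    state-action pair; adding it to the actor's objective discourages the
    policy from straying far from the support of the dataset."""
    nearest = min(pseudometric((state, action), (s_i, a_i))
                  for (s_i, a_i) in logged_pairs)
    return -scale * nearest
```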