71 research outputs found
Abstraction in Reinforcement Learning
Abstraction is an important tool for an intelligent agent. It can help the agent act in complex environments by selecting which details are important and which to ignore. In my thesis, I describe a novel abstraction algorithm called Online Partition Iteration, which is based on the theory of Markov Decision Process homomorphisms. The algorithm can find abstractions from a stream of collected experience in environments with high-dimensional states and a large number of available actions. I also introduce a technique for transferring the found abstractions between tasks that outperforms a deep Q-network baseline in the majority of my experiments. Finally, I prove the correctness of my abstraction algorithm.
A Simple Approach for State-Action Abstraction using a Learned MDP Homomorphism
Animals are able to rapidly infer from limited experience when sets of state
action pairs have equivalent reward and transition dynamics. On the other hand,
modern reinforcement learning systems must painstakingly learn through trial
and error that sets of state action pairs are value equivalent -- requiring an
often prohibitively large amount of samples from their environment. MDP
homomorphisms have been proposed that reduce the observed MDP of an environment
to an abstract MDP, which can enable more sample efficient policy learning.
Consequently, impressive improvements in sample efficiency have been achieved
when a suitable MDP homomorphism can be constructed a priori -- usually by
exploiting a practitioner's knowledge of environment symmetries. We propose a
novel approach to constructing a homomorphism in discrete action spaces, which
uses a partial model of environment dynamics to infer which state action pairs
lead to the same state -- reducing the size of the state-action space by a
factor equal to the cardinality of the action space. We call this method
equivalent effect abstraction. In a gridworld setting, we demonstrate
empirically that equivalent effect abstraction can improve sample efficiency in
a model-free setting and planning efficiency for model-based approaches.
Furthermore, we show on cartpole that our approach outperforms an existing
method for learning homomorphisms, while using 33x less training data.
Comment: Previously Presented at the Multi-disciplinary Conference on
Reinforcement Learning and Decision Making (RLDM) 202
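As a toy illustration of the idea described in this abstract, the sketch below groups state-action pairs by the state they lead to. It is only a sketch under stated assumptions: the 3x3 deterministic gridworld, the `step` function, and the `abstract` helper are hypothetical and not taken from the paper, and the true transition function stands in for the learned partial model the abstract mentions.

```python
from itertools import product

# Hypothetical 3x3 deterministic gridworld (an assumption, not the paper's setup).
SIZE = 3
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Deterministic transition; a move off the grid leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < SIZE and 0 <= nc < SIZE:
        return (nr, nc)
    return state


def abstract(state, action):
    """Map a state-action pair to its effect, i.e. the state it leads to.

    The true dynamics are used here in place of a learned model.
    """
    return step(state, action)


states = list(product(range(SIZE), repeat=2))
pairs = [(s, a) for s in states for a in ACTIONS]

# Pairs with the same effect collapse to a single abstract entry.
abstract_pairs = {abstract(s, a) for s, a in pairs}

print(len(pairs), len(abstract_pairs))
```

In this toy grid the 36 state-action pairs collapse to 9 abstract entries, a reduction by the number of actions, which matches the factor stated in the abstract.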
On overfitting and asymptotic bias in batch reinforcement learning with partial observability
This paper provides an analysis of the tradeoff between asymptotic bias
(suboptimality with unlimited data) and overfitting (additional suboptimality
due to limited data) in the context of reinforcement learning with partial
observability. Our theoretical analysis formally characterizes that while
potentially increasing the asymptotic bias, a smaller state representation
decreases the risk of overfitting. This analysis relies on expressing the
quality of a state representation by bounding L1 error terms of the associated
belief states. Theoretical results are empirically illustrated when the state
representation is a truncated history of observations, both on synthetic POMDPs
and on a large-scale POMDP in the context of smartgrids, with real-world data.
Finally, similarly to known results in the fully observable setting, we also
briefly discuss and empirically illustrate how using function approximators and
adapting the discount factor may enhance the tradeoff between asymptotic bias
and overfitting in the partially observable context.
Comment: Accepted at the Journal of Artificial Intelligence Research (JAIR) -
31 page
MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning
This paper introduces MDP homomorphic networks for deep reinforcement
learning. MDP homomorphic networks are neural networks that are equivariant
under symmetries in the joint state-action space of an MDP. Current approaches
to deep reinforcement learning do not usually exploit knowledge about such
structure. By building this prior knowledge into policy and value networks
using an equivariance constraint, we can reduce the size of the solution space.
We specifically focus on group-structured symmetries (invertible
transformations). Additionally, we introduce an easy method for constructing
equivariant network layers numerically, so the system designer need not solve
the constraints by hand, as is typically done. We construct MDP homomorphic
MLPs and CNNs that are equivariant under either a group of reflections or
rotations. We show that such networks converge faster than unstructured
baselines on CartPole, a grid world and Pong.
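The equivariance constraint mentioned in this abstract can be illustrated numerically with group averaging. The sketch below is a minimal example under stated assumptions, not necessarily the paper's exact construction: it assumes a CartPole-style reflection group that negates a 4-dimensional state and swaps the two actions, and the names `state_reps`, `action_reps`, and `symmetrize` are hypothetical.

```python
import numpy as np

# Assumed two-element reflection group: identity and reflection.
# The reflection negates the 4-d state and swaps the 2 actions.
state_reps = [np.eye(4), -np.eye(4)]
action_reps = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]


def symmetrize(W):
    """Project W onto equivariant linear maps by averaging K_g^{-1} W L_g over the group."""
    terms = [np.linalg.inv(K) @ W @ L for K, L in zip(action_reps, state_reps)]
    return sum(terms) / len(terms)


rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))  # unconstrained weights of a linear layer
W_eq = symmetrize(W)         # equivariant weights

s = rng.normal(size=4)
reflected_input = W_eq @ (state_reps[1] @ s)    # transform the state, then apply the layer
reflected_output = action_reps[1] @ (W_eq @ s)  # apply the layer, then transform the output
print(np.allclose(reflected_input, reflected_output))
```

The final check confirms that reflecting the state before the layer gives the same result as swapping the action outputs after it, which is what equivariance requires for this group.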
A taxonomy for similarity metrics between Markov decision processes
Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer learning. Transfer is based on the idea of reusing the knowledge acquired in learning a set of source tasks in a new learning process on a target task, assuming that the target and source tasks are close enough. In recent years, transfer learning has succeeded in making reinforcement learning (RL) algorithms more efficient (e.g., by reducing the number of samples needed to achieve (near-)optimal performance). Transfer in RL is based on the core concept of similarity: whenever the tasks are similar, the transferred knowledge can be reused to solve the target task and significantly improve the learning performance. Therefore, the selection of good metrics to measure these similarities is a critical aspect when building transfer RL algorithms, especially when this knowledge is transferred from simulation to the real world. In the literature, there are many metrics to measure the similarity between MDPs, and hence many definitions of similarity, or of its complement, distance, have been considered. In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far, taking into account such categorization. We also follow this taxonomy to survey the existing literature and to suggest future directions for the construction of new metrics.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has also been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
Offline Reinforcement Learning with Pseudometric Learning
Offline Reinforcement Learning methods seek to learn a policy from logged
transitions of an environment, without any interaction. In the presence of
function approximation, and under the assumption of limited coverage of the
state-action space of the environment, it is necessary to constrain the policy to
visit state-action pairs close to the support of logged transitions. In this
work, we propose an iterative procedure to learn a pseudometric (closely
related to bisimulation metrics) from logged transitions, and use it to define
this notion of closeness. We show its convergence and extend it to the function
approximation setting. We then use this pseudometric to define a new
lookup-based bonus in an actor-critic algorithm: PLOFF. This bonus encourages the
actor to stay close, in terms of the defined pseudometric, to the support of
logged transitions. Finally, we evaluate the method on hand manipulation and
locomotion tasks.
Comment: ICML 202