7 research outputs found

    Abstraction-Guided Modular Reinforcement Learning

    No full text
    Reinforcement learning (RL) models the learning process of humans, but as exciting advances are made with increasingly deep neural networks, some of the fundamental strengths of human learning remain underutilized by RL agents. One of the most exciting properties of RL is that it appears to be incredibly flexible, requiring no model or knowledge of the task to be solved. However, this thesis argues that RL is inherently inflexible for two main reasons: 1. if existing knowledge is available, incorporating it without compromising the optimality of the solution is highly non-trivial, and 2. RL solutions cannot easily be transferred between tasks, and generally require complete retraining to guarantee that a solution will work in a new task. Humans, on the other hand, are very flexible learners. We easily transfer knowledge from one task to another, and can learn from knowledge gained in other tasks or shared with us by other people. Humans are exceptionally good at abstraction, or developing conceptual understandings that allow us to extend knowledge to never-before-seen experiences. No artificial agent or neural network has displayed the abstraction and generalization capabilities of humans across such varied tasks and environments.
    Despite this, utilizing the human as a tool for abstraction is commonly done only at the stage of defining the model. In general, this means making choices about what to include in the state space so that the problem is solvable without adding unnecessary complexity. While necessary, this step is not explicitly referred to as abstraction, and it is generally not considered relevant to how RL is applied. Much of the research in RL is less focused on how the problem is modelled, and instead centers on the development and application of computational advances that allow for solving bigger and bigger problems. Applying abstraction explicitly is highly non-trivial: confirming that an abstract problem preserves the necessary information of the true problem can generally only be done once a full solution has been found, which may defeat the purpose of finding an abstraction if such a solution cannot be found. Even when such a confirmation can be made, the abstraction can be the result of a very complex function that would be difficult for a human to define.
    In this work, human-defined abstractions are used in a way that goes beyond the initial definition of the problem. The first approach, presented in Chapter 3, breaks a problem into several abstract problems and uses the same experience to solve each of them at the same time. A meta-agent learns how to compose the learned policies together to find the optimal policy. In Chapter 4, a method is introduced that uses supervised learning to train a model on partially observable experience labelled with hindsight. The agent then learns a policy on predicted states, trading off information gathering with reward maximization. The last method, presented in Chapter 5, is a modular approach to offline RL, where even with expert data a method can become ineffective if the given data does not cover the entire problem space. This method introduces a second problem of recovering the agent to a state where it can safely follow the expert's action. The method applies abstraction to multiply the given data and safely plan recovery policies. Combining the recovery policies with the imitation policy maintains high performance even when the expert data provided is limited.
    In the methods developed in this research, a learning-to-learn component enables the agent to relax the usually strict requirements of abstraction, the parallel processing allows the agent to learn more from fewer samples, and the modularity means that the agent can transfer its knowledge to other related tasks.
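A minimal sketch, assuming a tabular setting, of how the Chapter 3 idea might look: several Q-learning modules, each defined over its own human-provided abstraction of the state, are updated in parallel from the same transitions, and a meta-agent keeps a learned weighting that composes their Q-values into one policy. The class names, abstraction maps `phi`, and the preference update are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

class QModule:
    """One abstract sub-problem: tabular Q-learning over phi(state)."""
    def __init__(self, n_abs_states, n_actions, phi, alpha=0.1, gamma=0.99):
        self.q = np.zeros((n_abs_states, n_actions))
        self.phi, self.alpha, self.gamma = phi, alpha, gamma

    def update(self, s, a, r, s_next, done):
        z, z_next = self.phi(s), self.phi(s_next)
        target = r + (0.0 if done else self.gamma * self.q[z_next].max())
        self.q[z, a] += self.alpha * (target - self.q[z, a])

class MetaAgent:
    """Composes module Q-values with weights adapted from observed returns."""
    def __init__(self, n_modules, lr=0.05):
        self.prefs = np.zeros(n_modules)
        self.lr = lr

    def weights(self):
        e = np.exp(self.prefs - self.prefs.max())
        return e / e.sum()

    def act(self, modules, s):
        combined = sum(w * m.q[m.phi(s)] for w, m in zip(self.weights(), modules))
        return int(np.argmax(combined))

    def update(self, modules, s0, episode_return):
        # favor modules whose own value estimate best matched the observed return
        errors = np.array([abs(m.q[m.phi(s0)].max() - episode_return) for m in modules])
        self.prefs += self.lr * (errors.mean() - errors)
```

In this sketch every environment transition is fed to every module's `update`, so all abstract problems learn in parallel from the same experience, and the meta-agent's `act` composes the resulting policies.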

    Abstraction-Guided Policy Recovery from Expert Demonstrations

    No full text
    The goal in behavior cloning is to extract meaningful information from expert demonstrations and reproduce the same behavior autonomously. However, the available data is unlikely to exhaustively cover the potential problem space. As a result, the quality of automated decision making is compromised without elegant ways to handle out-of-distribution states that might be encountered due to unforeseen events in the environment. Our novel approach RECO uses only the offline data available to recover a behavioral cloning agent from unknown states. Given expert trajectories, RECO learns both an imitation policy and a recovery policy. Our contribution is a method for learning this recovery policy that steers the agent back to the trajectories in the data set from unknown states. While there is, by definition, no data available to learn the recovery policy, we exploit abstractions to generalize beyond the available data, thus overcoming this problem. In a tabular domain, we show how our method results in drastically fewer calls to a human supervisor without compromising solution quality and with few trajectories provided by an expert. We further introduce a continuous adaptation of RECO and evaluate its potential in an experiment.
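A toy tabular sketch of the recovery idea described above, under the assumption that an abstraction map `phi` and an abstract transition model `abstract_transitions` are given. It illustrates the switch between imitating the expert in covered states and planning back toward the data from unknown states; RECO's actual learning procedure is not reproduced here, and all names are hypothetical.

```python
from collections import deque

def imitation_policy(expert_trajectories):
    """Read the imitation policy directly off the expert data (last action wins)."""
    pi = {}
    for trajectory in expert_trajectories:
        for state, action in trajectory:
            pi[state] = action
    return pi

def recovery_action(state, pi, phi, abstract_transitions):
    """Breadth-first search in the abstract model toward any abstract state covered by the data."""
    covered = {phi(s) for s in pi}
    start = phi(state)
    queue, parents = deque([start]), {start: None}
    while queue:
        z = queue.popleft()
        if z in covered and z != start:
            while parents[z][0] != start:      # walk back to the first step out of `start`
                z = parents[z][0]
            return parents[z][1]
        for action, z_next in abstract_transitions.get(z, []):
            if z_next not in parents:
                parents[z_next] = (z, action)
                queue.append(z_next)
    return None                                # no recovery found: defer to a human supervisor

def act(state, pi, phi, abstract_transitions):
    if state in pi:
        return pi[state]                       # covered by the expert data: imitate
    return recovery_action(state, pi, phi, abstract_transitions)
```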

    PEBL: Pessimistic Ensembles for Offline Deep Reinforcement Learning

    No full text
    Offline reinforcement learning (RL), or learning from a fixed data set, is an attractive alternative to online RL. Offline RL promises to address the cost and safety implications of taking numerous random or bad actions online, a crucial aspect of traditional RL that makes it difficult to apply in real-world problems. However, when RL is naïvely applied to a fixed data set, the resulting policy may exhibit poor performance in the real environment. This happens due to over-estimation of the value of state-action pairs not sufficiently covered by the data set. A promising way to avoid this is by applying pessimism and acting according to a lower bound estimate on the value. It has been shown that penalizing the learned value according to a pessimistic bound on the uncertainty can drastically improve offline RL. In deep reinforcement learning, however, uncertainty estimation is highly non-trivial and development of effective uncertainty-based pessimistic algorithms remains an open question. This paper introduces two novel offline deep RL methods built on Double Deep Q-Learning and Soft Actor-Critic. We show how a multi-headed bootstrap approach to uncertainty estimation is used to calculate an effective pessimistic value penalty. Our approach is applied to benchmark offline deep RL domains, where we demonstrate that our methods can often beat the current state-of-the-art.
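The kind of penalty described here can be sketched as an ensemble mean minus a multiple of the ensemble standard deviation, computed from a multi-headed Q-network. The sketch below is an illustration under assumed names and architecture (a shared trunk with `n_heads` linear heads, penalty weight `beta`); it is not the paper's implementation, which builds on Double Deep Q-Learning and Soft Actor-Critic.

```python
import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    """Shared trunk with several bootstrap heads, each estimating Q(s, a)."""
    def __init__(self, obs_dim, n_actions, n_heads=5, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, n_actions) for _ in range(n_heads))

    def forward(self, obs):                        # -> (n_heads, batch, n_actions)
        h = self.trunk(obs)
        return torch.stack([head(h) for head in self.heads])

def pessimistic_target(q_net, next_obs, rewards, dones, gamma=0.99, beta=1.0):
    """Lower-bound bootstrap target: ensemble mean minus beta times ensemble std."""
    with torch.no_grad():
        q_all = q_net(next_obs)                    # (heads, batch, actions)
        mean_q = q_all.mean(dim=0)
        std_q = q_all.std(dim=0)                   # disagreement as an uncertainty proxy
        pessimistic_q = (mean_q - beta * std_q).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * pessimistic_q
```

States poorly covered by the data set tend to produce larger head disagreement, so their targets are penalized more heavily, which is the intuition behind acting on a lower-bound value estimate.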

    Interval Q-Learning: Balancing Deep and Wide Exploration

    No full text
    Reinforcement learning requires exploration, leading to repeated execution of sub-optimal actions. Naive exploration techniques address this problem by changing gradually from exploration to exploitation. This approach employs a wide search resulting in exhaustive exploration and low sample-efficiency. More advanced search methods explore optimistically based on an upper bound estimate of expected rewards. These methods employ deep search, aiming to reach states not previously visited. Another deep search strategy is found in action-elimination methods, which aim to discover and eliminate sub-optimal actions. Despite the effectiveness of advanced deep search strategies, some problems are better suited to naive exploration. We devise a new method, called Interval Q-Learning, that finds a balance between wide and deep search. It assigns a small probability to taking sub-optimal actions and combines both greedy and optimistic exploration. This allows for fast convergence to a near-optimal policy, and then exploration around it. We demonstrate the performance of tabular and deep Q-network versions of Interval Q-Learning, showing that it offers convergence speed-up both in problems that favor wide exploration methods and those that favor deep search strategies.
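The abstract does not spell out the update or selection rule, so the following is only one plausible, simplified reading of the action-selection idea: act optimistically on an upper-bound estimate most of the time (deep search), sometimes act greedily on the current estimate, and keep a small probability of a uniformly random, possibly sub-optimal action (wide search). All constants and names are hypothetical; this is not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def interval_action(q_lower, q_upper, state, p_random=0.05, p_greedy=0.25):
    """q_lower, q_upper: arrays of shape (n_states, n_actions)."""
    u = rng.random()
    if u < p_random:                        # wide: keep a small chance of sub-optimal actions
        return int(rng.integers(q_lower.shape[1]))
    if u < p_random + p_greedy:             # exploit the current greedy estimate
        return int(np.argmax(q_lower[state]))
    return int(np.argmax(q_upper[state]))   # deep: optimism drives the agent toward novel states
```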

    A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics using Constrained Reinforcement Learning

    No full text
    The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositions come at the cost of flexibility, and applying them often relies on complex parameters and hard-coded knowledge modelled by the reward function. Autonomous driving is one such domain that could greatly benefit from more efficient and verifiable methods for safe automation. We propose to approach the automated driving problem using constrained RL, a method that automates the trade-off between risk and utility, thereby significantly reducing the burden on the designer. We first show that a reward function engineered to ensure safety and utility in one specific environment might not result in optimal behavior when the traffic dynamics change in that same environment. Next, we show how algorithms based on constrained RL, which are more robust to environmental disturbances, can address this challenge. These algorithms use a simple and easy-to-interpret reward and cost function, and are able to maintain both efficiency and safety without requiring reward parameter tuning. We demonstrate our approach in an automated merging scenario with different traffic configurations, such as a low or high chance of cooperative drivers and different cooperative driving strategies.
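Constrained RL is commonly implemented with a Lagrangian relaxation, which is one way the automated trade-off between risk and utility can work: a multiplier prices the safety cost against a budget and is adapted automatically, so no reward weights need hand tuning. The sketch below shows that generic recipe under assumed names and hyperparameters; it is not necessarily the exact algorithm used in the paper.

```python
def lagrangian_step(episode_reward, episode_cost, lam, cost_budget=0.1, lam_lr=0.01):
    """Dual ascent on the multiplier: raise lambda whenever the cost budget is exceeded."""
    lam = max(0.0, lam + lam_lr * (episode_cost - cost_budget))
    # The policy itself is trained with any RL algorithm on this scalarized return:
    shaped_return = episode_reward - lam * episode_cost
    return lam, shaped_return

# Example: a merging agent that repeatedly violates the cost budget sees lambda grow,
# which shifts its objective toward safety until the constraint is satisfied.
lam = 0.0
for episode_reward, episode_cost in [(10.0, 0.4), (9.0, 0.2), (8.5, 0.05)]:
    lam, shaped = lagrangian_step(episode_reward, episode_cost, lam)
    print(f"lambda={lam:.3f}, shaped return={shaped:.2f}")
```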