Semantic HELM: A Human-Readable Memory for Reinforcement Learning
Reinforcement learning agents deployed in the real world often have to cope
with partially observable environments. Therefore, most agents employ memory
mechanisms to approximate the state of the environment. Recently, there have
been impressive success stories in mastering partially observable environments,
mostly in the realm of computer games like Dota 2, StarCraft II, or Minecraft.
However, existing methods lack interpretability in the sense that it is not
comprehensible to humans what the agent stores in its memory. To address this,
we propose a novel memory mechanism that represents past events in human
language. Our method uses CLIP to associate visual inputs with language tokens.
Then we feed these tokens to a pretrained language model that serves the agent
as memory and provides it with a coherent and human-readable representation of
the past. We train our memory mechanism on a set of partially observable
environments and find that it excels on tasks that require a memory component,
while mostly attaining performance on-par with strong baselines on tasks that
do not. On a challenging continuous recognition task, where memorizing the past
is crucial, our memory mechanism converges two orders of magnitude faster than
prior methods. Since our memory mechanism is human-readable, we can peek at an
agent's memory and check whether crucial pieces of information have been
stored. This significantly enhances troubleshooting and paves the way toward
more interpretable agents.
Comment: To appear at NeurIPS 2023, 10 pages (+ references and appendix), Code: https://github.com/ml-jku/hel
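The two-stage pipeline described above lends itself to a short sketch. The following Python fragment illustrates the idea, associating an observation with language tokens via CLIP and summarizing the token history with a pretrained language model, using open_clip and Hugging Face transformers. The toy vocabulary, the model choices (ViT-B-32, GPT-2), and the function names are illustrative assumptions, not the authors' exact implementation; see their repository for that.

```python
# Illustrative sketch only: model choices, the toy vocabulary, and function
# names are assumptions, not the authors' implementation.
import torch
import open_clip
from transformers import AutoModel, AutoTokenizer

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
clip_tokenizer = open_clip.get_tokenizer("ViT-B-32")
lm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")

# Pre-compute CLIP text embeddings for a fixed vocabulary of candidate tokens.
vocab = ["key", "door", "ball", "box", "wall"]  # toy vocabulary (assumption)
with torch.no_grad():
    text_emb = clip_model.encode_text(clip_tokenizer(vocab))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def observation_to_tokens(image, k=2):
    """Associate a visual observation with its k nearest vocabulary tokens
    in CLIP's joint embedding space."""
    with torch.no_grad():
        img_emb = clip_model.encode_image(preprocess(image).unsqueeze(0))
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ text_emb.T).squeeze(0)
    return [vocab[i] for i in scores.topk(k).indices]

def memory_feature(token_history):
    """Feed the human-readable token history through the frozen LM; the last
    hidden state serves as the agent's memory representation of the past."""
    ids = lm_tokenizer(" ".join(token_history), return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids)
    return out.last_hidden_state[:, -1]
```

Because the memory is a plain list of words, one can print `token_history` at any step to inspect what the agent has stored.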
Learning to Modulate pre-trained Models in RL
Reinforcement Learning (RL) has been successful in various domains like
robotics, game playing, and simulation. While RL agents have shown impressive
capabilities on their specific tasks, they adapt poorly to new tasks.
In supervised learning, this adaptation problem is addressed by large-scale
pre-training followed by fine-tuning to new down-stream tasks. Recently,
pre-training on multiple tasks has been gaining traction in RL. However,
fine-tuning a pre-trained model often suffers from catastrophic forgetting.
That is, the performance on the pre-training tasks deteriorates when
fine-tuning on new tasks. To investigate the catastrophic forgetting
phenomenon, we first jointly pre-train a model on datasets from two benchmark
suites, namely Meta-World and DMControl. Then, we evaluate and compare a
variety of fine-tuning methods prevalent in natural language processing, both
in terms of performance on new tasks, and how well performance on pre-training
tasks is retained. Our study shows that with most fine-tuning approaches, the
performance on pre-training tasks deteriorates significantly. Therefore, we
propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation
of learned skills by modulating the information flow of the frozen pre-trained
model via a learnable modulation pool. Our method achieves state-of-the-art
performance on the Continual-World benchmark, while retaining performance on
the pre-training tasks. Finally, to aid future research in this area, we
release a dataset encompassing 50 Meta-World and 16 DMControl tasks.
Comment: 10 pages (+ references and appendix), Code: https://github.com/ml-jku/L2
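To make the core idea concrete, here is a hedged sketch of a learnable modulation pool applied to a frozen backbone. The pool lookup by cosine similarity and the element-wise rescaling of activations are assumptions chosen for illustration; the paper's exact L2M design may differ.

```python
# Hedged sketch of modulating a frozen model via a learnable modulation pool.
# The lookup and scaling scheme here is an assumption, not the paper's design.
import torch
import torch.nn as nn

class ModulationPool(nn.Module):
    """A small pool of learnable modulation vectors. A query (e.g. a task or
    state embedding) selects pool entries by cosine similarity, and the
    selected vector rescales the frozen layer's activations."""
    def __init__(self, pool_size: int, dim: int, top_k: int = 1):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.values = nn.Parameter(torch.ones(pool_size, dim))
        self.top_k = top_k

    def forward(self, query: torch.Tensor, activations: torch.Tensor):
        # Similarity between each query and each pool key: (batch, pool_size).
        sim = nn.functional.cosine_similarity(
            query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = sim.topk(self.top_k, dim=-1).indices      # (batch, top_k)
        gain = self.values[idx].mean(dim=1)             # (batch, dim)
        return activations * gain                       # modulated output

# Usage: the backbone stays frozen; only the pool parameters receive gradients,
# which is what protects the pre-trained skills from being overwritten.
backbone = nn.Linear(32, 32)
for p in backbone.parameters():
    p.requires_grad_(False)
pool = ModulationPool(pool_size=8, dim=32)
x = torch.randn(4, 32)
out = pool(query=x, activations=backbone(x))
```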
A Dataset Perspective on Offline Reinforcement Learning
The application of Reinforcement Learning (RL) in real-world environments can
be expensive or risky due to sub-optimal policies during training. In Offline
RL, this problem is avoided since interactions with an environment are
prohibited. Policies are learned from a given dataset, which solely determines
their performance. Despite this fact, how dataset characteristics influence
Offline RL algorithms has hardly been investigated. The dataset characteristics
are determined by the behavioral policy that samples this dataset. Therefore,
we define characteristics of behavioral policies as exploratory for yielding
high expected information in their interaction with the Markov Decision Process
(MDP) and as exploitative for having high expected return. We implement two
corresponding empirical measures for the datasets sampled by the behavioral
policy in deterministic MDPs. The first empirical measure, SACo, is defined by
the normalized number of unique state-action pairs and captures exploration. The
second empirical measure, TQ, is defined by the normalized average trajectory return and
captures exploitation. Empirical evaluations show the effectiveness of TQ and
SACo. In large-scale experiments using our proposed measures, we show that the
unconstrained off-policy Deep Q-Network family requires datasets with high SACo
to find a good policy. Furthermore, experiments show that policy constraint
algorithms perform well on datasets with high TQ and SACo. Finally, the
experiments show that purely dataset-constrained Behavioral Cloning performs
competitively with the best Offline RL algorithms for datasets with high TQ.
Comment: Code: https://github.com/ml-jku/OfflineR
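Both measures are straightforward to compute for a dataset of trajectories with discrete states and actions. The sketch below follows the abstract's definitions, normalized unique state-action pairs for SACo and normalized average trajectory return for TQ; the particular normalizers used here (total transition count, a reference best return) are illustrative assumptions, as the paper's exact normalization may differ.

```python
# Illustrative computation of the two dataset measures; normalizers are
# assumptions following the abstract, not necessarily the paper's exact choice.
from typing import Hashable, List, Tuple

Transition = Tuple[Hashable, Hashable, float]  # (state, action, reward)

def tq(trajectories: List[List[Transition]], best_return: float) -> float:
    """Trajectory Quality: average trajectory return, normalized by a
    reference best return (captures exploitation)."""
    returns = [sum(r for _, _, r in traj) for traj in trajectories]
    return (sum(returns) / len(returns)) / best_return

def saco(trajectories: List[List[Transition]]) -> float:
    """State-Action Coverage: number of unique state-action pairs, normalized
    by the total number of transitions (captures exploration)."""
    pairs = {(s, a) for traj in trajectories for s, a, _ in traj}
    total = sum(len(traj) for traj in trajectories)
    return len(pairs) / total
```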
Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
Reinforcement Learning algorithms require a large number of samples to solve
complex tasks with sparse and delayed rewards. Complex tasks can often be
hierarchically decomposed into sub-tasks. A step in the Q-function can be
associated with solving a sub-task, where the expectation of the return
increases. RUDDER has been introduced to identify these steps and then
redistribute reward to them, thus immediately giving reward if sub-tasks are
solved. Since the problem of delayed rewards is mitigated, learning is
considerably sped up. However, for complex tasks, current exploration
strategies as deployed in RUDDER struggle with discovering episodes with high
rewards. Therefore, we assume that episodes with high rewards are given as
demonstrations and do not have to be discovered by exploration. Typically, the
number of demonstrations is small, and RUDDER's LSTM model, as a deep learning
method, does not learn well from so few examples. Hence, we introduce
Align-RUDDER, which is RUDDER
with two major modifications. First, Align-RUDDER assumes that episodes with
high rewards are given as demonstrations, replacing RUDDER's safe exploration
and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile
model that is obtained from multiple sequence alignment of demonstrations.
Profile models, as known from bioinformatics, can be constructed from as few as
two demonstrations. Align-RUDDER inherits the concept of reward
redistribution, which considerably reduces the delay of rewards, thus speeding
up learning. Align-RUDDER outperforms competitors on complex artificial tasks
with delayed reward and few demonstrations. On the Minecraft ObtainDiamond
task, Align-RUDDER is able to mine a diamond, though not frequently. GitHub:
https://github.com/ml-jku/align-rudder, YouTube: https://youtu.be/HO-_8ZUl-U
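As a rough illustration of reward redistribution from aligned demonstrations, the toy sketch below replaces the paper's profile model from multiple sequence alignment with a simple consensus event sequence and difflib's SequenceMatcher as the alignment score. The redistribution rule, assigning each step the increase in alignment score it causes, mirrors the idea above, but every name and simplification here is an assumption rather than the Align-RUDDER implementation.

```python
# Toy sketch: consensus sequence + prefix alignment score as a stand-in for
# the profile model from multiple sequence alignment (assumption).
from difflib import SequenceMatcher
from typing import List

def alignment_score(events: List[str], consensus: List[str]) -> float:
    """Similarity of a (partial) episode's event sequence to the consensus
    sequence extracted from demonstrations."""
    return SequenceMatcher(None, events, consensus).ratio()

def redistribute_reward(episode_events: List[str],
                        consensus: List[str],
                        episode_return: float) -> List[float]:
    """Assign each step the increase in alignment score it causes, scaled so
    the redistributed rewards sum to the original episode return."""
    scores = [alignment_score(episode_events[: t + 1], consensus)
              for t in range(len(episode_events))]
    deltas = [scores[0]] + [scores[t] - scores[t - 1]
                            for t in range(1, len(scores))]
    total = sum(deltas) or 1.0  # guard against an all-zero score trace
    return [episode_return * d / total for d in deltas]

# Example: steps that complete a demonstrated sub-task receive reward
# immediately, instead of only at the end of the episode.
consensus = ["get_wood", "make_planks", "make_stick", "craft_pickaxe"]
episode = ["get_wood", "wander", "make_planks", "make_stick", "craft_pickaxe"]
print(redistribute_reward(episode, consensus, episode_return=1.0))
```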
Health Targets and Indicators as a Steering Instrument of Social Health Insurance: Commissioned by the Hauptverband der österreichischen Sozialversicherungsträger
From the table of contents: Introduction; Initial situation; Objectives; Procedures and criteria for the selection of health targets; International overview; Exemplary derivation of a catalog of targets for Austria; References; Country reports; Appendix
Health Policy Monitor; Pharmaceutical Price Policy - Follow-Up Report
There are ongoing efforts to contain cost growth in the pharmaceutical sector. This article reports on the latest developments in this area, especially in view of the introduction of a new system of pharmaceutical price policy in 2005. Additionally, this report evaluates the outcome of these measures by examining expenditure growth trends. It further summarizes recent developments in this area.