Semantic HELM: A Human-Readable Memory for Reinforcement Learning
Reinforcement learning agents deployed in the real world often have to cope
with partially observable environments. Therefore, most agents employ memory
mechanisms to approximate the state of the environment. Recently, there have
been impressive success stories in mastering partially observable environments,
mostly in the realm of computer games like Dota 2, StarCraft II, or Minecraft.
However, existing methods lack interpretability in the sense that it is not
comprehensible to humans what the agent stores in its memory. To address this,
we propose a novel memory mechanism that represents past events in human
language. Our method uses CLIP to associate visual inputs with language tokens.
Then we feed these tokens to a pretrained language model that serves the agent
as memory and provides it with a coherent and human-readable representation of
the past. We train our memory mechanism on a set of partially observable
environments and find that it excels on tasks that require a memory component,
while mostly attaining performance on-par with strong baselines on tasks that
do not. On a challenging continuous recognition task, where memorizing the past
is crucial, our memory mechanism converges two orders of magnitude faster than
prior methods. Since our memory mechanism is human-readable, we can peek at an
agent's memory and check whether crucial pieces of information have been
stored. This significantly enhances troubleshooting and paves the way toward
more interpretable agents.
Comment: To appear at NeurIPS 2023, 10 pages (+ references and appendix), Code: https://github.com/ml-jku/hel
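The two-stage pipeline described above lends itself to a short sketch. The following Python fragment illustrates the idea, associating an observation with language tokens via CLIP and summarizing the token history with a pretrained language model, using open_clip and Hugging Face transformers. The toy vocabulary, the model choices (ViT-B-32, GPT-2), and the function names are illustrative assumptions, not the authors' exact implementation; see their repository for that.

```python
# Illustrative sketch only: model choices, the toy vocabulary, and function
# names are assumptions, not the authors' implementation.
import torch
import open_clip
from transformers import AutoModel, AutoTokenizer

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
clip_tokenizer = open_clip.get_tokenizer("ViT-B-32")
lm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")

# Pre-compute CLIP text embeddings for a fixed vocabulary of candidate tokens.
vocab = ["key", "door", "ball", "box", "wall"]  # toy vocabulary (assumption)
with torch.no_grad():
    text_emb = clip_model.encode_text(clip_tokenizer(vocab))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def observation_to_tokens(image, k=2):
    """Associate a visual observation with its k nearest vocabulary tokens
    in CLIP's joint embedding space."""
    with torch.no_grad():
        img_emb = clip_model.encode_image(preprocess(image).unsqueeze(0))
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ text_emb.T).squeeze(0)
    return [vocab[i] for i in scores.topk(k).indices]

def memory_feature(token_history):
    """Feed the human-readable token history through the frozen LM; the last
    hidden state serves as the agent's memory representation of the past."""
    ids = lm_tokenizer(" ".join(token_history), return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids)
    return out.last_hidden_state[:, -1]
```

Because the memory is a plain list of words, one can print `token_history` at any step to inspect what the agent has stored.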
Learning to Modulate pre-trained Models in RL
Reinforcement Learning (RL) has been successful in various domains like
robotics, game playing, and simulation. While RL agents have shown impressive
capabilities on their specific tasks, they adapt poorly to new tasks.
In supervised learning, this adaptation problem is addressed by large-scale
pre-training followed by fine-tuning to new down-stream tasks. Recently,
pre-training on multiple tasks has been gaining traction in RL. However,
fine-tuning a pre-trained model often suffers from catastrophic forgetting.
That is, the performance on the pre-training tasks deteriorates when
fine-tuning on new tasks. To investigate the catastrophic forgetting
phenomenon, we first jointly pre-train a model on datasets from two benchmark
suites, namely Meta-World and DMControl. Then, we evaluate and compare a
variety of fine-tuning methods prevalent in natural language processing, both
in terms of performance on new tasks, and how well performance on pre-training
tasks is retained. Our study shows that with most fine-tuning approaches, the
performance on pre-training tasks deteriorates significantly. Therefore, we
propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation
of learned skills by modulating the information flow of the frozen pre-trained
model via a learnable modulation pool. Our method achieves state-of-the-art
performance on the Continual-World benchmark, while retaining performance on
the pre-training tasks. Finally, to aid future research in this area, we
release a dataset encompassing 50 Meta-World and 16 DMControl tasks.
Comment: 10 pages (+ references and appendix), Code: https://github.com/ml-jku/L2
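To make the core idea concrete, here is a hedged sketch of a learnable modulation pool applied to a frozen backbone. The pool lookup by cosine similarity and the element-wise rescaling of activations are assumptions chosen for illustration; the paper's exact L2M design may differ.

```python
# Hedged sketch of modulating a frozen model via a learnable modulation pool.
# The lookup and scaling scheme here is an assumption, not the paper's design.
import torch
import torch.nn as nn

class ModulationPool(nn.Module):
    """A small pool of learnable modulation vectors. A query (e.g. a task or
    state embedding) selects pool entries by cosine similarity, and the
    selected vector rescales the frozen layer's activations."""
    def __init__(self, pool_size: int, dim: int, top_k: int = 1):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.values = nn.Parameter(torch.ones(pool_size, dim))
        self.top_k = top_k

    def forward(self, query: torch.Tensor, activations: torch.Tensor):
        # Similarity between each query and each pool key: (batch, pool_size).
        sim = nn.functional.cosine_similarity(
            query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = sim.topk(self.top_k, dim=-1).indices      # (batch, top_k)
        gain = self.values[idx].mean(dim=1)             # (batch, dim)
        return activations * gain                       # modulated output

# Usage: the backbone stays frozen; only the pool parameters receive gradients,
# which is what protects the pre-trained skills from being overwritten.
backbone = nn.Linear(32, 32)
for p in backbone.parameters():
    p.requires_grad_(False)
pool = ModulationPool(pool_size=8, dim=32)
x = torch.randn(4, 32)
out = pool(query=x, activations=backbone(x))
```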
A Dataset Perspective on Offline Reinforcement Learning
The application of Reinforcement Learning (RL) in real-world environments can
be expensive or risky due to sub-optimal policies during training. In Offline
RL, this problem is avoided since interactions with an environment are
prohibited. Policies are learned from a given dataset, which solely determines
their performance. Despite this fact, how dataset characteristics influence
Offline RL algorithms has hardly been investigated. The dataset characteristics
are determined by the behavioral policy that samples this dataset. Therefore,
we define characteristics of behavioral policies as exploratory for yielding
high expected information in their interaction with the Markov Decision Process
(MDP) and as exploitative for having high expected return. We implement two
corresponding empirical measures for the datasets sampled by the behavioral
policy in deterministic MDPs. The first empirical measure, SACo, is defined by
the normalized number of unique state-action pairs and captures exploration. The
second empirical measure, TQ, is defined by the normalized average trajectory return and
captures exploitation. Empirical evaluations show the effectiveness of TQ and
SACo. In large-scale experiments using our proposed measures, we show that the
unconstrained off-policy Deep Q-Network family requires datasets with high SACo
to find a good policy. Furthermore, experiments show that policy constraint
algorithms perform well on datasets with high TQ and SACo. Finally, the
experiments show that purely dataset-constrained Behavioral Cloning performs
competitively with the best Offline RL algorithms for datasets with high TQ.
Comment: Code: https://github.com/ml-jku/OfflineR
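Both measures are straightforward to compute for a dataset of trajectories with discrete states and actions. The sketch below follows the abstract's definitions, normalized unique state-action pairs for SACo and normalized average trajectory return for TQ; the particular normalizers used here (total transition count, a reference best return) are illustrative assumptions, as the paper's exact normalization may differ.

```python
# Illustrative computation of the two dataset measures; normalizers are
# assumptions following the abstract, not necessarily the paper's exact choice.
from typing import Hashable, List, Tuple

Transition = Tuple[Hashable, Hashable, float]  # (state, action, reward)

def tq(trajectories: List[List[Transition]], best_return: float) -> float:
    """Trajectory Quality: average trajectory return, normalized by a
    reference best return (captures exploitation)."""
    returns = [sum(r for _, _, r in traj) for traj in trajectories]
    return (sum(returns) / len(returns)) / best_return

def saco(trajectories: List[List[Transition]]) -> float:
    """State-Action Coverage: number of unique state-action pairs, normalized
    by the total number of transitions (captures exploration)."""
    pairs = {(s, a) for traj in trajectories for s, a, _ in traj}
    total = sum(len(traj) for traj in trajectories)
    return len(pairs) / total
```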
Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
Reinforcement Learning algorithms require a large number of samples to solve
complex tasks with sparse and delayed rewards. Complex tasks can often be
hierarchically decomposed into sub-tasks. A step in the Q-function can be
associated with solving a sub-task, where the expectation of the return
increases. RUDDER has been introduced to identify these steps and then
redistribute reward to them, thus immediately giving reward if sub-tasks are
solved. Since the problem of delayed rewards is mitigated, learning is
considerably sped up. However, for complex tasks, current exploration
strategies as deployed in RUDDER struggle with discovering episodes with high
rewards. Therefore, we assume that episodes with high rewards are given as
demonstrations and do not have to be discovered by exploration. Typically, the
number of demonstrations is small, and RUDDER's LSTM model, as a deep learning
method, does not learn well from so few examples. Hence, we introduce
Align-RUDDER, which is RUDDER
with two major modifications. First, Align-RUDDER assumes that episodes with
high rewards are given as demonstrations, replacing RUDDER's safe exploration
and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile
model that is obtained from multiple sequence alignment of demonstrations.
Profile models, as known from bioinformatics, can be constructed from as few as
two demonstrations. Align-RUDDER inherits the concept of reward
redistribution, which considerably reduces the delay of rewards, thus speeding
up learning. Align-RUDDER outperforms competitors on complex artificial tasks
with delayed reward and few demonstrations. On the Minecraft ObtainDiamond
task, Align-RUDDER is able to mine a diamond, though not frequently. GitHub:
https://github.com/ml-jku/align-rudder, YouTube: https://youtu.be/HO-_8ZUl-U
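As a rough illustration of reward redistribution from aligned demonstrations, the toy sketch below replaces the paper's profile model from multiple sequence alignment with a simple consensus event sequence and difflib's SequenceMatcher as the alignment score. The redistribution rule, assigning each step the increase in alignment score it causes, mirrors the idea above, but every name and simplification here is an assumption rather than the Align-RUDDER implementation.

```python
# Toy sketch: consensus sequence + prefix alignment score as a stand-in for
# the profile model from multiple sequence alignment (assumption).
from difflib import SequenceMatcher
from typing import List

def alignment_score(events: List[str], consensus: List[str]) -> float:
    """Similarity of a (partial) episode's event sequence to the consensus
    sequence extracted from demonstrations."""
    return SequenceMatcher(None, events, consensus).ratio()

def redistribute_reward(episode_events: List[str],
                        consensus: List[str],
                        episode_return: float) -> List[float]:
    """Assign each step the increase in alignment score it causes, scaled so
    the redistributed rewards sum to the original episode return."""
    scores = [alignment_score(episode_events[: t + 1], consensus)
              for t in range(len(episode_events))]
    deltas = [scores[0]] + [scores[t] - scores[t - 1]
                            for t in range(1, len(scores))]
    total = sum(deltas) or 1.0  # guard against an all-zero score trace
    return [episode_return * d / total for d in deltas]

# Example: steps that complete a demonstrated sub-task receive reward
# immediately, instead of only at the end of the episode.
consensus = ["get_wood", "make_planks", "make_stick", "craft_pickaxe"]
episode = ["get_wood", "wander", "make_planks", "make_stick", "craft_pickaxe"]
print(redistribute_reward(episode, consensus, episode_return=1.0))
```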
Health Targets and Indicators as a Steering Instrument of Social Health Insurance: Commissioned by the Hauptverband der österreichischen Sozialversicherungsträger
From the table of contents: Introduction; Initial situation; Objectives; Procedures and criteria for the selection of health targets; International overview; Exemplary derivation of a catalog of targets for Austria; References; Country reports; Appendix
Health Policy Monitor; Pharmaceutical Price Policy - Follow-Up Report
There are ongoing efforts to contain cost growth in the pharmaceutical sector. This article reports on the latest developments in this area, especially in view of the introduction of a new system of pharmaceutical price policy in 2005. Additionally, this report evaluates the outcome of these measures by examining expenditure growth trends. It further summarizes recent developments in this area.