Fast deep reinforcement learning using online adjustments from the past
We propose Ephemeral Value Adjustments (EVA): a means of allowing deep
reinforcement learning agents to rapidly adapt to experience in their replay
buffer. EVA shifts the value predicted by a neural network with an estimate of
the value function found by planning over experience tuples from the replay
buffer near the current state. EVA combines a number of recent ideas around
combining episodic memory-like structures into reinforcement learning agents:
slot-based storage, content-based retrieval, and memory-based planning. We show
that EVA is performant on a demonstration task and Atari games.
Comment: Accepted at NIPS 2018
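A minimal sketch of the value-mixing idea behind EVA, assuming a simple n-step backup over retrieved trajectories in place of the paper's full trajectory-centric planning; the parameter names (lam, k), the replay format, and the retrieval-by-embedding-distance step are illustrative assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch: shift a parametric Q-estimate toward a non-parametric
# estimate computed from replayed trajectories near the current state.
import numpy as np

def eva_value(q_theta, replay, state_embedding, state, lam=0.4, k=5, gamma=0.99):
    """q_theta: callable state -> array of Q-values (the neural network)
    replay:  list of stored trajectories, each a list of
             (embedding, action, reward, next_state) tuples
    state_embedding: embedding of the current state, used for retrieval
    """
    q_param = q_theta(state)

    # Content-based retrieval: the k stored transitions whose embeddings are
    # closest to the current state's embedding.
    flat = [(traj, t) for traj in replay for t in range(len(traj))]
    dists = [np.linalg.norm(traj[t][0] - state_embedding) for traj, t in flat]
    nearest = np.argsort(dists)[:k]

    # Memory-based planning (simplified): n-step returns along each retrieved
    # trajectory, bootstrapped with the parametric Q at the final state.
    q_np = np.zeros_like(q_param)
    counts = np.zeros_like(q_param)
    for idx in nearest:
        traj, t = flat[idx]
        ret, discount, s_last = 0.0, 1.0, state
        for (_, a, r, s_next) in traj[t:]:
            ret += discount * r
            discount *= gamma
            s_last = s_next
        ret += discount * np.max(q_theta(s_last))
        first_action = traj[t][1]
        q_np[first_action] += ret
        counts[first_action] += 1

    # Where no neighbour supplies an estimate, fall back to the parametric value.
    visited = counts > 0
    q_np = np.where(visited, q_np / np.maximum(counts, 1), q_param)

    # Ephemeral adjustment: mix the parametric and non-parametric estimates.
    return lam * q_param + (1.0 - lam) * q_np
```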
An Empirical Study of the Effectiveness of Using a Replay Buffer on Mode Discovery in GFlowNets
Reinforcement Learning (RL) algorithms aim to learn an optimal policy by
iteratively sampling actions to learn how to maximize the total expected
return, $R(x)$. GFlowNets are a special class of algorithms designed to
generate diverse candidates, $x$, from a discrete set, by learning a policy
that approximates the proportional sampling of $R(x)$. GFlowNets exhibit
improved mode discovery compared to conventional RL algorithms, which is very
useful for applications such as drug discovery and combinatorial search.
However, since GFlowNets are a relatively recent class of algorithms, many
techniques that are useful in RL have not yet been adapted to them. In
this paper, we study the use of a replay buffer for GFlowNets. We
explore empirically various replay buffer sampling techniques and assess the
impact on the speed of mode discovery and the quality of the modes discovered.
Our experimental results in the Hypergrid toy domain and a molecule synthesis
environment demonstrate significant improvements in mode discovery when
training with a replay buffer, compared to training only with trajectories
generated on-policy.
Comment: Accepted to ICML 2023 workshop on Structured Probabilistic Inference
& Generative Modeling
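For illustration, a minimal sketch of how a trajectory replay buffer might be mixed into GFlowNet training; the uniform and reward-prioritized sampling strategies shown here are generic examples, not necessarily the exact variants evaluated in the paper, and all names are assumptions.

```python
# Illustrative sketch: a trajectory replay buffer with two sampling strategies,
# and a batch builder that mixes on-policy and replayed trajectories.
import random
import numpy as np

class TrajectoryReplayBuffer:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.items = []                      # each entry: (trajectory, terminal_reward)

    def add(self, trajectory, reward):
        if len(self.items) >= self.capacity:
            self.items.pop(0)                # FIFO eviction once the buffer is full
        self.items.append((trajectory, reward))

    def sample(self, n, strategy="uniform"):
        n = min(n, len(self.items))
        if n <= 0:
            return []
        if strategy == "uniform":
            return random.sample(self.items, n)
        if strategy == "reward_prioritized":
            # Sample terminal objects in proportion to their (non-negative) reward.
            rewards = np.array([r for _, r in self.items], dtype=float) + 1e-8
            probs = rewards / rewards.sum()
            idx = np.random.choice(len(self.items), size=n, replace=False, p=probs)
            return [self.items[i] for i in idx]
        raise ValueError(f"unknown strategy: {strategy}")

def make_training_batch(sample_on_policy, buffer, batch_size=16, replay_fraction=0.5):
    # Part of each training batch is sampled on-policy from the current
    # GFlowNet and added to the buffer; the rest is replayed from memory.
    n_replay = int(batch_size * replay_fraction)
    on_policy = [sample_on_policy() for _ in range(batch_size - n_replay)]
    for trajectory, reward in on_policy:
        buffer.add(trajectory, reward)
    return on_policy + buffer.sample(n_replay, strategy="reward_prioritized")
```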
Offline Experience Replay for Continual Offline Reinforcement Learning
The ability to continually learn new skills from a sequence of pre-collected
offline datasets is desirable for an agent. However, learning a sequence of
offline tasks consecutively is likely to cause catastrophic forgetting under
resource-limited scenarios. In this paper, we formulate a new setting,
continual offline reinforcement learning (CORL), in which an agent learns a
sequence of offline reinforcement learning tasks and pursues good performance
on all learned tasks with a small replay buffer, without exploring the
environment of any of the sequential tasks. To learn consistently across all
sequential tasks, an agent must acquire new knowledge while preserving old
knowledge, entirely offline. To this end, we investigate continual learning
algorithms and experimentally find experience replay (ER) to be the most
suitable for the CORL problem. However, we
observe that introducing ER into CORL encounters a new distribution shift
problem: the mismatch between the experiences in the replay buffer and
trajectories from the learned policy. To address such an issue, we propose a
new model-based experience selection (MBES) scheme to build the replay buffer,
where a transition model is learned to approximate the state distribution. The
model is used to bridge the distribution gap between the replay buffer and the
learned policy by selecting, from the offline data, the transitions that most
closely resemble those the learned policy would generate, and storing them in
the buffer. Moreover, to enhance the ability to learn new tasks, we retrofit
the experience replay method with a new dual behavior cloning (DBC)
architecture to avoid the behavior-cloning loss disturbing the Q-learning
process. Overall, we call our algorithm offline experience replay (OER).
Extensive experiments demonstrate that OER outperforms state-of-the-art
baselines in the widely used MuJoCo environments.
Comment: 9 pages, 4 figures
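A schematic sketch of the model-based experience selection idea, assuming a simple discrepancy score between the dataset's next state and the next state a learned transition model predicts under the current policy; the scoring rule and the function signatures are illustrative, not the paper's exact criterion.

```python
# Schematic sketch: keep the offline transitions whose outcomes best match what
# the current learned policy would produce, as judged by a learned transition
# model. The scoring rule below is an illustrative choice.
import numpy as np

def select_for_replay(offline_transitions, policy, transition_model, buffer_size):
    """offline_transitions: list of (state, action, reward, next_state) arrays
    policy:           callable state -> action (the policy learned so far)
    transition_model: callable (state, action) -> predicted next state
    Returns the `buffer_size` transitions closest to the policy's own behaviour.
    """
    scores = []
    for (s, a, r, s_next) in offline_transitions:
        a_pi = policy(s)
        # Where would the learned policy have landed from this state,
        # according to the learned transition model?
        s_pred = transition_model(s, a_pi)
        # Small discrepancy -> the transition looks like on-policy experience.
        scores.append(-np.linalg.norm(s_pred - s_next))
    keep = np.argsort(scores)[-buffer_size:]
    return [offline_transitions[i] for i in keep]
```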
Dynamic Weights in Multi-Objective Deep Reinforcement Learning
Many real-world decision problems are characterized by multiple conflicting
objectives which must be balanced based on their relative importance. In the
dynamic weights setting the relative importance changes over time and
specialized algorithms that deal with such change, such as a tabular
Reinforcement Learning (RL) algorithm by Natarajan and Tadepalli (2005), are
required. However, this earlier work is not feasible for RL settings that
necessitate the use of function approximators. We generalize across weight
changes and high-dimensional inputs by proposing a multi-objective Q-network
whose outputs are conditioned on the relative importance of objectives, and we
introduce Diverse Experience Replay (DER) to counter the inherent
non-stationarity of the Dynamic Weights setting. We perform an extensive
experimental evaluation and compare our methods to adapted algorithms from Deep
Multi-Task/Multi-Objective Reinforcement Learning and show that our proposed
network in combination with DER dominates these adapted algorithms across
weight-change scenarios and problem domains.
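A minimal sketch of a weight-conditioned multi-objective Q-network of the kind described above: the importance weights are fed in alongside the state, and the per-objective Q-values are scalarised with the same weights. Layer sizes and the concatenation scheme are illustrative assumptions.

```python
# Illustrative sketch: a Q-network conditioned on the objective-importance
# vector w, outputting one Q-value per (action, objective) pair.
import torch
import torch.nn as nn

class ConditionedMOQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, n_objectives, hidden=128):
        super().__init__()
        self.n_actions = n_actions
        self.n_objectives = n_objectives
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, state, weights):
        # state: (batch, state_dim); weights: (batch, n_objectives), summing to 1.
        x = torch.cat([state, weights], dim=-1)
        q = self.net(x).view(-1, self.n_actions, self.n_objectives)
        # Scalarised Q-values under the current weights: (batch, n_actions).
        q_scalar = (q * weights.unsqueeze(1)).sum(dim=-1)
        return q, q_scalar

# Greedy action selection for the current weight vector:
# q, q_scalar = net(state, weights); action = q_scalar.argmax(dim=-1)
```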
The Role of Diverse Replay for Generalisation in Reinforcement Learning
In reinforcement learning (RL), key components of many algorithms are the
exploration strategy and the replay buffer. These components regulate what
environment data is collected and trained on, and have been extensively studied
in the RL literature. In this paper, we investigate the impact of these
components in the context of generalisation in multi-task RL. We investigate
the hypothesis that collecting and training on more diverse data from the
training environment will improve zero-shot generalisation to new
environments/tasks. We motivate mathematically and show empirically that
generalisation to states that are "reachable" during training is improved by
increasing the diversity of transitions in the replay buffer. Furthermore, we
show empirically that the same strategy also improves generalisation to
similar but "unreachable" states, which may be due to improved generalisation
of latent representations.
Comment: 14 pages, 8 figures
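One simple way to increase the diversity of replayed transitions, in the spirit of this study: bucket transitions by a coarse discretisation of the state and sample uniformly across buckets; the bucketing scheme below is an illustrative assumption, not the paper's mechanism.

```python
# Illustrative sketch: a replay buffer that samples uniformly over coarse state
# buckets, so rarely visited regions are over-represented relative to plain
# uniform sampling and each batch covers more of the state space.
import random
from collections import defaultdict
import numpy as np

class DiversityReplayBuffer:
    def __init__(self, bucket_width=0.5):
        self.bucket_width = bucket_width
        self.buckets = defaultdict(list)     # coarse state key -> transitions

    def _key(self, state):
        return tuple(np.floor(np.asarray(state) / self.bucket_width).astype(int))

    def add(self, state, action, reward, next_state, done):
        self.buckets[self._key(state)].append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        if not self.buckets:
            return []
        keys = list(self.buckets.keys())
        batch = []
        for _ in range(batch_size):
            # Uniform over buckets first, then uniform within the chosen bucket.
            bucket = self.buckets[random.choice(keys)]
            batch.append(random.choice(bucket))
        return batch
```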