Bigger, Better, Faster: Human-level Atari with human-level efficiency
We introduce a value-based RL agent, which we call BBF, that achieves
super-human performance in the Atari 100K benchmark. BBF relies on scaling the
neural networks used for value estimation, as well as a number of other design
choices that enable this scaling in a sample-efficient manner. We conduct
extensive analyses of these design choices and provide insights for future
work. We end with a discussion about updating the goalposts for
sample-efficient RL research on the ALE. We make our code and data publicly
available at
https://github.com/google-research/google-research/tree/master/bigger_better_faster.
Comment: ICML 2023 Camera Ready
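As a rough illustration of what "scaling the neural networks used for value estimation" can mean in practice, here is a minimal sketch of a width-scaled convolutional Q-network in PyTorch. The architecture and the `width_scale` factor are illustrative placeholders, not BBF's actual design.

```python
# Minimal sketch of width-scaling a convolutional Q-network (illustrative,
# not the actual BBF architecture). `width_scale` multiplies the number of
# channels/units in every layer, growing the value network's capacity.
import torch
import torch.nn as nn

class ScaledQNetwork(nn.Module):
    def __init__(self, num_actions: int, width_scale: int = 4):
        super().__init__()
        c1, c2, c3 = 32 * width_scale, 64 * width_scale, 64 * width_scale
        self.encoder = nn.Sequential(
            nn.Conv2d(4, c1, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(c1, c2, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(c2, c3, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(c3 * 7 * 7, 512 * width_scale), nn.ReLU(),
            nn.Linear(512 * width_scale, num_actions),
        )

    def forward(self, frames):
        # frames: (batch, 4, 84, 84) stacked grayscale frames in [0, 1]
        return self.head(self.encoder(frames))

q_net = ScaledQNetwork(num_actions=18, width_scale=4)
# Parameter count grows with width_scale, which is the "bigger" axis.
print(sum(p.numel() for p in q_net.parameters()))
```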
Small batch deep reinforcement learning
In value-based deep reinforcement learning with replay memories, the batch
size parameter specifies how many transitions to sample for each gradient
update. Although critical to the learning process, this value is typically not
adjusted when proposing new algorithms. In this work we present a broad
empirical study that suggests reducing the batch size can result in a
number of significant performance gains; this is surprising, as the general
tendency when training neural networks is towards larger batch sizes for
improved performance. We complement our experimental findings with a set of
empirical analyses towards better understanding this phenomenon.
Comment: Published at NeurIPS 2023
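For context, the sketch below shows where the batch size parameter enters a standard DQN-style update with a replay memory: it is simply the number of transitions sampled per gradient step. This is a generic, minimal example (class and function names are placeholders), not the implementation studied in the paper.

```python
# Minimal sketch of a replay-based value update: `batch_size` is the number of
# transitions sampled from the replay memory for each gradient step.
import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        # transition = (obs, action, reward, next_obs, done), with obs as tensors
        self.storage.append(transition)

    def sample(self, batch_size):
        batch = random.sample(list(self.storage), batch_size)
        obs, act, rew, next_obs, done = zip(*batch)
        return (torch.stack(obs), torch.tensor(act),
                torch.tensor(rew, dtype=torch.float32),
                torch.stack(next_obs), torch.tensor(done, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    # Sample `batch_size` transitions and take one TD gradient step on them.
    obs, act, rew, next_obs, done = buffer.sample(batch_size)
    q = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * target_net(next_obs).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```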
Multi-Agent Deep Reinforcement Learning for Walkers
This project was motivated by the search for an AI method that moves towards Artificial General Intelligence (AGI), that is, one more similar to the learning behavior of human beings. As of today, Deep Reinforcement Learning (DRL) is the closest to AGI among machine learning methods. To better understand DRL, we compare and contrast it with related methods: Deep Learning, Dynamic Programming and Game Theory.
We apply one of the state-of-the-art DRL algorithms, Proximal Policy Optimization (PPO), to robot walker locomotion, a simple yet challenging environment with an inherently continuous and high-dimensional state/action space.
The end goal of this project is to train the agents by finding the optimal sequential actions (policy/strategy) that lead the multiple walkers to move forward as far as possible, maximizing the accumulated reward (performance). This goal is accomplished by tuning the hyperparameters of the PPO algorithm while monitoring performance in the multi-agent DRL (MADRL) setting.
In the end, we draw three conclusions from our findings across the various MADRL experiments: 1) Unlike DL with explicit target labels, DRL needs a larger minibatch size for a better estimate of values from the various gradients; the minibatch size and its pool size (the experience replay buffer) are therefore critical hyperparameters in the PPO algorithm. 2) For homogeneous multi-agent environments, there is a mutual transferability between single-agent and multi-agent environments, so the tuned hyperparameters can be reused. 3) For homogeneous multi-agent environments with a well-tuned hyperparameter set, parameter sharing is the better strategy for MADRL in terms of performance and efficiency, with fewer parameters and less memory (sketched after this abstract).
To conclude, reward-driven, sequential, and evaluative learning, i.e. DRL, would be closer to AGI if multiple DRL agents learned to collaborate to capture the true signal from the shared environment. This work provides one instance of implicit cooperative learning in MADRL.
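To make the parameter-sharing conclusion concrete, here is a minimal sketch of a shared Gaussian policy for homogeneous agents: every walker acts through one network, so experience from all agents updates a single parameter set. The network shape and dimensions are illustrative, not the project's actual configuration.

```python
# Minimal sketch of parameter sharing for homogeneous agents: all walkers use
# one shared policy network, and their trajectories are pooled into a single
# PPO-style batch. Sizes below are placeholders.
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        self.mean = nn.Linear(64, act_dim)            # mean of a Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean(self.body(obs))
        return torch.distributions.Normal(mean, self.log_std.exp())

num_agents, obs_dim, act_dim = 3, 31, 4
policy = SharedPolicy(obs_dim, act_dim)               # one parameter set for all walkers

# Each agent applies the same network to its own observation; actions differ
# because observations differ, but gradients from every agent flow into the
# same shared parameters (fewer parameters, less memory).
observations = torch.randn(num_agents, obs_dim)
dist = policy(observations)
actions = dist.sample()                               # shape: (num_agents, act_dim)
log_probs = dist.log_prob(actions).sum(-1)            # pooled into one PPO batch
```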
Elastic Step DQN: A novel multi-step algorithm to alleviate overestimation in Deep Q-Networks
The Deep Q-Networks (DQN) algorithm was the first reinforcement learning
algorithm to use deep neural networks to successfully surpass human-level
performance in a number of Atari learning environments. However, divergent and
unstable behaviour has been a long-standing issue in DQNs. The unstable
behaviour is often characterised by overestimation in the Q-values, commonly
referred to as the overestimation bias. To address the overestimation bias and
the divergent behaviour, a number of heuristic extensions have been proposed.
Notably, multi-step updates have been shown to drastically reduce unstable
behaviour while improving the agent's training performance. However, agents
are often highly sensitive to the selection of the multi-step update horizon
(n), and our empirical experiments show that a poorly chosen static value for
n can in many cases lead to worse performance than single-step DQN. Inspired
by the success of n-step DQN and the effects that multi-step updates have on
overestimation bias, this paper proposes a new algorithm that we call `Elastic
Step DQN' (ES-DQN). It dynamically varies the step size horizon in multi-step
updates based on the similarity of states visited. Our empirical evaluation
shows that ES-DQN out-performs n-step DQN with a fixed update horizon, Double
DQN and Average DQN in several OpenAI Gym environments while at the same time
alleviating the overestimation bias.
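For readers unfamiliar with multi-step targets, the sketch below shows a standard n-step TD target together with a toy rule that grows the horizon only while successive states remain similar. The similarity rule here is purely illustrative and is not the state-similarity mechanism used by ES-DQN.

```python
# Minimal sketch of an n-step TD target plus a toy "elastic" horizon rule.
# The cosine-similarity threshold is an illustrative stand-in for a real
# state-similarity mechanism.
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Standard n-step return: discounted rewards plus a bootstrapped value."""
    target = bootstrap_value * gamma ** len(rewards)
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target

def elastic_horizon(states, max_n=10, threshold=0.9):
    """Grow the horizon while consecutive states stay similar (toy rule)."""
    n = 1
    for s, s_next in zip(states, states[1:]):
        sim = np.dot(s, s_next) / (np.linalg.norm(s) * np.linalg.norm(s_next) + 1e-8)
        if sim < threshold or n >= max_n:
            break
        n += 1
    return n

# Usage: choose n from the trajectory, then build the target from the first
# n rewards and a bootstrapped value estimate for the state n steps ahead.
states = [np.random.rand(8) for _ in range(12)]
rewards = [0.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
n = elastic_horizon(states)
target = n_step_target(rewards[:n], bootstrap_value=2.5)
```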
IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse
Humans have the ability to reuse previously learned policies to solve new
tasks quickly, and reinforcement learning (RL) agents can do the same by
transferring knowledge from source policies to a related target task. Transfer
RL methods can reshape the policy optimization objective (optimization
transfer) or influence the behavior policy (behavior transfer) using source
policies. However, selecting the appropriate source policy with limited samples
to guide target policy learning has been a challenge. Previous methods
introduce additional components, such as hierarchical policies or estimations
of source policies' value functions, which can lead to non-stationary policy
optimization or heavy sampling costs, diminishing transfer effectiveness. To
address this challenge, we propose a novel transfer RL method that selects the
source policy without training extra components. Our method utilizes the Q
function in the actor-critic framework to guide policy selection, choosing the
source policy with the largest one-step improvement over the current target
policy. We integrate optimization transfer and behavior transfer (IOB) by
regularizing the learned policy to mimic the guidance policy and combining them
as the behavior policy. This integration significantly enhances transfer
effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark
tasks, and improves final performance and knowledge transferability in
continual learning scenarios. Additionally, we show that our optimization
transfer technique is guaranteed to improve target policy learning.
Comment: 26 pages, 9 figures
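The policy-selection idea from the abstract can be sketched very compactly: at a given state, query the current critic with each source policy's proposed action and keep the policy whose action the critic rates highest relative to the current target policy. The snippet below is an assumed, minimal rendering of that rule with placeholder names, not the authors' implementation.

```python
# Minimal sketch of Q-guided source-policy selection: pick the policy with the
# largest one-step improvement over the current target policy, as judged by
# the current critic. All names here are illustrative placeholders.
import torch

def select_guidance_policy(state, target_policy, source_policies, critic):
    """Return the policy (target or source) whose action the critic rates highest at `state`."""
    best_policy = target_policy
    best_q = critic(state, target_policy(state))
    for policy in source_policies:
        q = critic(state, policy(state))
        if q > best_q:  # one-step improvement over the current target policy
            best_policy, best_q = policy, q
    return best_policy

# Toy usage with stand-in callables: policies map states to actions,
# the critic maps (state, action) to a scalar Q estimate.
state = torch.randn(4)
target_policy = lambda s: torch.tanh(s)
source_policies = [lambda s: torch.zeros_like(s), lambda s: -torch.tanh(s)]
critic = lambda s, a: -(s - a).pow(2).sum()
guidance = select_guidance_policy(state, target_policy, source_policies, critic)
```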
Understanding when Dynamics-Invariant Data Augmentations Benefit Model-Free Reinforcement Learning Updates
Recently, data augmentation (DA) has emerged as a method for leveraging
domain knowledge to inexpensively generate additional data in reinforcement
learning (RL) tasks, often yielding substantial improvements in data
efficiency. While prior work has demonstrated the utility of incorporating
augmented data directly into model-free RL updates, it is not well-understood
when a particular DA strategy will improve data efficiency. In this paper, we
seek to identify general aspects of DA responsible for observed learning
improvements. Our study focuses on sparse-reward tasks with dynamics-invariant
data augmentation functions, serving as an initial step towards a more general
understanding of DA and its integration into RL training. Experimentally, we
isolate three relevant aspects of DA: state-action coverage, reward density,
and the number of augmented transitions generated per update (the augmented
replay ratio). From our experiments, we draw two conclusions: (1) increasing
state-action coverage often has a much greater impact on data efficiency than
increasing reward density, and (2) decreasing the augmented replay ratio
substantially improves data efficiency. In fact, certain tasks in our empirical
study are solvable only when the replay ratio is sufficiently low.
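A minimal sketch of the augmented replay ratio, assuming a placeholder dynamics-invariant augmentation function: the ratio controls how many augmented transitions are generated and stored per update, and lowering it shrinks the share of augmented data in each sampled batch.

```python
# Minimal sketch of controlling the augmented replay ratio: the number of
# augmented transitions generated per update. `augment` is a stand-in for any
# dynamics-invariant transformation (e.g. a symmetry of the task).
import random

def augment(transition, num_copies):
    """Placeholder augmentation: produce `num_copies` dynamics-invariant variants."""
    obs, act, rew, next_obs, done = transition
    return [(obs, act, rew, next_obs, done) for _ in range(num_copies)]

def update_step(buffer, new_transition, augmented_per_update=1, batch_size=32):
    """One training step: store the real transition, add `augmented_per_update`
    augmented transitions, then sample a batch for the (omitted) gradient update."""
    buffer.append(new_transition)
    for aug in augment(new_transition, augmented_per_update):
        buffer.append(aug)
    return random.sample(buffer, min(batch_size, len(buffer)))

buffer = []
transition = ("s", 0, 1.0, "s_next", False)
batch = update_step(buffer, transition, augmented_per_update=2)
# A lower augmented replay ratio means fewer augmented transitions per update,
# so each sampled batch contains a smaller fraction of augmented data.
```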
- …