Offline Prioritized Experience Replay
Offline reinforcement learning (RL) is challenged by the distributional shift
problem. To address this problem, existing works mainly focus on designing
sophisticated policy constraints between the learned policy and the behavior
policy. However, these constraints are applied equally to well-performing and
inferior actions through uniform sampling, which might negatively affect the
learned policy. To alleviate this issue, we propose Offline Prioritized
Experience Replay (OPER), featuring a class of priority functions designed to
prioritize highly-rewarding transitions, making them more frequently visited
during training. Through theoretical analysis, we show that this class of
priority functions induces an improved behavior policy, and when constrained to
this improved policy, a policy-constrained offline RL algorithm is likely to
yield a better solution. We develop two practical strategies to obtain priority
weights by estimating advantages based on a fitted value network (OPER-A) or
utilizing trajectory returns (OPER-R) for quick computation. OPER is a
plug-and-play component for offline RL algorithms. As case studies, we evaluate
OPER on five different algorithms, including BC, TD3+BC, Onestep RL, CQL, and
IQL. Extensive experiments demonstrate that both OPER-A and OPER-R
significantly improve the performance of all baseline methods. Code and
priority weights are available at https://github.com/sail-sg/OPER.
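A rough sketch of the return-based variant described above (OPER-R): each transition inherits a sampling weight derived from its trajectory's return, so highly-rewarding trajectories are visited more often. This is a minimal illustration under assumed details (the softmax normalization and temperature are not taken from the paper), not the authors' exact priority function.

```python
import numpy as np

def oper_r_weights(trajectory_returns, lengths, temperature=1.0):
    """Illustrative return-based priority weights (OPER-R style).

    trajectory_returns: per-trajectory returns, shape (num_trajectories,).
    lengths: per-trajectory lengths, shape (num_trajectories,).
    Returns one sampling probability per transition. The softmax with a
    temperature is an assumed detail, not the paper's exact formulation.
    """
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    # Standardize returns so the temperature is scale-free.
    z = (returns - returns.mean()) / (returns.std() + 1e-8)
    traj_weights = np.exp(z / temperature)
    traj_weights /= traj_weights.sum()
    # Every transition in a trajectory shares its trajectory's weight.
    per_transition = np.repeat(traj_weights / np.asarray(lengths), lengths)
    return per_transition / per_transition.sum()

# Usage: draw a training batch with probabilities given by the weights.
rng = np.random.default_rng(0)
weights = oper_r_weights(trajectory_returns=[10.0, 2.0, 7.5], lengths=[3, 5, 4])
batch_idx = rng.choice(len(weights), size=4, p=weights)
```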
Attention Loss Adjusted Prioritized Experience Replay
Prioritized Experience Replay (PER) is a technique in deep reinforcement
learning that selects more informative experience samples to speed up the
training of the neural network. However, the non-uniform sampling used in PER
inevitably shifts the state-action distribution and introduces estimation
error into the Q-value function. In this paper, an Attention Loss Adjusted
Prioritized (ALAP) Experience Replay algorithm is proposed, which integrates
an improved self-attention network with a double-sampling mechanism to fit
the hyperparameter that regulates the importance-sampling weights,
eliminating the estimation error caused by PER. To verify the effectiveness
and generality of the algorithm, ALAP is tested with value-function-based,
policy-gradient-based, and multi-agent reinforcement learning algorithms in
OpenAI Gym, and comparison studies confirm the advantage and efficiency of
the proposed training framework.
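For context, the estimation error that ALAP targets stems from the standard PER importance-sampling correction. A minimal sketch of that standard mechanism is below; the self-attention network and double-sampling mechanism that ALAP uses to fit the correction exponent are not reproduced here, and the exponent names follow the original PER paper.

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Standard PER-style sampling with importance-sampling (IS) weights.

    Priorities are |TD error|^alpha; the IS weights w_i = (N * P(i))^-beta
    correct the distribution shift introduced by non-uniform sampling.
    ALAP replaces the fixed beta schedule with a learned adjustment, which
    is not shown here.
    """
    td_errors = np.asarray(td_errors, dtype=np.float64)
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    is_weights = (len(probs) * probs[idx]) ** (-beta)
    is_weights /= is_weights.max()   # normalize for stability
    return idx, is_weights           # scale each sample's loss by its IS weight

idx, w = per_sample(td_errors=[0.1, 2.0, 0.5, 0.05], batch_size=2)
```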
Quantum deep Q learning with distributed prioritized experience replay
This paper introduces the QDQN-DPER framework to enhance the efficiency of
quantum reinforcement learning (QRL) in solving sequential decision tasks. The
framework incorporates prioritized experience replay and asynchronous training
into the training algorithm to reduce the high sampling complexities. Numerical
simulations demonstrate that QDQN-DPER outperforms the baseline distributed
quantum Q learning with the same model architecture. The proposed framework
holds potential for more complex tasks while maintaining training efficiency.
Accelerating Reinforcement Learning with Prioritized Experience Replay for Maze Game
In this paper, we implemented two ways of improving the performance of reinforcement learning algorithms. We proposed a new equation for prioritizing transition samples to improve model accuracy, and by deploying a generalized solver of randomly generated two-dimensional mazes on a distributed computing platform, we make our dual-network model available to others for further research and development. Reinforcement learning is concerned with identifying the optimal sequence of actions an agent should take to reach an objective and achieve the highest future score. Complex settings pose computational challenges both in finding the best answer and in the training time required to do so. Our prioritization algorithm increased model accuracy by 7% versus a baseline model with no prioritization, and using five workers on the Ray platform with RLlib achieved a 4.5X acceleration in training time versus using one worker.
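The distributed speed-up reported above comes from collecting experience with several parallel rollout workers. The authors use RLlib on Ray; since the abstract does not give their configuration, the sketch below illustrates only the worker-parallelism idea with Ray's core task API and placeholder transitions.

```python
import random
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def collect_rollout(worker_id, episode_len=50):
    """Stand-in rollout worker returning a list of (s, a, r, s') transitions.

    A real worker would step a maze environment with a policy network; the
    random placeholders keep this sketch self-contained.
    """
    rng = random.Random(worker_id)
    return [(rng.random(), rng.randrange(4), rng.random(), rng.random())
            for _ in range(episode_len)]

# Five workers collect experience concurrently, mirroring the five-worker
# setup above; the learner then folds all transitions into one replay buffer.
futures = [collect_rollout.remote(i) for i in range(5)]
replay_buffer = [t for rollout in ray.get(futures) for t in rollout]
print(len(replay_buffer))  # 5 workers x 50 transitions
```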
ViZDoom: DRQN with Prioritized Experience Replay, Double-Q Learning, & Snapshot Ensembling
ViZDoom is a robust, first-person shooter reinforcement learning environment,
characterized by a significant degree of latent state information. In this
paper, double-Q learning and prioritized experience replay methods are tested
under a certain ViZDoom combat scenario using a competitive deep recurrent
Q-network (DRQN) architecture. In addition, an ensembling technique known as
snapshot ensembling is employed using a specific annealed learning rate to
observe differences in ensembling efficacy under these two methods. Annealed
learning rates are important in general to the training of deep neural network
models, as they shake up the status quo and counter a model's tendency toward
local optima. While both variants show performance exceeding that of the
game's built-in AI agents, the known stabilizing effects of double-Q learning are
illustrated, and prioritized experience replay is again validated in its
usefulness by showing immediate results early in agent development, with the
caveat that value overestimation is accelerated in this case. In addition, some
unique behaviors are observed to develop for the prioritized experience replay (PER)
and double-Q (DDQ) variants, and snapshot ensembling of both PER and DDQ proves
a valuable method for improving the performance of the ViZDoom Marine.
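A minimal sketch of the double-Q target mentioned above: the online network selects the next action and the target network evaluates it, which is what damps the value overestimation noted in the abstract. Array shapes and the discount factor are illustrative assumptions.

```python
import numpy as np

def double_q_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double-Q targets: argmax from the online net, value from the target net.

    q_online_next, q_target_next: arrays of shape (batch, num_actions) with
    Q-values for the next states under each network.
    """
    best_actions = np.argmax(q_online_next, axis=1)                     # online net selects
    next_values = q_target_next[np.arange(len(rewards)), best_actions]  # target net evaluates
    return rewards + gamma * (1.0 - dones) * next_values

# A single-network target would instead use q_target_next.max(axis=1), which
# tends to overestimate because the same noisy estimate both selects and
# evaluates the action.
batch, n_actions = 4, 3
rng = np.random.default_rng(1)
targets = double_q_targets(
    q_online_next=rng.normal(size=(batch, n_actions)),
    q_target_next=rng.normal(size=(batch, n_actions)),
    rewards=np.ones(batch),
    dones=np.zeros(batch),
)
```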
A Proposed Priority Pushing and Grasping Strategy Based on an Improved Actor-Critic Algorithm
Pushing and grasping are among the most basic and primary skills of a robot. In cluttered scenes, pushing makes room for the arm and fingers to grasp objects. We propose a modified Actor-Critic (A-C) framework for deep reinforcement learning, Cross-entropy Softmax A-C (CSAC), used together with Prioritized Experience Replay (PER), which combines the advantages of value-function-based and policy-gradient-based algorithms. The grasping model is trained with self-supervised learning to achieve an end-to-end mapping from image to pushing and grasping actions. The overall algorithm framework is organized into a vision module and an action module. Prioritized experience replay is then refined to further improve the CSAC-PER algorithm's sample diversity and the robot's exploration performance during grasp training. The experience replay buffer is dynamically sampled using a Beta-distribution prior, and the resulting dynamic sampling algorithm (CSAC-beta) is proposed on top of the CSAC algorithm. Despite its low initial efficiency, experimental simulation results show that the CSAC-beta algorithm eventually achieves good results and a higher grasping success rate (90%).
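A minimal sketch of what Beta-distribution sampling over a replay buffer might look like: buffer positions are mapped onto [0, 1] and indices are drawn from a Beta prior, so the distribution's parameters control which region of the buffer is emphasized. The parameter values and schedule below are assumptions for illustration, not the CSAC-beta paper's actual settings.

```python
import numpy as np

def beta_sample_indices(buffer_size, batch_size, a, b, rng=None):
    """Draw buffer indices whose normalized positions follow a Beta(a, b) prior.

    Position 0.0 is the oldest transition and 1.0 the newest; the Beta
    parameters determine which region of the buffer is sampled most often.
    """
    rng = rng or np.random.default_rng()
    positions = rng.beta(a, b, size=batch_size)                        # values in (0, 1)
    return np.minimum((positions * buffer_size).astype(int), buffer_size - 1)

rng = np.random.default_rng(0)
# Assumed schedule: early on, Beta(5, 1) skews toward recent (high-index)
# samples; later, Beta(2, 2) spreads sampling more evenly across the buffer.
early_batch = beta_sample_indices(buffer_size=10_000, batch_size=8, a=5, b=1, rng=rng)
late_batch = beta_sample_indices(buffer_size=10_000, batch_size=8, a=2, b=2, rng=rng)
```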