
    Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning

    Reinforcement learning is an important branch of machine learning. With the development of deep learning, deep reinforcement learning has gradually become the focus of reinforcement learning research. Model-free off-policy deep reinforcement learning algorithms for continuous control attract wide attention because of their strong practicality. Like Q-learning, actor-critic algorithms suffer from the problem of overestimation. The clipped double Q-learning method mitigates the effect of overestimation in actor-critic algorithms to a certain extent, but it also introduces underestimation into the learning process. To further address the problems of overestimation and underestimation in actor-critic algorithms, a new learning method, randomly weighted triple Q-learning, is proposed. In addition, combining the new method with the soft actor-critic algorithm yields a new soft actor-critic algorithm based on randomly weighted triple Q-learning (SAC-RWTQ). This algorithm not only keeps the Q estimate close to the true Q value but also increases the randomness of the Q estimate through the randomly weighted method, so as to address both overestimation and underestimation of action values during learning. Experimental results show that, compared to the SAC algorithm and other currently popular deep reinforcement learning algorithms such as DDPG, PPO, and TD3, the SAC-RWTQ algorithm achieves better performance on several MuJoCo tasks on the Gym simulation platform.
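    The core idea, three Q estimates combined with a random convex weight, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's exact update rule: the clipped double-Q term min(q1, q2) is biased low, a single estimate q3 is biased high, and a random mixing weight keeps the target between the two while adding randomness.

```python
import numpy as np

def rwtq_target(q1, q2, q3, rng):
    """Randomly weighted triple-Q target (illustrative sketch).

    Mixes the pessimistic clipped double-Q term min(q1, q2) with a third
    estimate q3 using a random convex weight. The precise weighting scheme
    in SAC-RWTQ may differ; only the shape of the idea is shown here.
    """
    beta = rng.uniform(0.0, 1.0)                 # random mixing weight in [0, 1]
    clipped = np.minimum(q1, q2)                 # conservative double-Q term
    return beta * clipped + (1.0 - beta) * q3    # always between the two terms

rng = np.random.default_rng(0)
t = rwtq_target(np.array([1.0]), np.array([2.0]), np.array([3.0]), rng)
# the target lies between min(q1, q2) = 1.0 and q3 = 3.0
```

    Because the weight is a convex combination, the resulting target is bounded by the pessimistic and optimistic estimates regardless of the sampled weight.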

    Careful at Estimation and Bold at Exploration

    Exploration strategies in continuous action space are often heuristic due to the infinite number of actions, and such methods cannot derive a general conclusion. Prior work has shown that policy-based exploration is beneficial for continuous action space in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues: aimless exploration and policy divergence, and the policy gradient for exploration is only sometimes helpful due to inaccurate estimation. Based on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is derived as a weighted sum of the conservative Q value over actions, with weights given by the corresponding greedy Q value. Greedy Q takes the maximum of the two Q functions, and conservative Q takes the minimum of the two Q functions. For practicality, this theoretical basis is then extended to combine action exploration with the Q value update, under the premise that we have a surrogate policy that behaves like this exploration policy. In practice, we construct such an exploration policy with a few sampled actions, and to meet the premise, we learn the surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed from the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
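    The greedy/conservative weighting described above can be sketched for a handful of sampled actions. This is a reading of the abstract's description only; details such as a softmax temperature are assumptions of the sketch.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def expected_q(q1, q2):
    """Expected Q over sampled actions, per the abstract's description:
    weights come from the greedy Q (elementwise max of the two critics),
    values come from the conservative Q (elementwise min)."""
    greedy = np.maximum(q1, q2)        # optimistic estimate, used for weights
    conservative = np.minimum(q1, q2)  # pessimistic estimate, used for values
    w = softmax(greedy)
    return float(np.dot(w, conservative))

# Q values of the two critics for three sampled actions
q1 = np.array([1.0, 2.0, 0.5])
q2 = np.array([0.8, 2.5, 1.0])
v = expected_q(q1, q2)  # weighted toward the action with the highest greedy Q
```

    The softmax weights favor actions the greedy Q rates highly, while the values being averaged remain the conservative estimates, so the result stays within the range of the conservative Q values.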

    Learning Optimal Biomarker-Guided Treatment Policy for Chronic Disorders

    Electroencephalogram (EEG) provides noninvasive measures of brain activity and has been found valuable for the diagnosis of some chronic disorders. Specifically, pre-treatment EEG signals in the alpha and theta frequency bands have demonstrated some association with anti-depressant response, which is well known to have a low response rate. We aim to design an integrated pipeline that improves the response rate of major depressive disorder patients by developing an individualized treatment policy guided by resting-state pre-treatment EEG recordings and other treatment effect modifiers. We first design an innovative automatic site-specific EEG preprocessing pipeline to extract features that carry stronger signals than the raw data. We then estimate the conditional average treatment effect using causal forests, and use a doubly robust technique to improve the efficiency of the average treatment effect estimate. We present evidence of heterogeneity in the treatment effect and the modifying power of EEG features, as well as a significant average treatment effect, a result that cannot be obtained by conventional methods. Finally, we employ an efficient policy learning algorithm to learn an optimal depth-2 treatment assignment decision tree and compare its performance with Q-learning and outcome-weighted learning via simulation studies and an application to a large multi-site, double-blind randomized controlled clinical trial, EMBARC.
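    The doubly robust step mentioned above is commonly implemented as the augmented inverse-propensity-weighted (AIPW) estimator; the sketch below shows that standard estimator, not the paper's specific implementation. It is consistent if either the propensity model or the outcome model is correctly specified.

```python
import numpy as np

def aipw_ate(y, t, e, mu1, mu0):
    """Doubly robust (AIPW) estimate of the average treatment effect.

    y:   observed outcomes
    t:   binary treatment indicators (0/1)
    e:   estimated propensity scores P(T=1 | X)
    mu1: outcome-model predictions under treatment
    mu0: outcome-model predictions under control
    """
    term1 = mu1 + t * (y - mu1) / e              # augmented treated mean
    term0 = mu0 + (1 - t) * (y - mu0) / (1 - e)  # augmented control mean
    return float(np.mean(term1 - term0))

# Tiny worked example: when the outcome model is exact, the inverse-propensity
# corrections vanish and the estimate equals mean(mu1 - mu0).
t = np.array([1, 0, 1, 0])
mu1 = np.array([2.0, 1.0, 3.0, 2.0])
mu0 = np.array([1.0, 0.0, 1.0, 1.0])
e = np.full(4, 0.5)
y = t * mu1 + (1 - t) * mu0
ate = aipw_ate(y, t, e, mu1, mu0)  # mean(mu1 - mu0) = 1.25
```

    In practice, cross-fitting (estimating e, mu1, and mu0 on held-out folds) is typically used alongside this estimator to obtain valid inference.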

    Danger-aware Adaptive Composition of DRL Agents for Self-navigation

    Self-navigation, referred to as the capability of automatically reaching the goal while avoiding collisions with obstacles, is a fundamental skill required for mobile robots. Recently, deep reinforcement learning (DRL) has shown great potential in the development of robot navigation algorithms. However, it is still difficult to train a robot to learn goal-reaching and obstacle-avoidance skills simultaneously. On the other hand, although many DRL-based obstacle-avoidance algorithms have been proposed, few of them are reused for more complex navigation tasks. In this paper, a novel danger-aware adaptive composition (DAAC) framework is proposed to combine two individually DRL-trained agents, obstacle-avoidance and goal-reaching, into a navigation agent without any redesign or retraining. The key to this adaptive composition approach is that the value function output by the obstacle-avoidance agent serves as an indicator of the risk level of the current situation, which in turn determines the contribution of the two agents to the next move. Simulation and real-world testing results show that the composed navigation network can control the robot to accomplish difficult navigation tasks, e.g., reaching a series of successive goals in an unknown and complex environment safely and quickly.
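    The composition rule described above, weighting the two agents' actions by the danger level read off the obstacle-avoidance critic, can be sketched as follows. The linear mapping from value to blending weight is an assumption of this sketch; the paper's exact rule may differ.

```python
import numpy as np

def compose_action(a_goal, a_avoid, v_avoid, v_safe=0.0, v_danger=-1.0):
    """Danger-aware blend of two agents' actions (illustrative sketch).

    v_avoid is the value predicted by the obstacle-avoidance critic for the
    current state; lower values indicate higher collision risk. A weight in
    [0, 1] ramps linearly from 0 (safe, follow the goal-reaching agent) to
    1 (dangerous, follow the obstacle-avoidance agent).
    """
    w = np.clip((v_safe - v_avoid) / (v_safe - v_danger), 0.0, 1.0)
    return (1.0 - w) * a_goal + w * a_avoid

# Midway danger level -> action is an even blend of the two agents
a = compose_action(np.array([1.0, 0.0]),   # goal-reaching agent's action
                   np.array([0.0, 1.0]),   # obstacle-avoidance agent's action
                   v_avoid=-0.5)
# a == [0.5, 0.5]
```

    The appeal of this design is that neither agent is retrained: the obstacle-avoidance critic, already learned, doubles as a free danger signal for arbitration.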