Off-policy Maximum Entropy Deep Reinforcement Learning Algorithm Based on Randomly Weighted Triple Q-Learning
Reinforcement learning is an important branch of machine learning. With the development of deep learning, deep reinforcement learning has gradually become the focus of reinforcement learning research. Model-free off-policy deep reinforcement learning algorithms for continuous control attract wide attention because of their strong practicality. Like Q-learning, actor-critic algorithms suffer from the problem of overestimation. The clipped double Q-learning method mitigates overestimation in actor-critic algorithms to a certain extent, but it also introduces underestimation into the learning process. To further address both overestimation and underestimation in actor-critic algorithms, a new learning method, randomly weighted triple Q-learning, is proposed. In addition, combining the new method with the soft actor-critic (SAC) algorithm yields a new soft actor-critic algorithm based on randomly weighted triple Q-learning (SAC-RWTQ). This algorithm not only keeps the Q estimate close to the true Q value but also increases the randomness of the Q estimate through random weighting, thereby addressing both overestimation and underestimation of action values during learning. Experimental results show that, compared with SAC and other currently popular deep reinforcement learning algorithms such as DDPG, PPO, and TD3, the SAC-RWTQ algorithm achieves better performance on several MuJoCo tasks on the Gym simulation platform.
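The abstract does not state the exact weighting scheme, so the following is only a minimal sketch of the core idea: form the critic target from a random convex combination of three Q estimates, which keeps the target between the most pessimistic and most optimistic critic while adding randomness. The Dirichlet sampling of the weights is an assumption for illustration, not the paper's specification.

```python
import numpy as np

def rwtq_target(q_values, rng):
    """Randomly weighted triple-Q target for a single (state, action) pair.

    q_values: array of shape (3,), the three critics' estimates.
    A random convex combination (weights sum to 1) bounds the target
    between min and max of the three estimates, counteracting both
    systematic overestimation and systematic underestimation.
    """
    w = rng.dirichlet(np.ones(3))   # random nonnegative weights, sum to 1
    return float(np.dot(w, q_values))
```

By construction the target always lies within the interval spanned by the three critics, unlike the single max (overestimation-prone) or min (underestimation-prone) reductions.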
Careful at Estimation and Bold at Exploration
Exploration strategies in continuous action space are often heuristic due to
the infinite actions, and these kinds of methods cannot derive a general
conclusion. In prior work, it has been shown that policy-based exploration is
beneficial for continuous action space in deterministic policy reinforcement
learning(DPRL). However, policy-based exploration in DPRL has two prominent
issues: aimless exploration and policy divergence, and the policy gradient for
exploration is only sometimes helpful due to inaccurate estimation. Based on
the double-Q function framework, we introduce a novel exploration strategy to
mitigate these issues, separate from the policy gradient. We first propose a
greedy Q softmax update scheme for the Q value update. The expected Q value is
derived as a weighted sum of the conservative Q values over actions, where each
weight is the corresponding greedy Q value. The greedy Q takes the maximum of
the two Q functions, and the conservative Q takes the minimum of the two.
For practicality, this theoretical basis is then
extended to allow us to combine action exploration with the Q value update,
except for the premise that we have a surrogate policy that behaves like this
exploration policy. In practice, we construct such an exploration policy with a
few sampled actions, and to meet the premise, we learn such a surrogate policy
by minimizing the KL divergence between the target policy and the exploration
policy constructed by the conservative Q. We evaluate our method on the Mujoco
benchmark and demonstrate superior performance compared to previous
state-of-the-art methods across various environments, particularly in the most
complex Humanoid environment.

Comment: 20 pages
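The greedy Q softmax update described above can be sketched over a set of sampled actions: weight the conservative (element-wise min) Q values by a softmax over the greedy (element-wise max) Q values. The softmax temperature is an assumed detail for illustration; the abstract only specifies that the greedy Q values act as weights.

```python
import numpy as np

def greedy_q_softmax_value(q1, q2, tau=1.0):
    """Expected Q value over sampled actions from two critics.

    q1, q2: arrays of shape (n_actions,), the two critics' estimates
    at the sampled actions. The greedy Q (max) determines the weights;
    the conservative Q (min) supplies the values being averaged.
    """
    greedy = np.maximum(q1, q2)        # greedy Q: element-wise maximum
    conservative = np.minimum(q1, q2)  # conservative Q: element-wise minimum
    z = (greedy - greedy.max()) / tau  # numerically stable softmax
    w = np.exp(z) / np.exp(z).sum()
    return float(np.dot(w, conservative))
```

Because the weights come from the greedy Q, promising actions dominate the average ("bold at exploration"), while the values themselves stay conservative ("careful at estimation").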
Learning Optimal Biomarker-Guided Treatment Policy for Chronic Disorders
Electroencephalogram (EEG) provides noninvasive measures of brain activity
and is found to be valuable for diagnosis of some chronic disorders.
Specifically, pre-treatment EEG signals in alpha and theta frequency bands have
demonstrated some association with anti-depressant response, which is
well-known to have low response rate. We aim to design an integrated pipeline
that improves the response rate of major depressive disorder patients by
developing an individualized treatment policy guided by the resting state
pre-treatment EEG recordings and other treatment effects modifiers. We first
design an innovative automatic site-specific EEG preprocessing pipeline to
extract features that possess stronger signals compared with raw data. We then
estimate the conditional average treatment effect using causal forests, and use
a doubly robust technique to improve the efficiency in the estimation of the
average treatment effect. We present evidence of heterogeneity in the treatment
effect and the modifying power of EEG features as well as a significant average
treatment effect, a result that cannot be obtained by conventional methods.
Finally, we employ an efficient policy learning algorithm to learn an optimal
depth-2 treatment assignment decision tree and compare its performance with
Q-Learning and outcome-weighted learning via simulation studies and an
application to a large multi-site, double-blind randomized controlled clinical
trial, EMBARC
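The doubly robust step mentioned above typically takes the form of an augmented inverse-probability-weighted (AIPW) estimator; the sketch below assumes that standard form, since the abstract does not spell out the estimator. The variable names (`mu1`, `mu0`, `e`) are illustrative.

```python
import numpy as np

def aipw_ate(y, t, mu1, mu0, e):
    """Doubly robust (AIPW) estimate of the average treatment effect.

    y: observed outcomes; t: binary treatment indicator (0/1);
    mu1, mu0: outcome-model predictions under treatment and control;
    e: estimated propensity scores P(T=1 | X).
    The estimator is consistent if either the outcome model or the
    propensity model is correctly specified, hence "doubly robust".
    """
    return float(np.mean(
        mu1 - mu0
        + t * (y - mu1) / e
        - (1 - t) * (y - mu0) / (1 - e)
    ))
```

The correction terms vanish when the outcome model is exact, and they de-bias the plug-in estimate when it is not, which is what improves efficiency over a pure outcome-regression estimate of the average treatment effect.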
Danger-aware Adaptive Composition of DRL Agents for Self-navigation
Self-navigation, referred to as the capability of automatically reaching the
goal while avoiding collisions with obstacles, is a fundamental skill required
for mobile robots. Recently, deep reinforcement learning (DRL) has shown great
potential in the development of robot navigation algorithms. However, it is
still difficult to train the robot to learn goal-reaching and
obstacle-avoidance skills simultaneously. On the other hand, although many
DRL-based obstacle-avoidance algorithms have been proposed, few of them are reused
for more complex navigation tasks. In this paper, a novel danger-aware adaptive
composition (DAAC) framework is proposed to combine two individually
DRL-trained agents, one for obstacle avoidance and one for goal reaching, into a
navigation agent without any redesign or retraining. The key to this
adaptive composition approach is that the value function output by the
obstacle-avoidance agent serves as an indicator for evaluating the risk level
of the current situation, which in turn determines the contribution of these
two agents for the next move. Simulation and real-world testing results show
that the composed navigation network can control the robot to accomplish
difficult navigation tasks, e.g., reaching a series of successive goals in an
unknown and complex environment safely and quickly.

Comment: 7 pages, 9 figures
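The adaptive composition described above can be sketched as a gating rule: map the obstacle-avoidance critic's value for the current state to a risk weight in [0, 1], then blend the two agents' actions accordingly. The linear mapping and the `v_safe`/`v_danger` thresholds are assumptions for illustration; the paper's exact rule may differ.

```python
import numpy as np

def compose_action(a_goal, a_avoid, v_avoid, v_safe=0.0, v_danger=-1.0):
    """Blend goal-reaching and obstacle-avoidance actions by risk level.

    a_goal, a_avoid: action vectors proposed by the two trained agents.
    v_avoid: value predicted by the obstacle-avoidance critic for the
    current state; low values signal imminent collision risk.
    The risk weight rises linearly from 0 at v_safe to 1 at v_danger,
    shifting control toward the obstacle-avoidance agent as danger grows.
    """
    risk = np.clip((v_safe - v_avoid) / (v_safe - v_danger), 0.0, 1.0)
    return (1.0 - risk) * np.asarray(a_goal) + risk * np.asarray(a_avoid)
```

In a safe state (`v_avoid` near `v_safe`) the goal-reaching agent drives the robot; as the avoidance critic's value drops toward `v_danger`, the avoidance agent takes over, with no retraining of either agent.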