Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk
Though deep reinforcement learning (DRL) has achieved substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transitions and observations. Most existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance, since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric, the Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on this analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm, CVaR-Proximal-Policy-Optimization (CPPO), which formalizes the risk-sensitive constrained optimization problem by keeping the CVaR under a given threshold. Experimental results show that CPPO achieves higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
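As a rough illustration of the risk measure CPPO constrains (not the paper's full constrained optimization), the sketch below computes the empirical tail-mean form of CVaR over a batch of sampled returns; the returns themselves are hypothetical.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Empirical conditional value-at-risk: the mean of the worst
    alpha-fraction of sampled returns (the lower tail)."""
    returns = np.sort(np.asarray(returns))          # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the alpha-tail
    return returns[:k].mean()

# Hypothetical usage: a batch of episode returns collected from rollouts.
episode_returns = np.random.default_rng(0).normal(100.0, 25.0, size=1000)
print(cvar(episode_returns, alpha=0.1))  # mean of the worst 10% of episodes
```

At alpha = 1 this reduces to the ordinary expected return, while smaller alpha concentrates on the worst tail; constraining this quantity is what lets a CVaR-based method sit between risk-neutral and overly pessimistic worst-case objectives.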
Consistent Attack: Universal Adversarial Perturbation on Embodied Vision Navigation
Embodied agents in vision navigation coupled with deep neural networks have attracted increasing attention. However, deep neural networks have been shown to be vulnerable to malicious adversarial noise, which may potentially cause catastrophic failures in Embodied Vision Navigation. Among different adversarial noises, universal adversarial perturbations (UAP), i.e., a constant image-agnostic perturbation applied to every input frame of the agent, play a critical role in Embodied Vision Navigation since they are computationally efficient and practical to apply during the attack. However, existing UAP methods ignore the system dynamics of Embodied Vision Navigation and might be sub-optimal. In order to extend UAP to the sequential decision setting, we formulate the disturbed environment under the universal noise δ as a δ-disturbed Markov Decision Process (δ-MDP). Based on the formulation, we analyze the properties of the δ-MDP and propose two novel Consistent Attack methods, named Reward UAP and Trajectory UAP, for attacking Embodied agents, which consider the dynamics of the MDP and calculate universal noises by estimating the disturbed distribution and the disturbed Q function. For various victim models, our Consistent Attack can cause a significant drop in their performance in the PointGoal task in Habitat with different datasets and different scenes. Extensive experimental results indicate that there exist serious potential risks in applying Embodied Vision Navigation methods to the real world.
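To make the constant, image-agnostic property of a UAP concrete, here is a minimal sketch of one perturbation tensor shared across all frames and updated to lower a victim's Q-values; the `q_net`, step size, and L-infinity budget are illustrative assumptions, not the paper's Reward UAP or Trajectory UAP procedures.

```python
import torch

def apply_uap(frames, delta, eps=8 / 255):
    """Add one constant perturbation to every frame in the batch,
    clipped to an L-infinity budget and the valid pixel range."""
    return (frames + delta.clamp(-eps, eps)).clamp(0.0, 1.0)

def uap_step(q_net, frames, delta, lr=1e-2, eps=8 / 255):
    """One gradient step on the shared perturbation that pushes the
    victim's best achievable Q-value down across a batch of frames."""
    delta = delta.clone().requires_grad_(True)
    q_vals = q_net(apply_uap(frames, delta, eps))  # (batch, num_actions)
    loss = q_vals.max(dim=1).values.mean()         # minimize best-case value
    loss.backward()
    with torch.no_grad():
        delta = (delta - lr * delta.grad.sign()).clamp(-eps, eps)
    return delta.detach()
```

Because the same `delta` is reused on every frame, the attack needs no per-step optimization at deployment time, which is what makes UAPs computation-efficient in the embodied setting.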
Task Aware Dreamer for Task Generalization in Reinforcement Learning
A long-standing goal of reinforcement learning is to acquire agents that can learn on training tasks and generalize well to unseen tasks that may share similar dynamics but have different reward functions. A general challenge is to quantitatively measure the similarities between these different tasks, which is vital for analyzing the task distribution and further designing algorithms with stronger generalization. To address this, we present a novel metric named Task Distribution Relevance (TDR), defined via the optimal Q functions of different tasks, to capture the relevance of the task distribution quantitatively. For task distributions with high TDR, i.e., where tasks differ significantly, we show that Markovian policies cannot differentiate the tasks, leading to poor performance. Based on this insight, we encode all historical information into policies to distinguish different tasks and propose Task Aware Dreamer (TAD), which extends world models into reward-informed world models to capture invariant latent features over different tasks. In TAD, we calculate the corresponding variational lower bound of the data log-likelihood, including a novel term that distinguishes different tasks via states, to optimize the reward-informed world models. Extensive experiments in both image-based and state-based control tasks demonstrate that TAD can significantly improve performance when handling different tasks simultaneously, especially those with high TDR, and show a strong generalization ability to unseen tasks.
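The abstract's key architectural point, encoding the full interaction history so a policy can tell reward-distinct tasks apart, can be sketched with a recurrent encoder over (observation, action, reward) tuples; this is an illustrative stand-in, not TAD's actual reward-informed world model.

```python
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Minimal sketch: a GRU folds the (observation, action, reward)
    history into a latent the policy conditions on, so tasks sharing
    dynamics but differing in rewards become distinguishable."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim + 1, hidden, batch_first=True)

    def forward(self, obs, act, rew):
        # obs: (B, T, obs_dim), act: (B, T, act_dim), rew: (B, T, 1)
        x = torch.cat([obs, act, rew], dim=-1)
        out, _ = self.rnn(x)
        return out[:, -1]  # history summary fed to the policy head
```

A purely Markovian policy sees only the current state and so cannot recover which reward function is active, which is exactly the failure mode the TDR analysis predicts for highly distinct tasks.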
Search for light dark matter from atmosphere in PandaX-4T
We report a search for light dark matter produced through the cascading decay of η mesons, which are created as a result of inelastic collisions between
cosmic rays and Earth's atmosphere. We introduce a new, publicly accessible, and general framework designed specifically for boosted dark matter, with which we perform a full and dedicated simulation, covering both elastic and quasi-elastic processes, of the Earth's attenuation effect on the dark matter particles arriving at the detector. In the PandaX-4T commissioning data of 0.63
tonne·year exposure, no significant excess over background is observed.
The first constraints on the interaction between light dark matter generated in
the atmosphere and nucleus through a light scalar mediator are obtained. The
lowest excluded cross-section is set at for
dark matter mass of MeV and mediator mass of 300 MeV. The
lowest upper limit of the η to dark matter decay branching ratio is
Document clustering using sample weighting
Clustering algorithms based on sample weighting have attracted attention recently. In this paper, a novel sample-weighting clustering algorithm is presented based on the K-Means and fuzzy C-Means algorithms. The algorithm uses academic documents as the clustering objects. The PageRank value of each document is calculated from the citation relationships among the documents and is used as the sample weight in the algorithm. Experiments show that the proposed algorithm effectively improves the performance of document clustering.
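A minimal sketch of the idea, assuming documents are already embedded as vectors and their PageRank scores are precomputed: K-Means in which each centroid update is a PageRank-weighted mean. Initialization details and the fuzzy C-Means variant are omitted.

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    """Sample-weighted K-Means: X holds document vectors, w holds their
    weights (e.g. PageRank over the citation graph), and each centroid
    is the weighted mean of its assigned documents."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each document to its nearest centroid.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids as PageRank-weighted means.
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return labels, centers
```

Weighting by PageRank pulls centroids toward highly cited documents, so clusters are anchored on the most influential papers rather than treating every document equally.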
A quantitative evaluation system of Chinese journals in the humanities and social sciences
Based on analyses of existing indicators for evaluating journals in the humanities and social sciences and our experience in constructing the Chinese Social Science Citation Index (CSSCI), we propose a comprehensive system for evaluating Chinese academic journals in the humanities and social sciences. This system consists of 8 primary indicators, with 17 sub-indicators for multidisciplinary journals and 19 sub-indicators for discipline-specific journals. Each indicator or sub-indicator is assigned a suitable weight according to its importance in measuring a journal's academic quality and/or impact.
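The aggregation the system describes, weighted indicators combining into one score, reduces to a weighted sum; the indicator names, values, and weights below are placeholders, not the CSSCI system's actual ones.

```python
# Minimal sketch: a composite journal score as a weighted sum of
# normalized indicator values. All names and numbers here are
# illustrative placeholders, not the system's actual indicators.
indicators = {"impact_factor": 0.82, "citation_rate": 0.67, "web_usage": 0.45}
weights = {"impact_factor": 0.5, "citation_rate": 0.3, "web_usage": 0.2}

score = sum(weights[name] * value for name, value in indicators.items())
print(f"composite score: {score:.3f}")  # weights sum to 1, so score stays in [0, 1]
```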
Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning
Deep reinforcement learning models are vulnerable to adversarial attacks that
can decrease a victim's cumulative expected reward by manipulating the victim's
observations. Despite the efficiency of previous optimization-based methods for
generating adversarial noise in supervised learning, such methods might not be
able to achieve the lowest cumulative reward since they do not explore the
environmental dynamics in general. In this paper, we provide a framework to
better understand the existing methods by reformulating the problem of
adversarial attacks on reinforcement learning in the function space. Our
reformulation generates an optimal adversary in the function space of the
targeted attacks, repelling them via a generic two-stage framework. In the
first stage, we train a deceptive policy by hacking the environment, and
discover a set of trajectories routing to the lowest reward or the worst-case
performance. Next, the adversary misleads the victim to imitate the deceptive
policy by perturbing the observations. Compared to existing approaches, we
theoretically show that our adversary is stronger under an appropriate noise
level. Extensive experiments demonstrate our method's superiority in terms of
efficiency and effectiveness, achieving state-of-the-art performance in both Atari and MuJoCo environments.
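The second stage described above, misleading the victim into imitating the deceptive policy by perturbing observations, can be sketched as a projected-gradient search for a bounded perturbation whose induced action matches the deceptive policy's choice; `victim`, the budget `eps`, and the step sizes are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def perturb_obs(victim, target_action, obs, eps=0.05, steps=10, lr=0.01):
    """Find a bounded observation perturbation that makes the victim's
    policy (assumed to return action logits) pick the action chosen by
    the pre-trained deceptive policy. Both networks are hypothetical."""
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(steps):
        logits = victim((obs + delta).clamp(0.0, 1.0))
        loss = F.cross_entropy(logits, target_action)  # match the target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()  # PGD step toward the target
            delta.clamp_(-eps, eps)          # stay inside the noise budget
        delta.grad.zero_()
    return (obs + delta).clamp(0.0, 1.0).detach()
```

Splitting the attack this way moves the expensive exploration of environment dynamics into the first stage (training the deceptive policy), leaving only a cheap supervised imitation objective at attack time.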
An Implementation of Actor-Critic Algorithm on Spiking Neural Network Using Temporal Coding Method
Taking advantage of the faster speed, lower resource consumption, and better biological interpretability of spiking neural networks, this paper develops a novel spiking neural network reinforcement learning method using an actor-critic architecture and temporal coding. A simple improved leaky integrate-and-fire (LIF) model is used to describe the behavior of a spiking neuron. The actor-critic network structure and the update formulas using temporally encoded information are then provided. The model is examined on a decision-making task, a gridworld task, a UAV flying-through-a-window task, and a flying-basketball avoidance task. In the 5 × 5 grid map, the learned value function was close to the ideal one, and the quickest path from one state to another was found. A UAV trained by this method was able to fly through the window quickly in simulation, and an actual flight test of a UAV avoiding a flying basketball was conducted. With this model, the success rate of the test was 96% and the average decision time was 41.3 ms. These results show the effectiveness and accuracy of the temporally coded spiking neural network RL method. In conclusion, this work attempts to provide insights into developing spiking neural network reinforcement learning methods for decision-making and autonomous control of unmanned systems.
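A minimal sketch of the LIF dynamics underlying the method, with a loop illustrating why stronger inputs spike earlier, which is the essence of temporal coding; the constants and reset rule are simplified assumptions, not the paper's improved LIF model.

```python
import numpy as np

def lif_step(v, i_in, dt=1.0, tau=20.0, v_rest=0.0, v_thresh=1.0):
    """One Euler step of a leaky integrate-and-fire neuron: the membrane
    potential leaks toward v_rest, integrates the input current, and
    emits a spike (then resets) when it crosses the threshold."""
    v = v + (dt / tau) * (v_rest - v + i_in)
    spikes = v >= v_thresh
    v = np.where(spikes, v_rest, v)  # reset the neurons that spiked
    return v, spikes

# Hypothetical usage: larger constant inputs reach threshold sooner, so
# the spike *time* encodes the input strength (temporal coding).
v = np.zeros(4)
for t in range(80):
    v, spikes = lif_step(v, i_in=np.array([0.9, 1.1, 1.3, 1.5]))
    if spikes.any():
        print(f"t={t}: neurons {np.flatnonzero(spikes)} spiked")
```

Note that the neuron driven at 0.9 never reaches the threshold of 1.0 (its potential saturates at the input value), which also shows how sub-threshold inputs are silently filtered out.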