21 research outputs found

    Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

    Full text link
    Though deep reinforcement learning (DRL) has achieved substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transitions and observations. Most existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance, since these two kinds of disturbance affect different parts of the agent; moreover, the popular worst-case return criterion may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on this analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm, CVaR-Proximal-Policy-Optimization (CPPO), which formalizes a risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
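
    A schematic form of the constrained problem CPPO targets, with notation assumed here rather than taken from the paper: the policy maximizes expected return while keeping the CVaR of a loss variable (e.g., performance degradation) below a threshold.

        \max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^{t} r_t\Big]
        \quad \text{s.t.} \quad \mathrm{CVaR}_{\alpha}(L_{\pi}) \le \beta,
        \qquad \text{where} \quad
        \mathrm{CVaR}_{\alpha}(L) = \mathbb{E}\big[L \mid L \ge \mathrm{VaR}_{\alpha}(L)\big].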

    Consistent Attack: Universal Adversarial Perturbation on Embodied Vision Navigation

    Full text link
    Embodied agents for vision navigation, built on deep neural networks, have attracted increasing attention. However, deep neural networks are vulnerable to malicious adversarial noise, which may cause catastrophic failures in Embodied Vision Navigation. Among different kinds of adversarial noise, universal adversarial perturbations (UAP), i.e., a constant image-agnostic perturbation applied to every input frame of the agent, play a critical role in Embodied Vision Navigation since they are computation-efficient and practical to apply during an attack. However, existing UAP methods ignore the system dynamics of Embodied Vision Navigation and may be sub-optimal. To extend UAP to the sequential decision setting, we formulate the environment disturbed by the universal noise δ as a δ-disturbed Markov Decision Process (δ-MDP). Based on this formulation, we analyze the properties of the δ-MDP and propose two novel Consistent Attack methods, named Reward UAP and Trajectory UAP, for attacking embodied agents; these methods take the dynamics of the MDP into account and compute universal noise by estimating the disturbed distribution and the disturbed Q function. For various victim models, our Consistent Attack causes a significant performance drop on the PointGoal task in Habitat across different datasets and scenes. Extensive experimental results indicate serious potential risks in applying Embodied Vision Navigation methods to the real world.
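
    A minimal sketch of how one constant, image-agnostic perturbation could be optimized against an embodied agent, assuming a differentiable Q-network q_net, a buffer of collected frames, and an L-infinity bound eps; the names are illustrative, and the paper's Reward UAP and Trajectory UAP differ in how the disturbed distribution and disturbed Q function are estimated.

        import torch

        def universal_perturbation(q_net, frames, eps, steps=100, lr=1e-2):
            """One shared perturbation applied to every frame, optimized to
            push down the best Q-value available to the agent (illustrative)."""
            delta = torch.zeros_like(frames[0], requires_grad=True)
            opt = torch.optim.Adam([delta], lr=lr)
            for _ in range(steps):
                loss = sum(q_net(obs + delta).max() for obs in frames)
                opt.zero_grad()
                loss.backward()
                opt.step()
                with torch.no_grad():          # keep the noise inside the L-infinity ball
                    delta.clamp_(-eps, eps)
            return delta.detach()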

    Task Aware Dreamer for Task Generalization in Reinforcement Learning

    Full text link
    A long-standing goal of reinforcement learning is to acquire agents that can learn on training tasks and generalize well to unseen tasks that may share similar dynamics but have different reward functions. A general challenge is to quantitatively measure the similarity between such tasks, which is vital for analyzing the task distribution and for designing algorithms with stronger generalization. To address this, we present a novel metric named Task Distribution Relevance (TDR), defined via the optimal Q functions of different tasks, to quantitatively capture the relevance of the task distribution. For task distributions with high TDR, i.e., where the tasks differ significantly, we show that Markovian policies cannot distinguish them, leading to poor performance. Based on this insight, we encode all historical information into the policy to distinguish different tasks and propose Task Aware Dreamer (TAD), which extends world models to reward-informed world models that capture invariant latent features across tasks. In TAD, we derive the corresponding variational lower bound of the data log-likelihood, including a novel term that distinguishes different tasks via states, to optimize the reward-informed world models. Extensive experiments on both image-based and state-based control tasks demonstrate that TAD significantly improves performance when handling different tasks simultaneously, especially those with high TDR, and shows strong generalization to unseen tasks.
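
    For orientation only, a generic reward-informed world-model objective resembles a Dreamer-style evidence lower bound with an added reward-prediction term; the exact TAD bound also contains a task-distinguishing term over states that is not reproduced here.

        \mathcal{L} = \mathbb{E}_{q(s_{1:T} \mid o_{1:T}, a_{1:T})}\Big[
        \sum_{t} \big(\ln p(o_t \mid s_t) + \ln p(r_t \mid s_t)\big)
        - \sum_{t} \mathrm{KL}\big(q(s_t \mid s_{t-1}, a_{t-1}, o_t) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\big)\Big]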

    Search for light dark matter from atmosphere in PandaX-4T

    Full text link
    We report a search for light dark matter produced through the cascading decay of η mesons, which are created in inelastic collisions between cosmic rays and Earth's atmosphere. We introduce a new, general, and publicly accessible framework designed specifically for boosted dark matter, with which we perform a full, dedicated simulation of the Earth attenuation effect on dark matter particles arriving at the detector, including both elastic and quasi-elastic processes. In the PandaX-4T commissioning data with an exposure of 0.63 tonne·year, no significant excess over background is observed. We obtain the first constraints on the interaction between light dark matter generated in the atmosphere and nuclei through a light scalar mediator. The lowest excluded cross section is 5.9 × 10⁻³⁷ cm² for a dark matter mass of 0.1 MeV/c² and a mediator mass of 300 MeV/c². The lowest upper limit on the branching ratio of η decay to dark matter is 1.6 × 10⁻⁷.

    Document clustering using sample weighting

    Get PDF
    Clustering algorithms based on sample weighting have attracted attention recently. In this paper, a novel sample-weighting clustering algorithm is presented, based on the K-Means and fuzzy C-Means algorithms. The algorithm uses academic documents as the clustering objects. The PageRank value of each document is calculated from the citation relationships among the documents and is used as the sample weight in the algorithm. Experiments show that the proposed algorithm effectively improves the performance of document clustering.
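
    A minimal sketch of sample-weighted K-Means in the spirit described above, assuming document vectors X and per-document PageRank scores w derived from the citation graph; the fuzzy C-Means variant mentioned in the abstract is not shown.

        import numpy as np

        def weighted_kmeans(X, w, k, iters=50, seed=0):
            """K-Means where each document pulls its centroid with a force
            proportional to its PageRank weight."""
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
            for _ in range(iters):
                dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
                labels = dists.argmin(axis=1)               # nearest-centroid assignment
                for j in range(k):
                    mask = labels == j
                    if mask.any():                          # PageRank-weighted mean
                        centers[j] = np.average(X[mask], axis=0, weights=w[mask])
            return labels, centers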

    A quantitative evaluation system of Chinese journals in the humanities and social sciences

    No full text
    Based on analyses of existing indicators for evaluating journals in the humanities and social sciences and on our experience in constructing the Chinese Social Science Citation Index (CSSCI), we propose a comprehensive system for evaluating Chinese academic journals in the humanities and social sciences. The system comprises 8 primary indicators, with 17 sub-indicators for multidisciplinary journals and 19 sub-indicators for discipline-specific journals. Each indicator or sub-indicator is assigned a weight according to its importance in measuring a journal's academic quality and/or impact.
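
    Read schematically (symbols assumed here, not part of the system's published definition), a journal's composite score under such a scheme is a weighted sum of its normalized indicator values:

        S_j = \sum_{i} w_i \, x_{ij}, \qquad \sum_{i} w_i = 1,

    where x_{ij} is the normalized value of indicator i for journal j and w_i the weight assigned to that indicator.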

    Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning

    Full text link
    Deep reinforcement learning models are vulnerable to adversarial attacks that can decrease a victim's cumulative expected reward by manipulating the victim's observations. Despite the efficiency of previous optimization-based methods for generating adversarial noise in supervised learning, such methods might not achieve the lowest cumulative reward since they generally do not explore the environmental dynamics. In this paper, we provide a framework to better understand existing methods by reformulating the problem of adversarial attacks on reinforcement learning in the function space. Our reformulation generates an optimal adversary in the function space of the targeted attacks, repelling them via a generic two-stage framework. In the first stage, we train a deceptive policy by hacking the environment and discover a set of trajectories routing to the lowest reward or the worst-case performance. Next, the adversary misleads the victim into imitating the deceptive policy by perturbing the observations. Compared to existing approaches, we theoretically show that our adversary is stronger under an appropriate noise level. Extensive experiments demonstrate our method's superiority in terms of efficiency and effectiveness, achieving state-of-the-art performance in both Atari and MuJoCo environments.
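
    A minimal sketch of the two-stage idea under stated assumptions: the first stage is taken as given in the form of a trained "deceptive" policy, and the second stage runs a PGD-style loop that perturbs an observation so the victim's action distribution imitates the deceptive policy; the function and parameter names are illustrative, not the paper's exact procedure.

        import torch
        import torch.nn.functional as F

        def perturb_observation(victim, deceptive, obs, eps, steps=10):
            """Stage 2 sketch: bounded perturbation that pushes the victim's
            action distribution toward the deceptive policy's distribution."""
            alpha = eps / steps
            target = deceptive(obs).detach()                 # action probabilities to imitate
            delta = torch.zeros_like(obs, requires_grad=True)
            for _ in range(steps):
                loss = F.kl_div(victim(obs + delta).log(), target, reduction="batchmean")
                loss.backward()
                with torch.no_grad():
                    delta -= alpha * delta.grad.sign()       # descend the imitation loss
                    delta.clamp_(-eps, eps)
                delta.grad.zero_()
            return (obs + delta).detach()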

    Guest editorial

    No full text

    An Implementation of Actor-Critic Algorithm on Spiking Neural Network Using Temporal Coding Method

    No full text
    Taking advantage of the faster speed, lower resource consumption, and better biological interpretability of spiking neural networks, this paper developed a novel spiking neural network reinforcement learning method using an actor-critic architecture and temporal coding. A simple improved leaky integrate-and-fire (LIF) model was used to describe the behavior of a spiking neuron. The actor-critic network structure and the update formulas using temporally encoded information were then provided. The model was examined in a decision-making task, a gridworld task, a UAV flying-through-a-window task, and a flying-basketball-avoidance task. In the 5 × 5 grid map, the learned value function was close to the ideal one and the quickest path from one state to another was found. A UAV trained by this method was able to fly through the window quickly in simulation, and an actual flight test of a UAV avoiding a flying basketball was conducted. With this model, the success rate of the test was 96% and the average decision time was 41.3 ms. The results show the effectiveness and accuracy of the temporally coded spiking neural network RL method. In conclusion, an attempt was made to provide insights into developing spiking neural network reinforcement learning methods for decision-making and autonomous control of unmanned systems.
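
    A minimal leaky integrate-and-fire update in the spirit of the neuron model described above, with assumed constants (tau, v_thresh, v_reset); the paper's improved LIF model and its temporal-coding scheme are not reproduced here.

        import numpy as np

        def lif_step(v, input_current, dt=1.0, tau=10.0, v_thresh=1.0, v_reset=0.0):
            """One Euler step of a leaky integrate-and-fire neuron.
            Returns the updated membrane potentials and a boolean spike mask."""
            v = v + dt / tau * (-v + input_current)   # leak toward rest plus injected current
            spikes = v >= v_thresh                    # spike where the threshold is crossed
            v = np.where(spikes, v_reset, v)          # reset neurons that spiked
            return v, spikes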