
    Smoothing Policies and Safe Policy Gradients

    Policy gradient algorithms are among the best candidates for the much-anticipated application of reinforcement learning to real-world control tasks, such as those arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify those meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.

    Two-Phase Iteration for Value Function Approximation and Hyperparameter Optimization in Gaussian-Kernel-Based Adaptive Critic Design

    Adaptive Dynamic Programming (ADP) with a critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in the design of a neural network that serves as a critic network, kernel-based adaptive critic design (ACD) was developed recently. There are two essential issues for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. Both rely on the assessment of sample values. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It can estimate the values of samples without revisiting them infinitely often, and the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundancies. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of the system performance by using a varying sample set.

    Learning robust policies for object manipulation with robot swarms

    Swarm robotics investigates how a large population of robots with simple actuation and limited sensors can collectively solve complex tasks. One particularly interesting application of robot swarms is autonomous object assembly. Such tasks have been solved successfully with robot swarms that are controlled by a human operator using a light source. In this paper, we present a method to solve such assembly tasks autonomously based on policy search methods. We split the assembly process into two subtasks: generating a high-level assembly plan and learning a low-level object movement policy. The assembly policy plans the trajectories for each object and the object movement policy controls the trajectory execution. Learning the object movement policy is challenging as it depends on the complex state of the swarm, which consists of an individual state for each agent. To approach this problem, we introduce a representation of the swarm which is based on Hilbert space embeddings of distributions. This representation is invariant to the number of agents in the swarm as well as to the allocation of an agent to its position in the swarm. These invariances make the learned policy robust to changes in the swarm and also reduce the search space for the policy search method significantly. We show that the resulting system is able to solve assembly tasks with varying object shapes in multiple simulation scenarios and evaluate the robustness of our representation to changes in the swarm size. Furthermore, we demonstrate that the policies learned in simulation are robust enough to be transferred to real robots.
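
The invariances described in this abstract can be illustrated with a small sketch of an empirical kernel mean embedding. This is not the authors' implementation: the random Fourier feature approximation of a Gaussian kernel, the function names `make_features` and `mean_embedding`, and all parameter values are assumptions chosen only for illustration. Averaging a feature map over the agents gives a vector whose length does not depend on the swarm size and whose value does not depend on the order in which agents are listed.

```python
import math
import random


def make_features(dim, n_features, bandwidth, seed=0):
    """Build a random Fourier feature map approximating a Gaussian kernel.

    Hypothetical helper for illustration; the abstract does not say
    which kernel or feature approximation is used.
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        # One cosine feature per random frequency/phase pair.
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return phi


def mean_embedding(agent_states, phi):
    """Average the feature vectors of all agents in the swarm."""
    feats = [phi(s) for s in agent_states]
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


phi = make_features(dim=2, n_features=16, bandwidth=1.0, seed=1)
small_swarm = [(0.1 * i, -0.2 * i) for i in range(5)]
large_swarm = [(0.01 * i, 0.0) for i in range(50)]

# Same output length for 5 and 50 agents.
print(len(mean_embedding(small_swarm, phi)), len(mean_embedding(large_swarm, phi)))
```

Reordering `small_swarm` changes the embedding only by floating-point rounding, since the mean is a symmetric function of its inputs.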

    Robust learning of object assembly tasks with an invariant representation of robot swarms

    Swarm robotics investigates how a large population of robots with simple actuation and limited sensors can collectively solve complex tasks. One particularly interesting application of robot swarms is autonomous object assembly. Such tasks have been solved successfully with robot swarms that are controlled by a human operator using a light source. In this paper, we present a method to solve such assembly tasks autonomously based on policy search methods. We split the assembly process into two subtasks: generating a high-level assembly plan and learning a low-level object movement policy. The assembly policy plans the trajectories for each object and the object movement policy controls the trajectory execution. Learning the object movement policy is challenging as it depends on the complex state of the swarm, which consists of an individual state for each agent. To approach this problem, we introduce a representation of the swarm which is based on Hilbert space embeddings of distributions. This representation is invariant to the number of agents in the swarm as well as to the allocation of an agent to its position in the swarm. These invariances make the learned policy robust to changes in the swarm and also reduce the search space for the policy search method significantly. We show that the resulting system is able to solve assembly tasks with varying object shapes in multiple simulation scenarios and evaluate the robustness of our representation to changes in the swarm size. Furthermore, we demonstrate that the policies learned in simulation are robust enough to be transferred to real robots.

    Deep Bayesian Quadrature Policy Optimization

    We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample-efficient alternatives like Bayesian quadrature methods have received little attention due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with a significantly lower variance, (ii) a consistent improvement in the sample complexity and average return for several deep policy gradient algorithms, and (iii) the uncertainty in gradient estimation that can be incorporated to further improve the performance. Comment: Conference paper: AAAI-21. Code available at https://github.com/Akella17/Deep-Bayesian-Quadrature-Policy-Optimizatio
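
The Monte-Carlo baseline that this abstract contrasts against can be sketched on a toy problem. The code below is not DBQPG and is not taken from the linked repository; it is a plain score-function (REINFORCE-style) estimator on a one-step problem, with a Gaussian policy and a quadratic reward that are both assumptions made for illustration. For this toy setup the exact gradient at `theta = 0` is 4, which makes the spread of the estimate at different batch sizes easy to see.

```python
import random
import statistics


def reward(action):
    # Hypothetical reward, maximized at action = 2.
    return -(action - 2.0) ** 2


def mc_policy_gradient(theta, batch_size, rng, sigma=1.0):
    """Score-function estimate of dJ/dtheta for a policy a ~ N(theta, sigma^2)."""
    total = 0.0
    for _ in range(batch_size):
        a = rng.gauss(theta, sigma)
        score = (a - theta) / sigma ** 2  # d/dtheta of log N(a; theta, sigma^2)
        total += reward(a) * score
    return total / batch_size


def estimate_spread(theta, batch_size, repeats, rng):
    """Empirical standard deviation of the estimator over repeated batches."""
    estimates = [mc_policy_gradient(theta, batch_size, rng) for _ in range(repeats)]
    return statistics.stdev(estimates)


rng = random.Random(0)
for n in (10, 1000):
    print(n, estimate_spread(theta=0.0, batch_size=n, repeats=100, rng=rng))
```

The spread shrinks roughly with the square root of the batch size, which is the sample-cost trade-off that motivates lower-variance estimators.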

    Rate-Splitting for Intelligent Reflecting Surface-Aided Multiuser VR Streaming

    The growing demand for virtual reality (VR) applications requires wireless systems to provide a high transmission rate to support 360-degree video streaming to multiple users simultaneously. In this paper, we propose an intelligent reflecting surface (IRS)-aided rate-splitting (RS) VR streaming system. In the proposed system, RS facilitates the exploitation of the shared interests of the users in VR streaming, and IRS creates additional propagation channels to support the transmission of high-resolution 360-degree videos. IRS also enhances the capability to mitigate the performance bottleneck caused by the requirement that all RS users have to be able to decode the common message. We formulate an optimization problem for maximization of the achievable bitrate of the 360-degree video subject to the quality-of-service (QoS) constraints of the users. We propose a deep deterministic policy gradient with imitation learning (Deep-GRAIL) algorithm, in which we leverage deep reinforcement learning (DRL) and the hidden convexity of the formulated problem to optimize the IRS phase shifts, RS parameters, beamforming vectors, and bitrate selection of the 360-degree video tiles. We also propose RavNet, which is a deep neural network customized for the policy learning in our Deep-GRAIL algorithm. Performance evaluation based on a real-world VR streaming dataset shows that the proposed IRS-aided RS VR streaming system outperforms several baseline schemes in terms of system sum-rate, achievable bitrate of the 360-degree videos, and online execution runtime. Our results also reveal the respective performance gains obtained from RS and IRS for improving the QoS in multiuser VR streaming systems. Comment: 20 pages, 12 figures. This paper has been submitted to an IEEE journal for possible publication