    Fitted Q-iteration by advantage weighted regression

    Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, more stable learning process, and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces, which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic, as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias, and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection, the policy improvement step used in FQI can be simplified to an inexpensive advantage weighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm that can even deal with high-dimensional action spaces.
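
    The policy improvement step described above can be made concrete in a few lines: given a batch of sampled state-action pairs and their advantage estimates, the policy is fit by a regression in which each sample is weighted by its exponentiated advantage. Below is a minimal numpy sketch under simplifying assumptions (a linear policy, a zero value baseline, random placeholder data, and a fixed temperature beta); it illustrates advantage weighted regression, not the authors' implementation.

    ```python
    import numpy as np

    # Placeholder batch: states S, sampled actions A_samp, critic estimates Q.
    rng = np.random.default_rng(0)
    n, state_dim, act_dim = 512, 4, 2
    S = rng.normal(size=(n, state_dim))
    A_samp = rng.normal(size=(n, act_dim))
    Q = rng.normal(size=n)          # Q(s, a) from the fitted critic
    V = np.zeros(n)                 # value baseline (assumed zero here)
    advantage = Q - V

    beta = 1.0                                        # soft-greedy temperature
    w = np.exp((advantage - advantage.max()) / beta)  # stabilised exponential weights

    # Advantage weighted regression: closed-form weighted least-squares fit
    # of a linear policy a = s @ W.
    Sw = S * w[:, None]
    W = np.linalg.solve(S.T @ Sw + 1e-6 * np.eye(state_dim), Sw.T @ A_samp)

    def policy(s):
        return s @ W  # mean action of the regressed policy

    print(policy(S[0]))
    ```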

    Hierarchical relative entropy policy search

    Many real-world problems are inherently hierarchically structured. The use of this structure in an agent’s policy may well be the key to improved scalability and higher performance. However, such hierarchical structures cannot be exploited by current policy search algorithms. We concentrate on a basic, but highly relevant, hierarchy: the ‘mixed option’ policy. Here, a gating network first decides which of the options to execute and, subsequently, the option-policy determines the action. In this paper, we reformulate learning a hierarchical policy as a latent variable estimation problem and subsequently extend Relative Entropy Policy Search (REPS) to the latent variable case. We show that our Hierarchical REPS can learn versatile solutions while improving both learning speed and the quality of the found policy in comparison to the non-hierarchical approach.
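
    To make the latent-variable view concrete, the sketch below runs a toy EM-style update for a ‘mixed option’ policy: the E-step computes option responsibilities (the latent variable), and the M-step performs reward-weighted maximum-likelihood updates of the gate and the option policies. The scalar actions, linear-Gaussian options, and fixed temperature eta (instead of solving the REPS dual problem) are simplifying assumptions, not the paper's actual algorithm.

    ```python
    import numpy as np

    # Placeholder data: contexts, sampled actions, and their rewards.
    rng = np.random.default_rng(1)
    n_samples, n_options, ctx_dim = 200, 3, 2
    contexts = rng.normal(size=(n_samples, ctx_dim))
    actions = rng.normal(size=n_samples)
    rewards = rng.normal(size=n_samples)

    eta = 1.0                                    # fixed REPS-style temperature
    w = np.exp((rewards - rewards.max()) / eta)  # per-sample reward weights

    # Each option is a linear-Gaussian policy over the context; the gate is
    # a categorical distribution over options.
    means = rng.normal(size=(n_options, ctx_dim))
    log_gate = np.full(n_options, -np.log(n_options))

    for _ in range(10):
        # E-step: responsibility of each option for each sample.
        pred = contexts @ means.T                # (n_samples, n_options)
        log_r = log_gate - 0.5 * (actions[:, None] - pred) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: reward-weighted ML updates of gate and option policies.
        wr = resp * w[:, None]
        log_gate = np.log(wr.sum(axis=0) / wr.sum())
        for o in range(n_options):
            Wo = contexts * wr[:, o:o + 1]
            means[o] = np.linalg.solve(contexts.T @ Wo + 1e-6 * np.eye(ctx_dim),
                                       Wo.T @ actions)
    ```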

    Non-parametric policy search with limited information loss

    Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem by adapting to the complexity of the dataset. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value functions, which might lead to poor convergence or oscillations in the policy update. A more robust policy update can be obtained by limiting the information loss between successive state-action distributions. In this paper, we develop a policy search algorithm with policy updates that are both robust and non-parametric. Our method can learn non-parametric control policies for infinite horizon continuous Markov decision processes with non-linear and redundant sensory representations. We investigate how approximations of the kernel function can reduce the time requirements of the demanding non-parametric computations. In our experiments, we show the strong performance of the proposed method and how it can be approximated efficiently. Finally, we show that our algorithm can learn an underpowered swing-up task on a real robot directly from image data.
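
    The information-loss limit is typically enforced by solving the REPS dual problem for a temperature eta such that the KL divergence between successive sample distributions stays at the bound epsilon. The following is a minimal sketch of that single step, assuming plain Monte-Carlo advantage estimates and omitting the kernel machinery of the actual method.

    ```python
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    returns = rng.normal(size=300)   # advantage estimates of sampled state-actions
    epsilon = 0.5                    # bound on the KL between successive distributions

    def dual(eta):
        # REPS dual g(eta); its minimiser yields weights whose KL is at most epsilon.
        eta = max(eta[0], 1e-6)
        shifted = returns - returns.max()
        return eta * epsilon + returns.max() + eta * np.log(np.mean(np.exp(shifted / eta)))

    eta = max(minimize(dual, x0=[1.0], method="Nelder-Mead").x[0], 1e-6)
    w = np.exp((returns - returns.max()) / eta)
    w /= w.sum()

    # Empirical KL of the reweighted samples w.r.t. the uniform old distribution;
    # at the dual optimum it sits close to the bound epsilon.
    kl = np.sum(w * np.log(w * len(w)))
    print(f"eta = {eta:.3f}, empirical KL = {kl:.3f} (bound {epsilon})")
    ```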

    Learning of non-parametric control policies with high-dimensional state features

    Learning complex control policies from high-dimensional sensory input is a challenge for reinforcement learning algorithms. Kernel methods that approximate value functions or transition models can address this problem. Yet, many current approaches rely on unstable greedy maximization. In this paper, we develop a policy search algorithm that integrates robust policy updates and kernel embeddings. Our method can learn non-parametric control policies for infinite horizon continuous MDPs with high-dimensional sensory representations. We show that our method outperforms related approaches, and that our algorithm can learn an underpowered swing-up task directly from high-dimensional image data.
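
    Given such robust update weights, a non-parametric policy can then be fit by weighted kernel ridge regression on the high-dimensional state features. The sketch below illustrates this generic building block with random placeholder data; the RBF kernel, bandwidth, and regularizer are arbitrary assumptions rather than the paper's exact choices.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    n, feat_dim = 150, 50                # e.g. flattened image features
    X = rng.normal(size=(n, feat_dim))   # high-dimensional state representations
    a = rng.normal(size=n)               # sampled actions
    w = rng.random(n); w /= w.sum()      # update weights (e.g. from a KL-bounded step)

    def rbf(A, B, bw=5.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bw ** 2))

    # Weighted kernel ridge regression: the policy mean mu(x) = k(x, X) @ alpha
    # is non-parametric, so its complexity adapts to the data set.
    K = rbf(X, X)
    alpha = np.linalg.solve(K + 1e-2 * np.diag(1.0 / (w * n)), a)

    def policy_mean(x_new):
        return rbf(np.atleast_2d(x_new), X) @ alpha

    print(policy_mean(X[0]))
    ```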

    State-regularized policy search for linearized dynamical systems

    Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of feedback controllers by taking advantage of local approximations of model dynamics and cost functions. Stability of the policy update is a major issue for these methods, rendering them hard to apply to highly nonlinear systems. Recent approaches combine classical Stochastic Optimal Control methods with information-theoretic bounds to control the step-size of the policy update and can even be used to train nonlinear deep control policies. These methods bound the relative entropy between the new and the old policy to ensure a stable policy update. However, despite the bound in policy space, the state distributions of two consecutive policies can still differ significantly, rendering the local approximate models invalid. To alleviate this issue, we propose enforcing a relative entropy constraint not only on the policy update, but also on the update of the state distribution around which the dynamics and cost are approximated. We present a derivation of the closed-form policy update and show that our approach outperforms related methods on two nonlinear and highly dynamic simulated systems.
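
    The proposed state regularization can be pictured as a second step-size check: besides the relative entropy bound on the policy, an update is only trusted if the state distribution stays within a KL bound of the old one, so that the local approximations remain valid. A small illustrative sketch with Gaussian state marginals and made-up numbers follows; it shows the constraint being checked, not the paper's closed-form update.

    ```python
    import numpy as np

    def kl_gauss(mu0, S0, mu1, S1):
        """KL( N(mu0, S0) || N(mu1, S1) ) between multivariate Gaussians."""
        d = len(mu0)
        S1_inv = np.linalg.inv(S1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                      + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

    # Hypothetical Gaussian state marginals before and after a policy update.
    mu_old, S_old = np.zeros(2), np.eye(2)
    mu_new, S_new = np.array([0.3, -0.1]), 1.2 * np.eye(2)

    eps_state = 0.05   # bound on how far the state distribution may move
    kl_state = kl_gauss(mu_new, S_new, mu_old, S_old)

    # The policy-space bound is assumed to hold already; this additional check
    # keeps the dynamics and cost approximations around the old states valid.
    print(f"KL(state_new || state_old) = {kl_state:.3f}, "
          f"within bound: {kl_state <= eps_state}")
    ```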

    Probabilistic prioritization of movement primitives

    Movement prioritization is a common approach to combining controllers for different tasks on redundant robots, where each task is assigned a priority. The priorities of the tasks are often hand-tuned or the result of an optimization, but seldom learned from data. This paper combines Bayesian task prioritization with probabilistic movement primitives to prioritize full motion sequences that are learned from demonstrations. Probabilistic movement primitives (ProMPs) can encode distributions of movements over full motion sequences and provide control laws to exactly follow these distributions. The probabilistic formulation allows for a natural application of Bayesian task prioritization. We extend the ProMP controllers with an additional feedback component that accounts for inaccuracies in following the distribution and allows for a more robust prioritization of primitives. We demonstrate how the task priorities can be obtained from imitation learning and how different primitives can be combined to solve even unseen task combinations. Due to the prioritization, our approach can efficiently learn a combination of tasks without requiring individual models per task combination. Further, our approach can adapt an existing primitive library by prioritizing additional controllers, for example, to implement obstacle avoidance. Hence, the need to retrain the whole library is avoided in many cases. We evaluate our approach on reaching movements under constraints with redundant simulated planar robots and two physical robot platforms, the humanoid robot “iCub” and a KUKA LWR robot arm.
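
    The Bayesian flavor of the prioritization can be illustrated by fusing the Gaussian predictions of two primitives: multiplying the Gaussians weights each task by its precision, so the more certain task dominates each dimension. A minimal sketch with hypothetical per-dimension means and variances, not the full ProMP controller:

    ```python
    import numpy as np

    # Two hypothetical primitives, each giving a Gaussian over the desired
    # joint position at the current time step (mean and variance from its ProMP).
    mu_reach, var_reach = np.array([0.5, 0.2]), np.array([0.05, 0.20])
    mu_avoid, var_avoid = np.array([0.3, 0.6]), np.array([0.50, 0.02])

    # Bayesian prioritization as a product of Gaussians: the precision
    # (inverse variance) acts as a soft, state-dependent task priority.
    precision = 1.0 / var_reach + 1.0 / var_avoid
    var_comb = 1.0 / precision
    mu_comb = var_comb * (mu_reach / var_reach + mu_avoid / var_avoid)

    print("combined mean:", mu_comb)     # dominated per dimension by the
    print("combined var :", var_comb)    # more certain (higher-priority) task
    ```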

    Probabilistic approach to physical object disentangling

    Physically disentangling entangled objects from each other is a problem encountered in waste segregation or in any task that requires disassembly of structures. Often there are no object models, and, especially with cluttered, irregularly shaped objects, the robot cannot create a model of the scene due to occlusion. One of our key insights is that, based on previous sensory input, we are only interested in moving an object out of the entanglement, around obstacles. That is, we only need to know where the robot can successfully move in order to plan the disentangling. Due to this uncertainty, we integrate information about blocked movements into a probability map. The map defines the probability of the robot successfully moving to a specific configuration. Using the failure probability of a sequence of movements as the cost, we can then plan and execute disentangling iteratively. Since our approach circumvents only previously encountered obstacles, new movements will yield information about unknown obstacles that block movement, until the robot has learned to circumvent all obstacles and disentangling succeeds. In the experiments, we use a special probabilistic version of the Rapidly exploring Random Tree (RRT) algorithm for planning and demonstrate successful disentanglement of objects both in 2-D and 3-D simulation and on a KUKA LBR 7-DOF robot. Moreover, our approach outperforms baseline methods.
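
    A rough sketch of such a probabilistic RRT variant follows: the tree grows in configuration space, each extension succeeds with the probability given by the learned map (replaced here by a hand-coded stand-in), and every node tracks the success probability of the whole movement sequence from the root. This is an illustration of the idea, not the authors' planner.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    def p_success(q):
        # Stand-in for the learned probability map: the chance that moving
        # to configuration q succeeds, estimated from blocked movements.
        return float(np.clip(1.0 - 0.4 * np.linalg.norm(q - 0.5), 0.05, 1.0))

    # Grow a tree in [0, 1]^2; log_p[i] is the log-probability that the whole
    # movement sequence from the root to node i succeeds.
    nodes, parent, log_p = [np.zeros(2)], [0], [0.0]
    goal, step = np.array([1.0, 1.0]), 0.15

    for _ in range(500):
        q_rand = rng.random(2)
        i_near = min(range(len(nodes)), key=lambda i: np.linalg.norm(nodes[i] - q_rand))
        d = q_rand - nodes[i_near]
        q_new = nodes[i_near] + step * d / (np.linalg.norm(d) + 1e-9)
        ps = p_success(q_new)
        if rng.random() < ps:              # simulate whether the move succeeds
            nodes.append(q_new)
            parent.append(i_near)
            log_p.append(log_p[i_near] + np.log(ps))
            if np.linalg.norm(q_new - goal) < step:
                break

    i_best = min(range(len(nodes)), key=lambda i: np.linalg.norm(nodes[i] - goal))
    print(f"closest node {nodes[i_best]}, sequence success prob {np.exp(log_p[i_best]):.3f}")
    ```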