
    Hi-Val: Iterative Learning of Hierarchical Value Functions for Policy Generation

    Task decomposition is effective in many applications where the global complexity of a problem makes planning and decision-making too demanding. This is true, for example, in high-dimensional robotics domains, where (1) unpredictability and modeling limitations typically prevent the manual specification of robust behaviors, and (2) learning an action policy is challenging due to the curse of dimensionality. In this work, we borrow the concept of Hierarchical Task Networks (HTNs) to decompose the learning procedure, and we exploit Upper Confidence Tree (UCT) search to introduce HOP, a novel iterative algorithm for hierarchical optimistic planning with learned value functions. To obtain better generalization and generate policies, HOP simultaneously learns and uses action values, which are used to formalize constraints within the search space and to reduce the dimensionality of the problem. We evaluate our algorithm both on a fetching task using a simulated 7-DOF KUKA lightweight arm and on a pick-and-delivery task with a Pioneer robot.
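
    The abstract combines HTN-style task decomposition, UCT search, and learned value functions. The snippet below is a minimal, hypothetical sketch of that combination, not the authors' implementation: one small UCT search per subtask, with untried actions expanded optimistically and one-step expansions bootstrapped by a learned value estimate. The `subtasks`, `simulate`, and `value_fn` arguments are assumed placeholders supplied by the user.

```python
import math

class Node:
    """UCT search node holding visit counts and summed returns."""
    def __init__(self, state):
        self.state = state
        self.visits = 0
        self.value = 0.0          # sum of sampled returns
        self.children = {}        # action -> Node

def uct_select(node, actions, c=1.4):
    """Pick the action maximizing the UCB1 score; untried actions come first."""
    best, best_score = None, -float("inf")
    for a in actions:
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return a              # optimistic expansion of untried actions
        score = child.value / child.visits + c * math.sqrt(
            math.log(node.visits) / child.visits)
        if score > best_score:
            best, best_score = a, score
    return best

def hierarchical_uct(root_state, subtasks, simulate, value_fn, n_iters=200):
    """Run a small UCT search per subtask, bootstrapping leaves with value_fn.

    subtasks : list of admissible-action sets, one per level of the hierarchy
    simulate : (state, action) -> (next_state, reward), a generative model
    value_fn : state -> estimated return, the learned value function
    """
    plan, state = [], root_state
    for admissible in subtasks:
        root = Node(state)
        for _ in range(n_iters):
            a = uct_select(root, admissible)
            nxt, reward = simulate(root.state, a)
            child = root.children.setdefault(a, Node(nxt))
            ret = reward + value_fn(nxt)      # one-step expansion + bootstrap
            for n in (root, child):
                n.visits += 1
                n.value += ret
        best_a = max(root.children, key=lambda a: root.children[a].visits)
        plan.append(best_a)
        state, _ = simulate(state, best_a)
    return plan
```

    Restricting the admissible actions per subtask stands in here for the abstract's "constraints within the search space"; a fuller implementation would also update the value function from the search statistics at each iteration.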

    Hierarchical Policy Search via Return-Weighted Density Estimation

    Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchical policy, where multiple option policies are in charge of different strategies corresponding to modes of the reward function and a gating policy selects the best option for a given context. Although HRL has been demonstrated to be promising, current state-of-the-art methods still cannot perform well in complex real-world problems due to the difficulty of identifying the modes of the reward function. In this paper, we propose a novel method called hierarchical policy search via return-weighted density estimation (HPSDE), which can efficiently identify the modes through density estimation with return-weighted importance sampling. Our proposed method finds option policies corresponding to the modes of the return function and automatically determines the number and location of option policies, which significantly reduces the burden of hyperparameter tuning. Through experiments, we demonstrate that HPSDE successfully learns option policies corresponding to the modes of the return function and that it can be applied to a challenging motion-planning problem of a redundant robotic manipulator. (Comment: The 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), 9 pages)
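
    As a rough illustration of the return-weighted density-estimation idea (a sketch under stated assumptions, not the HPSDE algorithm itself), the snippet below resamples policy parameters in proportion to exponentiated returns and fits a Gaussian mixture whose components play the role of option policies; BIC chooses the number of components, standing in for the paper's automatic model selection. Names such as `return_weighted_options`, `beta`, and `max_components` are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def return_weighted_options(params, returns, max_components=5, beta=1.0, seed=0):
    """Fit a mixture over policy parameters, weighting samples by their return.

    Weighting is approximated by resampling parameters in proportion to
    exp(beta * return), since sklearn's GaussianMixture takes no sample weights.
    Each mixture component is treated as one option policy; BIC picks how many.
    """
    rng = np.random.default_rng(seed)
    w = np.exp(beta * (returns - returns.max()))   # stabilised exponential weights
    w /= w.sum()
    idx = rng.choice(len(params), size=10 * len(params), p=w)
    weighted = params[idx]

    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(weighted)
        bic = gmm.bic(weighted)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best  # best.means_ / best.covariances_ parameterise the options

# Toy example: two return modes in a 1-D parameter space.
params = np.random.uniform(-3, 3, size=(500, 1))
returns = np.maximum(np.exp(-(params + 2) ** 2), np.exp(-(params - 2) ** 2)).ravel()
options = return_weighted_options(params, returns)
print(options.means_)
```

    Each recovered mean/covariance pair could then seed one option policy, with a gating policy choosing among them by context, which is the role the abstract assigns to the gating level of the hierarchy.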

    Online Learning with Noise: A Kernel-Based Policy-Gradient Approach

    Various forms of noise are present in the brain. The role of noise in an exploration/exploitation trade-off is cast into the framework of reinforcement learning for a complex motor-learning task. A neuro-controller consisting of a linear transformation of the input, to which Gaussian noise is added, is modeled as a stochastic controller that can be learned online in a "direct policy-gradient" scheme. The reward signal is derived from sensor information, so no direct or indirect model of the controlled system is needed. The chosen task (reaching with a multi-joint arm) is redundant and non-linear. The controller inputs are therefore projected into a higher-dimensional feature space using a topographic coding based on Gaussian kernels. We show that, with an appropriate noise level, it is possible to explore the environment and find good control solutions that can then be exploited. Moreover, the controller adapts continuously to changes in the system dynamics. The general framework of this work allows various noise models and their effects to be studied, especially since it is compatible with more complex types of stochastic neuro-controllers, as demonstrated by other works on binary or spiking networks.
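
    A minimal sketch of this kind of controller, assuming a linear readout over Gaussian-kernel features with additive Gaussian exploration noise and a REINFORCE-style (likelihood-ratio) update; it is not the paper's exact model, and `centers`, `sigma`, and the baseline are placeholder choices.

```python
import numpy as np

def rbf_features(x, centers, width=0.5):
    """Topographic coding: project the input onto a grid of Gaussian kernels."""
    d = x[None, :] - centers                        # (n_centers, dim)
    return np.exp(-np.sum(d ** 2, axis=1) / (2 * width ** 2))

def policy_gradient_step(theta, x, reward, baseline, centers, sigma=0.1, lr=0.01):
    """One online direct policy-gradient update with additive Gaussian noise.

    The motor command is a linear readout of the kernel features plus noise;
    for a Gaussian policy the likelihood-ratio gradient reduces to
    (noise / sigma^2) * features, scaled by the baselined reward.
    """
    phi = rbf_features(x, centers)                  # feature vector
    noise = np.random.normal(0.0, sigma, size=theta.shape[0])
    action = theta @ phi + noise                    # noisy motor command
    grad = np.outer(noise / sigma ** 2, phi)        # REINFORCE gradient estimate
    theta = theta + lr * (reward - baseline) * grad
    return theta, action
```

    Because the update only needs the observed reward, no model of the arm is required, which mirrors the model-free setting described in the abstract.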