Path Integral Policy Improvement with Covariance Matrix Adaptation
There has been a recent focus in reinforcement learning on addressing
continuous state and action problems by optimizing parameterized policies. PI2
is a recent example of this approach. It combines a derivation from first
principles of stochastic optimal control with tools from statistical estimation
theory. In this paper, we consider PI2 as a member of the wider family of
methods which share the concept of probability-weighted averaging to
iteratively update parameters to optimize a cost function. We compare PI2 to
other members of the same family - Cross-Entropy Methods and CMA-ES - at the
conceptual level and in terms of performance. The comparison suggests the
derivation of a novel algorithm which we call PI2-CMA for "Path Integral Policy
Improvement with Covariance Matrix Adaptation". PI2-CMA's main advantage is
that it determines the magnitude of the exploration noise automatically.
Comment: ICML201
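The probability-weighted averaging that this family of methods shares can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name, the softmax temperature `lam`, and the `cost_fn` stand-in for evaluating a policy rollout are all assumptions made for the example.

```python
import numpy as np

def pi2_cma_update(theta, sigma, cost_fn, n_rollouts=20, lam=1.0, rng=None):
    """One iteration of probability-weighted averaging (PI2-CMA-style sketch).

    theta   : current mean of the policy parameters, shape (d,)
    sigma   : exploration covariance, shape (d, d)
    cost_fn : maps one parameter sample to a scalar cost (hypothetical
              stand-in for the cost of a rollout of the parameterized policy)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta)
    # Sample perturbed parameter vectors (the exploration rollouts).
    samples = rng.multivariate_normal(theta, sigma, size=n_rollouts)
    costs = np.array([cost_fn(s) for s in samples])
    # Probability weights: lower cost -> higher weight (softmax of -cost/lam).
    z = -(costs - costs.min()) / lam
    w = np.exp(z) / np.exp(z).sum()
    # Probability-weighted averaging of the samples updates the mean ...
    new_theta = w @ samples
    # ... and of the outer products updates the covariance (the "CMA" part),
    # which is what adapts the exploration-noise magnitude automatically.
    diffs = samples - new_theta
    new_sigma = (w[:, None, None] * np.einsum('ki,kj->kij', diffs, diffs)).sum(axis=0)
    # Small jitter keeps the covariance positive definite in this toy sketch.
    return new_theta, new_sigma + 1e-6 * np.eye(d)
```

On a simple quadratic cost, iterating this update pulls the parameter mean toward the optimum while the covariance shrinks around it.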
Path integral policy improvement with differential dynamic programming
Path Integral Policy Improvement with Covariance Matrix Adaptation (PI2-CMA) is a step-based, model-free reinforcement learning approach that combines statistical estimation techniques with fundamental results from Stochastic Optimal Control. Essentially, a policy distribution is improved iteratively using reward-weighted averaging of the corresponding rollouts. It was assumed that PI2-CMA somehow exploited gradient information contained in the reward-weighted statistics. To our knowledge, we are the first to expose the principle of this gradient extraction rigorously. Our findings reveal that PI2-CMA essentially obtains gradient information similar to the forward and backward passes in the Differential Dynamic Programming (DDP) method. It is then straightforward to extend the analogy with DDP by introducing a feedback term in the policy update. This suggests a novel algorithm which we coin Path Integral Policy Improvement with Differential Dynamic Programming (PI2-DDP). The resulting algorithm is similar to the previously proposed Sampled Differential Dynamic Programming (SaDDP), but we derive the method independently as a generalization of the framework of PI2-CMA. Our derivations suggest implementing some small variations to SaDDP so as to increase performance. We validated our claims on a robot trajectory learning task.
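The DDP-style feedback term mentioned above can be illustrated with a toy rollout. This is only a hedged sketch of the general idea, not the PI2-DDP algorithm itself: the function name, the `dynamics` callback, and the gain sequence `K_gains` are illustrative assumptions.

```python
import numpy as np

def rollout_with_feedback(u_ff, K_gains, x_ref, dynamics, x0):
    """Roll out feedforward commands with a DDP-style linear feedback term.

    Instead of replaying the open-loop commands u_ff[t], each step adds a
    correction K_gains[t] @ (x_ref[t] - x), so the executed trajectory is
    pulled back toward the reference when the state deviates.
    """
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for t, u in enumerate(u_ff):
        u_t = u + K_gains[t] @ (x_ref[t] - x)  # feedback-corrected command
        x = dynamics(x, u_t)
        traj.append(x)
    return np.array(traj)
```

With simple integrator dynamics `x' = x + u`, zero feedforward commands, and a zero reference, each step halves the state error when the gain is 0.5, so the rollout contracts toward the reference.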
Prescribed Performance Control Guided Policy Improvement for Satisfying Signal Temporal Logic Tasks
Signal temporal logic (STL) provides a user-friendly interface for defining
complex tasks for robotic systems. Recent efforts aim at designing control laws
or using reinforcement learning methods to find policies which guarantee
satisfaction of these tasks. While the former suffer from the trade-off between
task specification and computational complexity, the latter encounter
difficulties in exploration as the tasks become more complex and challenging to
satisfy. This paper proposes to combine the benefits of the two approaches and
use an efficient prescribed performance control (PPC) law to guide
exploration within the reinforcement learning algorithm. The potential of the
method is demonstrated in a simulated environment through two sample
navigational tasks.
Comment: This is the extended version of the paper accepted to the 2019 American Control Conference (ACC), Philadelphia (to be published)
Covariance matrix adaptation for direct reinforcement learning
Solving continuous state and action problems by optimizing parameterized policies is a topic of recent interest in reinforcement learning. The PI2 algorithm is an example of this approach, which benefits from solid mathematical foundations drawn from stochastic optimal control and from the tools of statistical estimation theory. In this article, we consider PI2 as a member of the wider family of methods that share the concept of probability-weighted averaging to iteratively update parameters in order to optimize a cost function. We compare PI2 to other members of the same family - the Cross-Entropy Method and CMA-ES - at the conceptual level and in terms of performance. The comparison leads to the derivation of a novel algorithm that we call PI2-CMA for "Path Integral Policy Improvement with Covariance Matrix Adaptation". The main advantage of PI2-CMA is that it determines the magnitude of the exploration noise automatically