4 research outputs found
Optimizing Simulations with Noise-Tolerant Structured Exploration
We propose a simple drop-in noise-tolerant replacement for the standard
finite difference procedure used ubiquitously in blackbox optimization. In our
approach, parameter perturbation directions are defined by a family of
structured orthogonal matrices. We show that at the small cost of computing a
Fast Walsh-Hadamard/Fourier Transform (FWHT/FFT), such structured finite
differences consistently give higher-quality approximations of gradients and
Jacobians than vanilla approaches that use coordinate directions or
random Gaussian perturbations. We find that trajectory optimizers like
Iterative LQR and Differential Dynamic Programming require fewer iterations to
solve several classic continuous control tasks when our methods are used to
linearize noisy, blackbox dynamics instead of standard finite differences. By
embedding structured exploration in a quasi-Newton optimizer (LBFGS), we are
able to learn agile walking and turning policies for quadruped locomotion that
successfully transfer from simulation to actual hardware. We theoretically
justify our methods via bounds on the quality of gradient reconstruction and
provide a basis for applying them to nonsmooth problems as well.
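As a rough illustration of the core idea, one can estimate each directional derivative with an antithetic finite difference along the rows of a normalized Hadamard matrix and reassemble the gradient from the orthonormal directions. This is a minimal sketch, not the authors' exact construction: the function name structured_fd_gradient and its defaults are illustrative, and the dense scipy hadamard helper is used for brevity where the paper's point is that the matching FWHT makes the same operation fast.

```python
import numpy as np
from scipy.linalg import hadamard

def structured_fd_gradient(f, x, sigma=1e-4):
    # Perturbation directions: rows of a Hadamard matrix, scaled to be
    # orthonormal (requires len(x) to be a power of 2; pad otherwise).
    # A dense matrix is used here for clarity; an FWHT avoids the O(n^2) cost.
    n = len(x)
    D = hadamard(n) / np.sqrt(n)
    g = np.zeros(n)
    for d in D:
        # Antithetic central difference approximates <grad f(x), d>,
        # which averages out zero-mean evaluation noise.
        dd = (f(x + sigma * d) - f(x - sigma * d)) / (2.0 * sigma)
        g += dd * d  # orthonormal rows reassemble the full gradient
    return g

# Toy check on a noisy quadratic; the true gradient is 2*x.
rng = np.random.default_rng(0)
f = lambda z: float(np.dot(z, z)) + 1e-8 * rng.normal()
print(structured_fd_gradient(f, np.ones(8)))
```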
Linear interpolation gives better gradients than Gaussian smoothing in derivative-free optimization
In this paper, we consider derivative-free optimization problems, where the
objective function is smooth but is computed with some amount of noise, the
function evaluations are expensive, and no derivative information is available.
We are motivated by policy optimization problems in reinforcement learning that
have recently become popular [Choromanski et al. 2018; Fazel et al. 2018;
Salimans et al. 2016], and that can be formulated as derivative-free
optimization problems with the aforementioned characteristics. In each of these
works some approximation of the gradient is constructed and a (stochastic)
gradient method is applied. In [Salimans et al. 2016] the gradient information
is aggregated along Gaussian directions, while in [Choromanski et al. 2018] it
is computed along orthogonal directions. We provide a convergence rate analysis
for a first-order line search method, similar to the ones used in the
literature, and derive the conditions on the gradient approximations that
ensure this convergence. We then demonstrate via rigorous analysis of the
variance and by numerical comparisons on reinforcement learning tasks that the
Gaussian sampling method used in [Salimans et al. 2016] is significantly
inferior to the orthogonal sampling used in [Choromanski et al. 2018], as well
as to more general interpolation methods.
Comment: 14 pages, 2 figures. arXiv admin note: text overlap with arXiv:1905.0133
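To make the comparison concrete, the two kinds of estimators might be sketched as below, assuming a simple forward-difference form for both; the function names, sample counts, and quadratic test function are illustrative, not the paper's experimental setup.

```python
import numpy as np

def gaussian_smoothing_grad(f, x, sigma, num_samples, rng):
    # Monte Carlo estimator of the Gaussian-smoothed gradient,
    # in the style of evolution-strategies methods: high variance
    # unless many samples are used.
    fx = f(x)
    us = rng.normal(size=(num_samples, len(x)))
    return np.mean([((f(x + sigma * u) - fx) / sigma) * u for u in us], axis=0)

def interpolation_grad(f, x, sigma, rng):
    # Linear interpolation: fit g so that the linear model matches the
    # sampled function values, here a square system solved by least squares.
    n = len(x)
    U = rng.normal(size=(n, n))
    fx = f(x)
    rhs = np.array([(f(x + sigma * u) - fx) / sigma for u in U])
    g, *_ = np.linalg.lstsq(U, rhs, rcond=None)
    return g

rng = np.random.default_rng(1)
f = lambda z: float(np.dot(z, z))                   # true gradient is 2*x
x = np.ones(4)
print(gaussian_smoothing_grad(f, x, 1e-2, 4, rng))  # noisy estimate
print(interpolation_grad(f, x, 1e-2, rng))          # near-exact here
```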
Policies Modulating Trajectory Generators
We propose an architecture for learning complex controllable behaviors by
having simple Policies Modulate Trajectory Generators (PMTG), a powerful
combination that can provide both memory and prior knowledge to the controller.
The result is a flexible architecture that is applicable to a class of problems
with periodic motion for which one has insight into the class of
trajectories that might lead to a desired behavior. We illustrate the basics of
our architecture using a synthetic control problem, then go on to learn
speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement
Learning and Evolutionary Strategies. We demonstrate that a simple linear
policy, when paired with a parametric Trajectory Generator for quadrupedal
gaits, can induce walking behaviors with controllable speed from 4-dimensional
IMU observations alone, and can be learned in under 1000 rollouts. We also
transfer these policies to a real robot and show locomotion with controllable
forward velocity.
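A minimal sketch of the PMTG control flow described above, assuming a sinusoidal trajectory generator whose frequency and amplitude are modulated by the policy; the class and function names, the phase features, and the one-dimensional output are illustrative simplifications of the quadruped gait generator.

```python
import numpy as np

class TrajectoryGenerator:
    # A one-dimensional periodic trajectory generator; PMTG uses a
    # leg-trajectory generator for quadruped gaits, this is just the skeleton.
    def __init__(self, dt=0.01):
        self.phase = 0.0
        self.dt = dt

    def step(self, frequency, amplitude):
        # Advance the internal phase at the commanded frequency (the TG's
        # "memory"), then emit the desired actuator target.
        self.phase = (self.phase + 2.0 * np.pi * frequency * self.dt) % (2.0 * np.pi)
        return amplitude * np.sin(self.phase)

def pmtg_action(policy, tg, observation):
    # The policy sees the observation plus the TG phase, and outputs the TG
    # modulation parameters plus an additive correction to the TG output.
    features = np.concatenate([observation, [np.sin(tg.phase), np.cos(tg.phase)]])
    frequency, amplitude, residual = policy(features)
    return tg.step(frequency, amplitude) + residual

# Example: a linear policy, as in the abstract, mapping features to 3 outputs.
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(3, 6))  # 4-D IMU obs + 2 phase features
policy = lambda feats: W @ feats
tg = TrajectoryGenerator()
print(pmtg_action(policy, tg, np.zeros(4)))
```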
RMA: Rapid Motor Adaptation for Legged Robots
Successful real-world deployment of legged robots would require them to adapt
in real time to unseen scenarios like changing terrains, changing payloads, and
wear and tear. This paper presents the Rapid Motor Adaptation (RMA) algorithm
to solve this problem of real-time online adaptation in quadruped robots. RMA
consists of two components: a base policy and an adaptation module. The
combination of these components enables the robot to adapt to novel situations
in fractions of a second. RMA is trained completely in simulation without using
any domain knowledge like reference trajectories or predefined foot trajectory
generators and is deployed on the A1 robot without any fine-tuning. We train
RMA on a varied terrain generator using bioenergetics-inspired rewards and
deploy it on a variety of difficult terrains including rocky, slippery, and
deformable surfaces in environments with grass, long vegetation, concrete,
pebbles, stairs, sand, etc. RMA shows state-of-the-art performance across
diverse real-world as well as simulation experiments. Video results at
https://ashish-kmr.github.io/rma-legged-robots/
Comment: RSS 2021
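The two-component structure described above could be sketched as follows; all layer sizes, dimensions, and the MLP stand-in for the paper's convolutional history encoder are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RMAController(nn.Module):
    # Sketch of RMA's two components; every dimension here is illustrative.
    def __init__(self, state_dim=30, action_dim=12, latent_dim=8, history_len=50):
        super().__init__()
        # Base policy: acts on the current state plus a latent extrinsics
        # vector z (trained in simulation, where privileged information
        # such as terrain and payload parameters supervises z).
        self.base_policy = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )
        # Adaptation module: regresses z from recent state-action history,
        # which is what enables adaptation in fractions of a second at
        # deployment time (the paper uses a 1-D conv encoder; an MLP keeps
        # this sketch short).
        self.adaptation = nn.Sequential(
            nn.Linear(history_len * (state_dim + action_dim), 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, state, history):
        # history: (batch, history_len, state_dim + action_dim)
        z_hat = self.adaptation(history.flatten(start_dim=1))
        return self.base_policy(torch.cat([state, z_hat], dim=-1))

controller = RMAController()
state = torch.zeros(1, 30)
history = torch.zeros(1, 50, 42)
print(controller(state, history).shape)  # torch.Size([1, 12])
```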