Replay across Experiments: A Natural Extension of Off-Policy RL
Replaying data is a principal mechanism underlying the stability and data
efficiency of off-policy reinforcement learning (RL). We present an effective
yet simple framework to extend the use of replays across multiple experiments,
minimally adapting the RL workflow for sizeable improvements in controller
performance and research iteration times. At its core, Replay Across
Experiments (RaE) involves reusing experience from previous experiments to
improve exploration and bootstrap learning, while requiring minimal changes
compared to prior work. We empirically show benefits across a
number of RL algorithms and challenging control domains spanning both
locomotion and manipulation, including hard exploration tasks from egocentric
vision. Through comprehensive ablations, we demonstrate robustness to the
quality and amount of data available and various hyperparameter choices.
Finally, we discuss how our approach can be applied more broadly across
research life cycles and can increase resilience by reloading data across
random seeds or hyperparameter variations.
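At its core, the recipe is simply to seed the replay buffer of a new experiment with transitions saved from earlier ones. A minimal sketch in Python follows, assuming a generic off-policy setup; the MixedReplayBuffer class, the mix_ratio knob, and the transition format are illustrative assumptions, not the paper's exact interface:

    import random
    from collections import deque

    class MixedReplayBuffer:
        """Replay buffer that mixes fresh experience with data reloaded
        from earlier experiments, in the spirit of RaE."""

        def __init__(self, capacity, prior_transitions, mix_ratio=0.5):
            self.fresh = deque(maxlen=capacity)   # transitions from the current run
            self.prior = list(prior_transitions)  # transitions reloaded from prior runs
            self.mix_ratio = mix_ratio            # fraction of each batch from prior data

        def add(self, transition):
            self.fresh.append(transition)

        def sample(self, batch_size):
            n_prior = min(int(batch_size * self.mix_ratio), len(self.prior))
            n_fresh = min(batch_size - n_prior, len(self.fresh))
            batch = random.sample(self.prior, n_prior)
            batch += random.sample(list(self.fresh), n_fresh)
            random.shuffle(batch)
            return batch

A new experiment (or a restarted seed or hyperparameter variant) would load transitions dumped by earlier runs and pass them as prior_transitions; the off-policy update rule itself is left untouched.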
Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning
We investigate whether Deep Reinforcement Learning (Deep RL) is able to
synthesize, for a low-cost, miniature humanoid robot, sophisticated and safe
movement skills that can be composed into complex behavioral strategies in
dynamic environments. We used Deep RL to train a humanoid robot with 20
actuated joints to play a simplified one-versus-one (1v1) soccer game. We first
trained individual skills in isolation and then composed those skills
end-to-end in a self-play setting. The resulting policy exhibits robust and
dynamic movement skills such as rapid fall recovery, walking, turning, kicking
and more; and transitions between them in a smooth, stable, and efficient
manner, well beyond what is intuitively expected from the robot. The agents
also developed a basic strategic understanding of the game, and learned, for
instance, to anticipate ball movements and to block opponent shots. The full
range of behaviors emerged from a small set of simple rewards. Our agents were
trained in simulation and transferred to real robots zero-shot. We found that a
combination of sufficiently high-frequency control, targeted dynamics
randomization, and perturbations during training in simulation enabled
good-quality transfer, despite significant unmodeled effects and variations
across robot instances. Although the robots are inherently fragile, minor
hardware modifications together with basic regularization of the behavior
during training led the robots to learn safe and effective movements while
still performing in a dynamic and agile way. Indeed, even though the agents
were optimized for scoring, in experiments they walked 156% faster, took 63%
less time to get up, and kicked 24% faster than a scripted baseline, while
efficiently combining the skills to achieve the longer term objectives.
Examples of the emergent behaviors and full 1v1 matches are available on the
supplementary website.
Comment: Project website: https://sites.google.com/view/op3-socce
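The sim-to-real ingredients named above, dynamics randomization and perturbations during simulated training, can be sketched as follows. This is an illustrative Python sketch only: the sim object and all of its methods are hypothetical stand-ins for a simulator API, and every parameter range is an assumed placeholder rather than a value from the paper.

    import random

    def randomize_dynamics(sim):
        # Resample selected physical parameters at the start of each episode
        # so the policy cannot overfit to one exact dynamics model.
        sim.scale_joint_damping(random.uniform(0.8, 1.2))   # hypothetical API
        sim.scale_body_masses(random.uniform(0.9, 1.1))     # hypothetical API
        sim.set_ground_friction(random.uniform(0.5, 1.5))   # hypothetical API

    def maybe_push(sim, push_prob=0.05, max_force=10.0):
        # Occasionally apply a random external force so the policy is exposed
        # to disturbances and learns robust recovery behaviors.
        if random.random() < push_prob:
            force = [random.uniform(-max_force, max_force) for _ in range(3)]
            sim.apply_force_to_torso(force)                 # hypothetical API

Calling randomize_dynamics at every episode reset and maybe_push at every control step is one common way to wire such ingredients into a training loop.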