Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments
In the NIPS 2017 Learning to Run challenge, participants were tasked with
building a controller for a musculoskeletal model to make it run as fast as
possible through an obstacle course. Top participants were invited to describe
their algorithms. In this work, we present eight solutions that used deep
reinforcement learning approaches, based on algorithms such as Deep
Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region
Policy Optimization. Many solutions use similar relaxations and heuristics,
such as reward shaping, frame skipping, discretization of the action space,
symmetry, and policy blending. However, each of the eight teams implemented
different modifications of the known algorithms.
Comment: 27 pages, 17 figures
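Several of the shared heuristics above, frame skipping in particular, are environment-level wrappers rather than changes to the learning algorithms themselves. The minimal sketch below illustrates such an action-repeat (frame-skip) wrapper; it is not any team's actual code, and the class name and the Gym-style step() returning (obs, reward, done, info) are assumptions.

# Illustrative frame-skip (action repeat) wrapper: the chosen action is applied
# for `skip` consecutive simulator steps and the per-step rewards are summed,
# which lowers the effective control frequency and the number of policy queries.
# Assumes a Gym-style environment whose step() returns (obs, reward, done, info).
class FrameSkip:
    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

Wrapping the simulated environment this way leaves the policy and learning algorithm untouched, which is why it combines easily with the other heuristics listed above.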
Reinforcement Learning with Perturbed Rewards
Recent studies have shown that reinforcement learning (RL) models are
vulnerable in various noisy scenarios. For instance, the observed reward
channel is often subject to noise in practice (e.g., when rewards are collected
through sensors), and is therefore not credible. In addition, for applications
such as robotics, a deep reinforcement learning (DRL) algorithm can be
manipulated to produce arbitrary errors by receiving corrupted rewards. In this
paper, we consider noisy RL problems with perturbed rewards, which can be
approximated with a confusion matrix. We develop a robust RL framework that
enables agents to learn in noisy environments where only perturbed rewards are
observed. Our solution framework builds on existing RL/DRL algorithms and is
the first to address the biased noisy reward setting without any assumption on
the true noise distribution (e.g., the zero-mean Gaussian noise assumed in
previous works). The core ideas of our solution include estimating a reward confusion
matrix and defining a set of unbiased surrogate rewards. We prove convergence
guarantees and derive the sample complexity of our approach. Extensive experiments on
different DRL platforms show that trained policies based on our estimated
surrogate reward can achieve higher expected rewards and converge faster than
existing baselines. For instance, the state-of-the-art PPO algorithm obtains
84.6% and 80.8% improvements in average score on five Atari games at reward
error rates of 10% and 30%, respectively.
Comment: AAAI 2020 (Spotlight)
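As a concrete illustration of the surrogate-reward construction, the sketch below shows how unbiased surrogate rewards can be obtained from an estimated confusion matrix over a finite set of reward values. It is a minimal sketch under those assumptions, not the authors' released implementation; the function name and the example noise levels are illustrative.

import numpy as np

# Sketch: reward_values lists the possible true rewards, and
# confusion[i, j] = P(observe reward_values[j] | true reward is reward_values[i]).
# Solving confusion @ r_hat = reward_values yields surrogate rewards whose
# expectation under the perturbation equals the true reward (unbiasedness).
def surrogate_rewards(reward_values, confusion):
    C = np.asarray(confusion, dtype=float)
    r = np.asarray(reward_values, dtype=float)
    return np.linalg.solve(C, r)

# Example: binary rewards {-1, +1} with a symmetric 20% flip rate.
rewards = [-1.0, 1.0]
C = [[0.8, 0.2],
     [0.2, 0.8]]
r_hat = surrogate_rewards(rewards, C)
# An observed -1 is replaced by r_hat[0], an observed +1 by r_hat[1];
# here r_hat is approximately [-1.667, +1.667].

During training, each observed reward is replaced by its surrogate value before the policy update, which is what removes the bias introduced by the perturbation.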