Reinforcement Learning with Perturbed Rewards
Recent studies have shown that reinforcement learning (RL) models are vulnerable in various noisy scenarios. For instance, the observed reward channel is often subject to noise in practice (e.g., when rewards are collected through sensors) and therefore cannot be fully trusted. In addition, for applications
such as robotics, a deep reinforcement learning (DRL) algorithm can be
manipulated to produce arbitrary errors by receiving corrupted rewards. In this
paper, we consider noisy RL problems with perturbed rewards, which can be
approximated with a confusion matrix. We develop a robust RL framework that
enables agents to learn in noisy environments where only perturbed rewards are
observed. Our solution framework builds on existing RL/DRL algorithms and is the first to address the biased noisy reward setting without any assumption on the true noise distribution (e.g., the zero-mean Gaussian noise assumed in previous work). The core ideas of our solution are to estimate a reward confusion matrix and to define a set of unbiased surrogate rewards. We prove the
convergence and sample complexity of our approach. Extensive experiments on
different DRL platforms show that policies trained with our estimated surrogate reward achieve higher expected rewards and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm obtains 84.6% and 80.8% improvements in average score across five Atari games, with error rates of 10% and 30%, respectively.
Comment: AAAI 2020 (Spotlight)
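The surrogate-reward idea described in the abstract can be illustrated with a small sketch. The snippet below is a minimal NumPy illustration of the general construction, not the paper's implementation: assuming a finite set of reward levels and a known (or estimated) confusion matrix C, with C[i, j] the probability of observing level j when the true level is i, solving C r_hat = r yields surrogate values whose expectation under the noise equals the true reward. The function name and the example flip rates are hypothetical.

```python
import numpy as np

def surrogate_rewards(reward_levels, confusion):
    """Return one surrogate value per observed reward level.

    Solving C @ r_hat = r makes the surrogate unbiased under the noise:
    E_noise[r_hat(observed) | true level i] = reward_levels[i].
    """
    C = np.asarray(confusion, dtype=float)
    r = np.asarray(reward_levels, dtype=float)
    return np.linalg.solve(C, r)  # r_hat = C^{-1} r

# Binary example (illustrative numbers): true rewards {-1, +1},
# with 10% and 30% flip probabilities respectively.
levels = [-1.0, +1.0]
C = [[0.9, 0.1],   # true -1 observed as -1 / +1
     [0.3, 0.7]]   # true +1 observed as -1 / +1
r_hat = surrogate_rewards(levels, C)

# During training, each observed (perturbed) reward would be replaced by
# its surrogate value before being fed to the RL/DRL algorithm.
observed_level_index = 1        # e.g. the agent observed +1
print(r_hat, r_hat[observed_level_index])
```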
Stable deep reinforcement learning method by predicting uncertainty in rewards as a subtask
In recent years, a variety of tasks have been accomplished by deep
reinforcement learning (DRL). However, when applying DRL to tasks in a
real-world environment, designing an appropriate reward is difficult. Rewards
obtained via actual hardware sensors may include noise, misinterpretation, or
failed observations. The learning instability caused by these unstable signals
is a problem that remains to be solved in DRL. In this work, we propose an
approach that extends existing DRL models by adding a subtask that directly estimates the variance contained in the reward signal. The feature map learned by this subtask in the critic network is then passed to the actor network. This enables stable learning that is robust to the effects of
potential noise. The results of experiments in the Atari game domain with
unstable reward signals show that our method stabilizes training convergence.
We also discuss the extensibility of the model by visualizing feature maps.
This approach has the potential to make DRL more practical for use in noisy, real-world scenarios.
Comment: Published as a conference paper at ICONIP 202
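As a rough illustration of this idea, the sketch below shows a simplified actor-critic module in PyTorch with an auxiliary head that predicts reward variance from the same feature map the actor consumes. This is a hedged simplification, not the authors' architecture: the class and layer names are hypothetical, the actor, critic, and variance subtask share one encoder, and the auxiliary loss is only indicated in a comment.

```python
import torch
import torch.nn as nn

class NoiseAwareActorCritic(nn.Module):
    """Illustrative actor-critic with a reward-variance subtask (hypothetical names)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Shared encoder whose features are shaped by the variance subtask.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)            # critic: state value
        self.reward_var_head = nn.Linear(hidden, 1)       # subtask: reward variance
        self.policy_head = nn.Linear(hidden, n_actions)   # actor: action logits

    def forward(self, obs: torch.Tensor):
        feat = self.encoder(obs)                          # feature map shared by all heads
        value = self.value_head(feat)
        # Softplus keeps the predicted variance positive.
        reward_var = nn.functional.softplus(self.reward_var_head(feat))
        logits = self.policy_head(feat)                   # actor consumes the shared features
        return logits, value, reward_var

# Training would add an auxiliary loss on reward_var (e.g. a Gaussian negative
# log-likelihood of the observed noisy reward) alongside the usual actor and
# critic losses, so the shared features become noise-aware.
model = NoiseAwareActorCritic(obs_dim=8, n_actions=4)
logits, value, reward_var = model(torch.randn(2, 8))
```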