Correcting discount-factor mismatch in on-policy policy gradient methods
The policy gradient theorem gives a convenient form of the policy gradient in
terms of three factors: an action value, a gradient of the action likelihood,
and a state distribution involving discounting called the \emph{discounted
stationary distribution}. But commonly used on-policy methods based on the
policy gradient theorem ignore the discount factor in the state distribution,
which is technically incorrect and may even cause degenerate learning behavior
in some environments. An existing solution corrects this discrepancy by using
$\gamma^t$ as a factor in the gradient estimate. However, this solution is not
widely adopted and does not work well in tasks where the later states are
similar to earlier states. We introduce a novel distribution correction to
account for the discounted stationary distribution that can be plugged into
many existing gradient estimators. Our correction circumvents the performance
degradation associated with the $\gamma^t$ correction while incurring lower variance.
Importantly, compared to the uncorrected estimators, our algorithm provides
improved state emphasis to evade suboptimal policies in certain environments
and consistently matches or exceeds the original performance on several OpenAI
gym and DeepMind suite benchmarks.
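For reference, the sketch below is a minimal illustration of the existing $\gamma^t$ correction mentioned above, not the paper's proposed distribution correction; the function and variable names are assumptions made for illustration. It contrasts the commonly used uncorrected on-policy estimator with the $\gamma^t$-weighted one on a single episode.

# Minimal illustration (not the paper's method): the classic gamma^t-corrected
# policy-gradient estimate versus the common uncorrected one, for one episode.
import torch

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over the episode.
    returns = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def pg_surrogate_loss(log_probs, rewards, gamma, gamma_t_correction=True):
    # log_probs: 1-D tensor of log pi(a_t | s_t) with gradients enabled.
    # rewards:   per-step rewards from the same episode.
    returns = discounted_returns(torch.as_tensor(rewards, dtype=torch.float32), gamma)
    T = len(returns)
    # gamma^t re-weights states toward the discounted stationary distribution;
    # dropping it (gamma_t_correction=False) is the common but biased practice.
    weights = (gamma ** torch.arange(T, dtype=torch.float32)
               if gamma_t_correction else torch.ones(T))
    return -(weights * returns * log_probs).sum()

Calling pg_surrogate_loss with gamma_t_correction=False recovers the uncorrected estimator the abstract describes as technically incorrect, while True applies the existing $\gamma^t$ fix whose drawbacks the proposed correction is designed to avoid.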
Real-Time Reinforcement Learning for Vision-Based Robotics Utilizing Local and Remote Computers
Real-time learning is crucial for robotic agents adapting to ever-changing,
non-stationary environments. A common setup for a robotic agent is to have two
different computers simultaneously: a resource-limited local computer tethered
to the robot and a powerful remote computer connected wirelessly. Given such a
setup, it is unclear to what extent the performance of a learning system can be
affected by resource limitations and how to efficiently use the wirelessly
connected powerful computer to compensate for any performance loss. In this
paper, we implement a real-time learning system called the Remote-Local
Distributed (ReLoD) system to distribute computations of two deep reinforcement
learning (RL) algorithms, Soft Actor-Critic (SAC) and Proximal Policy
Optimization (PPO), between a local and a remote computer. The performance of
the system is evaluated on two vision-based control tasks developed using a
robotic arm and a mobile robot. Our results show that SAC's performance
degrades heavily on a resource-limited local computer. Strikingly, when all
computations of the learning system are deployed on a remote workstation, SAC
fails to compensate for the performance loss, indicating that, without careful
consideration, using a powerful remote computer may not result in performance
improvement. However, a carefully chosen distribution of computations of SAC
consistently and substantially improves its performance on both tasks. On the
other hand, the performance of PPO remains largely unaffected by the
distribution of computations. In addition, when all computations happen solely
on a powerful tethered computer, the performance of our system remains on par
with an existing system that is well-tuned for using a single machine. ReLoD is
the only publicly available system for real-time RL that applies to multiple
robots for vision-based tasks.
Comment: Appears in Proceedings of the 2023 International Conference on
Robotics and Automation (ICRA). Source code at
https://github.com/rlai-lab/relod and companion video at
https://youtu.be/7iZKryi1xS
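For intuition about the local/remote split studied above, the following is a minimal sketch and not the ReLoD implementation (see the repository linked in the comment for that); the process layout, queue names, and stand-in computations are assumptions. A local process acts in real time and streams transitions out, while a remote process consumes them for learning updates and pushes parameters back; a real deployment would replace the in-machine queues with a wireless TCP link.

# Minimal sketch of a local/remote split; NOT the ReLoD implementation.
# The local process acts in real time and never blocks on learning, while the
# remote process runs the (stand-in) gradient updates and sends parameters back.
import multiprocessing as mp
import random
import time

def local_worker(transition_q, param_q):
    # Resource-limited side: real-time control loop.
    policy_version = 0
    for _ in range(100):
        obs = [random.random() for _ in range(4)]   # stand-in for a camera frame
        action = sum(obs) * 0.1                     # stand-in for policy inference
        reward = random.random()
        transition_q.put((obs, action, reward))
        if not param_q.empty():                     # non-blocking parameter sync;
            policy_version = param_q.get()          # new weights would be loaded here
        time.sleep(0.01)                            # fixed control period

def remote_worker(transition_q, param_q):
    # Powerful side: consume transitions, run updates, push fresh parameters.
    buffer, version = [], 0
    while len(buffer) < 100:
        buffer.append(transition_q.get())
        if len(buffer) % 20 == 0:                   # stand-in for a SAC/PPO update
            version += 1
            param_q.put(version)

if __name__ == "__main__":
    transitions, params = mp.Queue(), mp.Queue()
    local = mp.Process(target=local_worker, args=(transitions, params))
    remote = mp.Process(target=remote_worker, args=(transitions, params))
    local.start(); remote.start()
    local.join(); remote.join()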
MaDi: Learning to Mask Distractions for Generalization in Visual Deep Reinforcement Learning
The visual world provides an abundance of information, but many input pixels
received by agents often contain distracting stimuli. Autonomous agents need
the ability to distinguish useful information from task-irrelevant perceptions,
enabling them to generalize to unseen environments with new distractions.
Existing works approach this problem using data augmentation or large auxiliary
networks with additional loss functions. We introduce MaDi, a novel algorithm
that learns to mask distractions by the reward signal only. In MaDi, the
conventional actor-critic structure of deep reinforcement learning agents is
complemented by a small third sibling, the Masker. This lightweight neural
network generates a mask to determine what the actor and critic will receive,
such that they can focus on learning the task. The masks are created
dynamically, depending on the current input. We run experiments on the DeepMind
Control Generalization Benchmark, the Distracting Control Suite, and a real UR5
Robotic Arm. Our algorithm improves the agent's focus with useful masks, while
its efficient Masker network only adds 0.2% more parameters to the original
structure, in contrast to previous work. MaDi consistently achieves
generalization results better than or competitive with state-of-the-art methods.
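To make the Masker idea concrete, here is a minimal PyTorch sketch under stated assumptions (layer sizes, names, and shapes are illustrative, not the authors' code): a tiny convolutional network outputs a per-pixel mask in [0, 1] that is multiplied with the observation before it reaches the actor and critic, so gradients from their reward-driven losses are the only training signal it receives.

# Minimal sketch of a Masker-style module (assumed sizes and names, not the
# authors' code): the mask gates the observation before actor and critic see it.
import torch
import torch.nn as nn

class Masker(nn.Module):
    # Lightweight network that soft-masks task-irrelevant pixels.
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),                       # per-pixel mask values in [0, 1]
        )

    def forward(self, obs):
        mask = self.net(obs)                    # (B, 1, H, W), broadcast over channels
        return obs * mask, mask

masker = Masker()
obs = torch.rand(2, 3, 84, 84)                  # batch of RGB frames
masked_obs, mask = masker(obs)
# Actor and critic losses computed on masked_obs backpropagate into the Masker,
# so the reward signal alone shapes which pixels get suppressed.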