An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
In this paper we introduce the idea of improving the performance of
parametric temporal-difference (TD) learning algorithms by selectively
emphasizing or de-emphasizing their updates on different time steps. In
particular, we show that varying the emphasis of linear TD(λ)'s updates
in a particular way causes its expected update to become stable under
off-policy training. The only prior model-free TD methods to achieve this with
per-step computation linear in the number of function approximation parameters
are the gradient-TD family of methods including TDC, GTD(λ), and
GQ(λ). Compared to these methods, our _emphatic TD(λ)_ is
simpler and easier to use; it has only one learned parameter vector and one
step-size parameter. Our treatment includes general state-dependent discounting
and bootstrapping functions, and a way of specifying varying degrees of
interest in accurately valuing different states.
Comment: 29 pages. This is a significant revision based on the first set of reviews. The most important change was to signal early that the main result is about stability, not convergence.
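For reference, the per-step emphatic TD(λ) update described in the abstract can be sketched as follows (linear features x_t, interest I_t, importance-sampling ratio ρ_t, state-dependent γ_t and λ_t; reproduced from memory, so consult the paper for exact definitions and initializations):

    \delta_t = R_{t+1} + \gamma_{t+1}\,\theta_t^\top x_{t+1} - \theta_t^\top x_t   % TD error
    F_t = \rho_{t-1}\,\gamma_t\,F_{t-1} + I_t                                      % followon trace
    M_t = \lambda_t\,I_t + (1 - \lambda_t)\,F_t                                    % emphasis
    e_t = \rho_t\,(\gamma_t\,\lambda_t\,e_{t-1} + M_t\,x_t)                        % eligibility trace
    \theta_{t+1} = \theta_t + \alpha\,\delta_t\,e_t                                % weight update

The emphasis M_t is what distinguishes this update from ordinary off-policy TD(λ); everything else uses only the single weight vector θ and step size α noted in the abstract.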
Utility-based Perturbed Gradient Descent: An Optimizer for Continual Learning
Modern representation learning methods often struggle to adapt quickly under
non-stationarity because they suffer from catastrophic forgetting and decaying
plasticity. Such problems prevent learners from fast adaptation since they may
forget useful features or have difficulty learning new ones. Hence, these
methods are rendered ineffective for continual learning. This paper proposes
Utility-based Perturbed Gradient Descent (UPGD), an online learning algorithm
well-suited for continual learning agents. UPGD protects useful weights or
features from forgetting and perturbs less useful ones based on their
utilities. Our empirical results show that UPGD helps reduce forgetting and
maintain plasticity, enabling modern representation learning methods to work
effectively in continual learning.
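As a rough sketch of the mechanism (notation illustrative rather than taken verbatim from the paper), the update for a weight w_i gates both the gradient and an injected perturbation ξ by a scaled utility ū_i ∈ [0, 1]:

    w_{i,t+1} = w_{i,t} - \alpha\,(1 - \bar{u}_{i,t})\,\left(\nabla_{w_i} L_t + \xi_t\right), \qquad \xi_t \sim \mathcal{N}(0, \sigma^2)

High-utility weights therefore receive small gradient steps and little noise (protection from forgetting), while low-utility weights are updated and perturbed more aggressively (restoring plasticity).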
Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning
Deep representation learning methods struggle with continual learning,
suffering from both catastrophic forgetting of useful units and loss of
plasticity, often due to rigid and unuseful units. While many methods address
these two issues separately, only a few currently deal with both
simultaneously. In this paper, we introduce Utility-based Perturbed Gradient
Descent (UPGD) as a novel approach for the continual learning of
representations. UPGD combines gradient updates with perturbations, where it
applies smaller modifications to more useful units, protecting them from
forgetting, and larger modifications to less useful units, rejuvenating their
plasticity. We use a challenging streaming learning setup where continual
learning problems have hundreds of non-stationarities and unknown task
boundaries. We show that many existing methods suffer from at least one of the
issues, predominantly manifested by their decreasing accuracy over tasks. On
the other hand, UPGD continues to improve performance and surpasses or is
competitive with all methods in all problems. Finally, in extended
reinforcement learning experiments with PPO, we show that while Adam exhibits a
performance drop after initial learning, UPGD avoids it by addressing both
continual learning issues.
Comment: Published in the Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). Code is available at https://github.com/mohmdelsayed/upg
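A minimal PyTorch-style sketch of one such utility-gated step, assuming a first-order utility estimate u ≈ −g ⊙ w and a sigmoid normalization; the function name, scaling, and hyperparameters are illustrative and not taken from the released implementation:

    import torch

    def upgd_step(params, grads, lr=1e-2, noise_std=1e-3):
        # One utility-gated, perturbed update (illustrative sketch of a
        # UPGD-style step): useful weights get small steps and little noise,
        # less useful weights are updated and perturbed more strongly.
        with torch.no_grad():
            for w, g in zip(params, grads):
                utility = -g * w                               # first-order utility estimate per weight
                gate = 1.0 - torch.sigmoid(utility / (utility.abs().max() + 1e-8))
                noise = noise_std * torch.randn_like(w)
                w -= lr * gate * (g + noise)                   # gated gradient step plus gated perturbation

Used in place of a plain SGD step, this keeps high-utility units nearly frozen while continually rejuvenating units whose estimated utility is low.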
Real-Time Reinforcement Learning for Vision-Based Robotics Utilizing Local and Remote Computers
Real-time learning is crucial for robotic agents adapting to ever-changing,
non-stationary environments. A common setup for a robotic agent is to have two
different computers simultaneously: a resource-limited local computer tethered
to the robot and a powerful remote computer connected wirelessly. Given such a
setup, it is unclear to what extent the performance of a learning system can be
affected by resource limitations and how to efficiently use the wirelessly
connected powerful computer to compensate for any performance loss. In this
paper, we implement a real-time learning system called the Remote-Local
Distributed (ReLoD) system to distribute computations of two deep reinforcement
learning (RL) algorithms, Soft Actor-Critic (SAC) and Proximal Policy
Optimization (PPO), between a local and a remote computer. The performance of
the system is evaluated on two vision-based control tasks developed using a
robotic arm and a mobile robot. Our results show that SAC's performance
degrades heavily on a resource-limited local computer. Strikingly, when all
computations of the learning system are deployed on a remote workstation, SAC
fails to compensate for the performance loss, indicating that, without careful
consideration, using a powerful remote computer may not result in performance
improvement. However, a carefully chosen distribution of computations of SAC
consistently and substantially improves its performance on both tasks. On the
other hand, the performance of PPO remains largely unaffected by the
distribution of computations. In addition, when all computations happen solely
on a powerful tethered computer, the performance of our system remains on par
with an existing system that is well-tuned for using a single machine. ReLoD is
the only publicly available system for real-time RL that applies to multiple
robots for vision-based tasks.
Comment: Appears in Proceedings of the 2023 International Conference on Robotics and Automation (ICRA). Source code at https://github.com/rlai-lab/relod and companion video at https://youtu.be/7iZKryi1xS
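To illustrate the kind of computational split the abstract refers to (policy inference and environment interaction on the resource-limited local machine; replay and gradient updates on the remote machine), here is a minimal, self-contained sketch using a local pipe. The real ReLoD system communicates over a wireless TCP link and runs actual SAC/PPO updates, so every name and message format below is illustrative only:

    import multiprocessing as mp

    def local_actor(conn, num_steps=100):
        # Resource-limited side: only policy inference and environment
        # interaction; stream transitions out, pull fresh weights when available.
        params = conn.recv()                      # initial policy parameters (a real actor would load them)
        for t in range(num_steps):
            transition = {"obs": t, "action": 0, "reward": 0.0}   # placeholder environment step
            conn.send(transition)
            if conn.poll():                       # non-blocking check for updated weights
                params = conn.recv()
        conn.send(None)                           # signal end of the data stream

    def remote_learner(conn, update_every=10):
        # Powerful side: buffer incoming transitions and run batched updates
        # (stand-in for SAC gradient steps), shipping parameters back periodically.
        replay, params = [], {"version": 0}
        conn.send(params)
        while True:
            item = conn.recv()
            if item is None:
                break
            replay.append(item)
            if len(replay) % update_every == 0:
                params = {"version": params["version"] + 1}   # placeholder "update"
                conn.send(params)

    if __name__ == "__main__":
        actor_end, learner_end = mp.Pipe()
        actor = mp.Process(target=local_actor, args=(actor_end,))
        learner = mp.Process(target=remote_learner, args=(learner_end,))
        actor.start(); learner.start()
        actor.join(); learner.join()

The abstract's finding is about where along this split each algorithm's components should live; the sketch only shows the communication pattern, not the tuned distribution of computations.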
Correcting discount-factor mismatch in on-policy policy gradient methods
The policy gradient theorem gives a convenient form of the policy gradient in
terms of three factors: an action value, a gradient of the action likelihood,
and a state distribution involving discounting called the \emph{discounted
stationary distribution}. But commonly used on-policy methods based on the
policy gradient theorem ignore the discount factor in the state distribution,
which is technically incorrect and may even cause degenerate learning behavior
in some environments. An existing solution corrects this discrepancy by using
γ^t as a factor in the gradient estimate. However, this solution is not
widely adopted and does not work well in tasks where the later states are
similar to earlier states. We introduce a novel distribution correction to
account for the discounted stationary distribution that can be plugged into
many existing gradient estimators. Our correction circumvents the performance
degradation associated with the γ^t correction while achieving lower variance.
Importantly, compared to the uncorrected estimators, our algorithm provides
improved state emphasis to evade suboptimal policies in certain environments
and consistently matches or exceeds the original performance on several OpenAI
gym and DeepMind suite benchmarks.
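For context, the mismatch the abstract describes can be written in standard notation (the paper's own distribution correction is not reproduced here). The policy gradient theorem weights states by the discounted stationary distribution,

    \nabla_\theta J(\theta) \;\propto\; \sum_{s} d^{\pi}_{\gamma}(s) \sum_{a} q^{\pi}(s,a)\,\nabla_\theta \pi(a \mid s, \theta),
    \qquad d^{\pi}_{\gamma}(s) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,\Pr(S_t = s \mid \pi)

so the γ^t-corrected per-step sample term is \gamma^{t}\,\hat{q}(S_t,A_t)\,\nabla_\theta \log \pi(A_t \mid S_t, \theta), whereas common on-policy implementations use \hat{q}(S_t,A_t)\,\nabla_\theta \log \pi(A_t \mid S_t, \theta), effectively sampling states from the undiscounted distribution.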