On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling
On-policy reinforcement learning (RL) algorithms perform policy updates using
i.i.d. trajectories collected by the current policy. However, after observing
only a finite number of trajectories, on-policy sampling may produce data that
fails to match the expected on-policy data distribution. This sampling error
leads to noisy updates and data-inefficient on-policy learning. Recent work in
the policy evaluation setting has shown that non-i.i.d., off-policy sampling
can produce data with lower sampling error than on-policy sampling.
Motivated by this observation, we introduce an adaptive, off-policy sampling
method to improve the data efficiency of on-policy policy gradient algorithms.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error
by collecting data with a behavior policy that increases the probability of
sampling actions that are under-sampled with respect to the current policy.
Rather than discarding data from old policies -- as is commonly done in
on-policy algorithms -- PROPS uses data collection to adjust the distribution
of previously collected data to be approximately on-policy. We empirically
evaluate PROPS on both continuous-action MuJoCo benchmark tasks and
discrete-action tasks, and demonstrate that PROPS (1) decreases sampling error
throughout training and (2) improves the data efficiency of on-policy policy
gradient algorithms. Our work improves the RL community's understanding of a
nuance in the on-policy vs. off-policy dichotomy: on-policy learning requires
on-policy data, not on-policy sampling.
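As a rough illustration of the idea behind PROPS, the sketch below adjusts a discrete behavior distribution toward actions that are currently under-sampled relative to the target policy. It is not the authors' algorithm: the correction rule, the `strength` knob, and the function name are assumptions for a toy discrete-action setting.

```python
import numpy as np

def props_style_behavior_probs(target_probs, action_counts, strength=1.0):
    """Upweight actions that are under-sampled relative to the current
    (target) policy, then renormalize. The specific correction rule and the
    `strength` knob are illustrative assumptions, not the PROPS update."""
    total = max(action_counts.sum(), 1)
    empirical = action_counts / total              # distribution observed so far
    deficit = target_probs - empirical             # positive => under-sampled
    logits = np.log(target_probs + 1e-8) + strength * deficit
    probs = np.exp(logits - logits.max())          # softmax for numerical stability
    return probs / probs.sum()

# Toy usage: the current policy is uniform over three actions, but action 0
# has been over-sampled, so the behavior policy shifts mass toward 1 and 2.
target = np.array([1/3, 1/3, 1/3])
counts = np.array([8, 2, 2])
print(props_style_behavior_probs(target, counts))
```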
State-Action Similarity-Based Representations for Off-Policy Evaluation
In reinforcement learning, off-policy evaluation (OPE) is the problem of
estimating the expected return of an evaluation policy given a fixed dataset
that was collected by running one or more different policies. One of the more
empirically successful algorithms for OPE has been the fitted Q-evaluation
(FQE) algorithm that uses temporal difference updates to learn an action-value
function, which is then used to estimate the expected return of the evaluation
policy. Typically, the original fixed dataset is fed directly into FQE to learn
the action-value function of the evaluation policy. Instead, in this paper, we
seek to enhance the data-efficiency of FQE by first transforming the fixed
dataset using a learned encoder, and then feeding the transformed dataset into
FQE. To learn such an encoder, we introduce an OPE-tailored state-action
behavioral similarity metric, and use this metric and the fixed dataset to
learn an encoder that models this metric. Theoretically, we show that this
metric allows us to bound the error in the resulting OPE estimate. Empirically,
we show that other state-action similarity metrics lead to representations that
cannot represent the action-value function of the evaluation policy, and that
our state-action representation method boosts the data-efficiency of FQE and
lowers OPE error relative to other OPE-based representation learning methods on
challenging OPE tasks. We also empirically show that the learned
representations significantly mitigate divergence of FQE under varying
distribution shifts. Our code is available here:
https://github.com/Badger-RL/ROPE.
Comment: Accepted to Neural Information Processing Systems (NeurIPS) 202
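The two-stage pipeline described above, learning an action-value function with FQE on top of a fixed, pretrained encoder, can be sketched as follows. The encoder training with the OPE-tailored similarity metric is omitted and the encoder is treated as a frozen stand-in; all module names, layer sizes, and the batch format are illustrative assumptions rather than the ROPE implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in for an encoder pretrained with a state-action similarity metric."""
    def __init__(self, state_dim, action_dim, rep_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128),
                                 nn.ReLU(), nn.Linear(128, rep_dim))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class QHead(nn.Module):
    """FQE's action-value function, learned on top of the (frozen) representation."""
    def __init__(self, rep_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(rep_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)

def fqe_update(q, q_target, encoder, batch, eval_policy, optimizer, gamma=0.99):
    """One temporal-difference update of FQE on encoded (state, action) pairs."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = eval_policy(s_next)          # action the evaluation policy would take
        target = r + gamma * (1.0 - done) * q_target(encoder(s_next, a_next))
    loss = ((q(encoder(s, a)) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the encoder and optimizing only the QHead parameters leaves the FQE update itself unchanged; only the representation it operates on differs from feeding the raw dataset into FQE.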
Understanding when Dynamics-Invariant Data Augmentations Benefit Model-Free Reinforcement Learning Updates
Recently, data augmentation (DA) has emerged as a method for leveraging
domain knowledge to inexpensively generate additional data in reinforcement
learning (RL) tasks, often yielding substantial improvements in data
efficiency. While prior work has demonstrated the utility of incorporating
augmented data directly into model-free RL updates, it is not well-understood
when a particular DA strategy will improve data efficiency. In this paper, we
seek to identify general aspects of DA responsible for observed learning
improvements. Our study focuses on sparse-reward tasks with dynamics-invariant
data augmentation functions, serving as an initial step towards a more general
understanding of DA and its integration into RL training. Experimentally, we
isolate three relevant aspects of DA: state-action coverage, reward density,
and the number of augmented transitions generated per update (the augmented
replay ratio). From our experiments, we draw two conclusions: (1) increasing
state-action coverage often has a much greater impact on data efficiency than
increasing reward density, and (2) decreasing the augmented replay ratio
substantially improves data efficiency. In fact, certain tasks in our empirical
study are solvable only when the augmented replay ratio is sufficiently low.
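To make the augmented replay ratio concrete, the sketch below shows one way a dynamics-invariant augmentation could generate extra transitions and how an update batch could be mixed to hit a target proportion of augmented to real data. The translation augmentation, buffer layout, and mixing rule are assumptions for illustration, not the paper's setup.

```python
import random

def translate_augment(transition, shift):
    """Example dynamics-invariant augmentation for a translation-invariant task:
    shifting the state and next state by the same offset leaves the dynamics
    (and a sparse goal-relative reward) unchanged. States are assumed to be
    numeric arrays. Purely illustrative."""
    s, a, r, s_next, done = transition
    return (s + shift, a, r, s_next + shift, done)

def sample_update_batch(real_buffer, aug_buffer, batch_size, augmented_replay_ratio):
    """Assemble an update batch so the proportion of augmented to real
    transitions matches the chosen ratio. Treating the augmented replay ratio
    as a per-batch proportion is an assumed simplification."""
    n_aug = int(batch_size * augmented_replay_ratio / (1 + augmented_replay_ratio))
    n_real = batch_size - n_aug
    batch = random.sample(real_buffer, min(n_real, len(real_buffer)))
    batch += random.sample(aug_buffer, min(n_aug, len(aug_buffer)))
    return batch
```

Lowering `augmented_replay_ratio` in this sketch mirrors the abstract's finding that decreasing the augmented replay ratio can substantially improve data efficiency.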
A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret
In various control task domains, existing controllers provide a baseline
level of performance that -- though possibly suboptimal -- should be
maintained. Reinforcement learning (RL) algorithms that rely on extensive
exploration of the state and action space can be used to optimize a control
policy. However, fully exploratory RL algorithms may decrease performance below
a baseline level during training. In this paper, we address the issue of online
optimization of a control policy while minimizing regret w.r.t. a baseline
policy's performance. We present a joint imitation-reinforcement learning
framework, denoted JIRL. The learning process in JIRL assumes the availability
of a baseline policy and is designed with two objectives in mind: (a)
leveraging the baseline's online demonstrations to minimize the regret w.r.t.
the baseline policy during training, and (b) eventually surpassing the
baseline performance. JIRL addresses these objectives by initially learning to
imitate the baseline policy and gradually shifting control from the baseline to
an RL agent. Experimental results show that JIRL effectively accomplishes the
aforementioned objectives in several continuous action-space domains. The
results demonstrate that JIRL is comparable to a state-of-the-art algorithm in
its final performance while incurring significantly lower baseline regret
during training in all of the presented domains. Moreover, the results show a
reduction factor of up to in baseline regret over a state-of-the-art
baseline regret minimization approach.
Comment: IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), 202
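A minimal sketch of the control hand-off idea is given below, assuming a simple linear schedule for shifting action selection from the baseline to the RL agent; JIRL's actual transfer mechanism and its imitation-loss details are not reproduced here, and the function name is illustrative.

```python
import random

def jirl_style_action(baseline_action, rl_action, step, total_steps):
    """Illustrative control hand-off: early in training the baseline mostly
    acts (its actions can also supervise the learner via an imitation loss),
    and control gradually shifts to the RL agent. The linear schedule is an
    assumed stand-in for JIRL's transfer rule, not the paper's mechanism."""
    p_rl = min(1.0, step / max(total_steps, 1))   # probability the RL agent acts
    return rl_action if random.random() < p_rl else baseline_action
```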
Tackling Unbounded State Spaces in Continuing Task Reinforcement Learning
While deep reinforcement learning (RL) algorithms have been successfully
applied to many tasks, their inability to extrapolate and strong reliance on
episodic resets inhibit their applicability to many real-world settings. For
instance, in stochastic queueing problems, the state space can be unbounded and
the agent may have to learn online without the system ever being reset to
states the agent has seen before. In such settings, we show that deep RL agents
can diverge into unseen states from which they can never recover due to the
lack of resets, especially in highly stochastic environments. Towards
overcoming this divergence, we introduce a Lyapunov-inspired reward shaping
approach that encourages the agent to first learn to be stable (i.e., to achieve
bounded cost) and then to learn to be optimal. We theoretically show that our
reward shaping technique reduces the rate of divergence of the agent and
empirically find that it prevents divergence. We further combine our reward shaping
approach with a weight annealing scheme that gradually introduces optimality
and with a log-transform of the state inputs, and find that these techniques
enable deep RL algorithms to learn high-performing policies when learning
online in unbounded state-space domains.
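A minimal sketch of the stability-first idea, assuming a generic quadratic Lyapunov candidate on the state together with the log-transform of inputs mentioned above, is shown below; the exact shaping function, Lyapunov candidate, and annealing schedule used in the paper are not reproduced, and the function names are illustrative.

```python
import numpy as np

def log_transform(state):
    """Log-compress non-negative, potentially unbounded state inputs
    (e.g., queue lengths) before feeding them to the network."""
    return np.log1p(np.asarray(state, dtype=float))

def lyapunov_shaped_reward(cost, state, next_state, stability_weight=1.0):
    """Illustrative Lyapunov-inspired shaping: penalize the one-step drift of a
    quadratic Lyapunov candidate V(s) = ||s||^2 so the agent is first pushed
    toward stability (bounded cost) before optimizing the original objective.
    The candidate V and the fixed weighting are assumptions, not the paper's form."""
    s = np.asarray(state, dtype=float)
    s_next = np.asarray(next_state, dtype=float)
    drift = np.sum(s_next ** 2) - np.sum(s ** 2)
    return -cost - stability_weight * drift

# Toy usage: a growing queueing state incurs an extra stability penalty.
print(lyapunov_shaped_reward(cost=3.0, state=[2.0, 1.0], next_state=[3.0, 1.0]))
```

Under this reading, the weight annealing scheme mentioned in the abstract would correspond to gradually decaying the stability term's weight so that optimality dominates once the policy is stable.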