Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify those
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm.
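As a rough illustration of the joint step-size/batch-size idea described above, the sketch below runs a REINFORCE-style update on a toy one-step Gaussian-policy problem and only applies an update when the estimated gradient clearly dominates its own noise, otherwise growing the batch. The toy task, the improvement test, and all constants are illustrative assumptions, not the paper's actual bounds or schedules.

```python
# A minimal sketch, assuming a toy one-step task: Gaussian policy a ~ N(theta, sigma^2)
# with reward -a^2. The update is applied only when the gradient estimate clearly
# dominates its standard error; otherwise the batch size is doubled. The improvement
# test and constants are illustrative, not the paper's bounds.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

def sample(theta):
    """Return (reward, per-sample REINFORCE gradient) for one action."""
    a = theta + sigma * rng.normal()
    r = -a ** 2
    score = (a - theta) / sigma ** 2      # d/dtheta log N(a; theta, sigma^2)
    return r, r * score

def estimate(theta, batch_size):
    rs, gs = map(np.array, zip(*(sample(theta) for _ in range(batch_size))))
    return rs.mean(), gs.mean(), gs.var(ddof=1)

theta, batch_size = 2.0, 32
for it in range(30):
    perf, grad, grad_var = estimate(theta, batch_size)
    se = np.sqrt(grad_var / batch_size)   # standard error of the gradient estimate
    step = 0.1 / (1.0 + se)               # shrink the step when the estimate is noisy
    if abs(grad) > 2.0 * se:              # crude high-probability improvement check
        theta += step * grad
    else:
        batch_size *= 2                   # not confident: collect more samples instead
    print(f"iter {it:2d}  J~{perf:7.3f}  theta={theta:6.3f}  batch={batch_size}")
```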
Online Learning with Off-Policy Feedback
We study the problem of online learning in adversarial bandit problems under
a partial observability model called off-policy feedback. In this sequential
decision making problem, the learner cannot directly observe its rewards, but
instead sees the ones obtained by another unknown policy run in parallel
(behavior policy). Instead of a standard exploration-exploitation dilemma, the
learner has to face another challenge in this setting: due to limited
observations outside of its control, the learner may not be able to estimate
the value of each policy equally well. To address this issue, we propose a set
of algorithms that guarantee regret bounds that scale with a natural notion of
mismatch between any comparator policy and the behavior policy, achieving
improved performance against comparators that are well-covered by the
observations. We also provide an extension to the setting of adversarial linear
contextual bandits, and verify the theoretical guarantees via a set of
experiments. Our key algorithmic idea is adapting the notion of pessimistic
reward estimators that has recently become popular in the context of off-policy
reinforcement learning.
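As a rough sketch of the pessimistic importance-weighted estimators mentioned above, the snippet below scores each arm of a toy bandit from data logged by a fixed behavior policy, subtracting a variance-based penalty that grows for poorly covered arms. The bandit instance and the penalty form are illustrative assumptions, not the paper's estimator or its regret analysis.

```python
# A minimal sketch, assuming a toy 4-armed Bernoulli bandit whose data were logged
# by a fixed behavior policy. Each arm is scored by an importance-weighted mean
# reward minus a variance-based pessimism penalty; the penalty form is illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_arms, T = 4, 2000
true_means = np.array([0.3, 0.5, 0.7, 0.4])
behavior = np.array([0.4, 0.4, 0.1, 0.1])          # arms 2 and 3 are poorly covered

actions = rng.choice(n_arms, size=T, p=behavior)
rewards = (rng.random(T) < true_means[actions]).astype(float)

values = np.zeros(n_arms)
penalties = np.zeros(n_arms)
for a in range(n_arms):
    w = (actions == a) / behavior[a]               # importance weight 1{A_t = a}/b(a)
    values[a] = np.mean(w * rewards)               # unbiased off-policy value estimate
    penalties[a] = np.sqrt(np.var(w * rewards, ddof=1) / T)   # coverage-driven penalty

pessimistic = values - penalties
print("plain IW estimates :", values.round(3))
print("pessimistic scores :", pessimistic.round(3))
print("selected arm       :", int(np.argmax(pessimistic)))
```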
Stochastic Variance-Reduced Policy Gradient
In this paper, we propose a novel reinforcement-learning algorithm
consisting of a stochastic variance-reduced version of policy gradient for
solving Markov Decision Processes (MDPs). Stochastic variance-reduced
gradient (SVRG) methods have proven to be very successful in supervised
learning. However, their adaptation to policy gradient is not
straightforward and needs to account for I) a non-concave objective
function; II) approximations in the full gradient computation; and III) a
non-stationary sampling process. The result is SVRPG, a stochastic
variance-reduced policy gradient algorithm that leverages importance
weights to preserve the unbiasedness of the gradient estimate. Under
standard assumptions on the MDP, we provide convergence guarantees for
SVRPG with a convergence rate that is linear under increasing batch sizes.
Finally, we suggest practical variants of SVRPG and empirically evaluate
them on continuous MDPs.
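The following minimal sketch of an SVRPG-style inner update on a toy one-step Gaussian-policy problem shows how importance weights let trajectories drawn at the current parameters correct a large-batch snapshot gradient while keeping the correction unbiased. The task, batch sizes, and step size are illustrative assumptions, not the algorithm's experimental configuration.

```python
# A minimal sketch, assuming a toy one-step Gaussian-policy problem
# (a ~ N(theta, sigma^2), reward -a^2). Small batches drawn at the current
# parameters are re-targeted to the snapshot policy via importance weights so
# the variance-reduced correction stays unbiased. Batch sizes and the step
# size are illustrative.
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5

def per_sample_grad(a, theta):
    """REINFORCE gradient of E[-a^2] w.r.t. the Gaussian policy mean."""
    return (-a ** 2) * (a - theta) / sigma ** 2

def log_density(a, theta):
    return -0.5 * ((a - theta) / sigma) ** 2

theta_snap = 2.0
theta = theta_snap
a_big = theta_snap + sigma * rng.normal(size=2000)        # large snapshot batch
mu_snap = per_sample_grad(a_big, theta_snap).mean()       # "full" gradient estimate

for t in range(20):
    a = theta + sigma * rng.normal(size=20)               # small batch, current policy
    g_cur = per_sample_grad(a, theta)
    # Importance weights re-target these samples to the snapshot policy, so the
    # subtracted term is still an unbiased estimate of the snapshot gradient.
    w = np.exp(log_density(a, theta_snap) - log_density(a, theta))
    g_snap = w * per_sample_grad(a, theta_snap)
    v = g_cur.mean() - g_snap.mean() + mu_snap            # variance-reduced direction
    theta += 0.05 * v
    print(f"inner step {t:2d}  theta = {theta:6.3f}")
```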
Adaptive Batch Size for Safe Policy Gradients
Policy gradient methods are among the best Reinforcement Learning (RL) techniques to solve complex control problems. In real-world RL applications, it is common to have a good initial policy whose performance needs to be improved, and it may not be acceptable to try bad policies during the learning process. Although several methods for choosing the step size exist, research has paid less attention to determining the batch size, that is, the number of samples used to estimate the gradient direction for each update of the policy parameters. In this paper, we propose a set of methods to jointly optimize the step and the batch sizes that guarantee (with high probability) to improve the policy performance after each update. Besides providing theoretical guarantees, we show numerical simulations to analyse the behaviour of our methods.
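Complementing the sketch given earlier, a generic way to turn a variance estimate into a batch size is a concentration bound: the snippet below uses Chebyshev's inequality to pick the smallest batch for which the gradient estimate stays within a tolerance with high probability. The paper derives tighter, policy-specific bounds; this is only the generic idea with made-up numbers.

```python
# A minimal sketch, assuming per-trajectory gradient estimates are available:
# Chebyshev's inequality gives the smallest batch size N with
# P(|mean_N - true gradient| >= epsilon) <= delta, i.e. N >= var / (delta * epsilon^2).
# The paper's bounds are tighter and policy-specific; the numbers here are made up.
import numpy as np

def batch_size_for_accuracy(grad_samples, epsilon, delta):
    """Smallest N guaranteeing epsilon-accuracy with probability >= 1 - delta (Chebyshev)."""
    var = np.var(grad_samples, ddof=1)
    return int(np.ceil(var / (delta * epsilon ** 2)))

rng = np.random.default_rng(3)
grad_samples = -4.0 + 3.0 * rng.normal(size=200)   # noisy estimates of a gradient of -4
print("suggested batch size:",
      batch_size_for_accuracy(grad_samples, epsilon=1.0, delta=0.05))
```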
Policy Optimization as Online Learning with Mediator Feedback
Policy Optimization (PO) is a widely used approach to address continuous
control tasks. In this paper, we introduce the notion of mediator feedback that
frames PO as an online learning problem over the policy space. The additional
available information, compared to the standard bandit feedback, allows reusing
samples generated by one policy to estimate the performance of other policies.
Based on this observation, we propose an algorithm, RANDomized-exploration
policy Optimization via Multiple Importance Sampling with Truncation
(RANDOMIST), for regret minimization in PO, that employs a randomized
exploration strategy, unlike the existing optimistic approaches. When the
policy space is finite, we show that, under certain circumstances, it is
possible to achieve constant regret, while logarithmic regret is always
guaranteed.
We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST
to compact policy spaces. Finally, we provide numerical simulations on finite
and compact policy spaces, in comparison with PO and bandit baselines.
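As a rough sketch of reusing samples across policies via multiple importance sampling with truncation, the snippet below evaluates a candidate policy on a toy finite action space using data logged by two earlier behavior policies, with a balance-heuristic mixture in the denominator and clipped weights. The instance and the truncation threshold are illustrative assumptions, not RANDOMIST's exact estimator.

```python
# A minimal sketch, assuming a toy 5-action problem with Bernoulli rewards and data
# logged by two earlier behavior policies. A candidate policy is evaluated with a
# balance-heuristic multiple importance sampling estimator whose weights are
# truncated to control variance; the instance and threshold are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n_actions = 5
true_reward = np.array([0.1, 0.4, 0.9, 0.3, 0.2])

behaviors = [np.array([0.4, 0.3, 0.1, 0.1, 0.1]),
             np.array([0.1, 0.2, 0.3, 0.2, 0.2])]
n_per = 500
actions, rewards = [], []
for b in behaviors:
    a = rng.choice(n_actions, size=n_per, p=b)
    actions.append(a)
    rewards.append((rng.random(n_per) < true_reward[a]).astype(float))
actions, rewards = np.concatenate(actions), np.concatenate(rewards)

def mis_value(target, truncation=10.0):
    """Truncated balance-heuristic estimate of the target policy's expected reward."""
    mixture = np.mean(behaviors, axis=0)        # equal sample counts per behavior policy
    w = target[actions] / mixture[actions]      # balance-heuristic importance weights
    w = np.minimum(w, truncation)               # truncation trades a little bias for variance
    return float(np.mean(w * rewards))

target = np.array([0.05, 0.05, 0.8, 0.05, 0.05])  # candidate policy to evaluate
print("MIS estimate:", round(mis_value(target), 3))
print("true value  :", round(float(target @ true_reward), 3))
```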
Gradient-Aware Model-based Policy Search
Traditional model-based reinforcement learning approaches learn a model of
the environment dynamics without explicitly considering how it will be used by
the agent. In the presence of misspecified model classes, this can lead to poor
estimates, as some relevant available information is ignored. In this paper, we
introduce a novel model-based policy search approach that exploits the
knowledge of the current agent policy to learn an approximate transition model,
focusing on the portions of the environment that are most relevant for policy
improvement. We leverage a weighting scheme, derived from the minimization of
the error on the model-based policy gradient estimator, in order to define a
suitable objective function that is optimized for learning the approximate
transition model. Then, we integrate this procedure into a batch policy
improvement algorithm, named Gradient-Aware Model-based Policy Search (GAMPS),
which iteratively learns a transition model and uses it, together with the
collected trajectories, to compute the new policy parameters. Finally, we
empirically validate GAMPS on benchmark domains, analyzing and discussing its
properties.
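As a rough illustration of the gradient-aware weighting idea, the sketch below fits a linear transition model by weighted least squares, weighting each observed transition by a simplified proxy for its contribution to the policy gradient (discount factor times the magnitude of the cumulated policy score). Both the weight and the toy dynamics are illustrative stand-ins for GAMPS's derived objective.

```python
# A minimal sketch, assuming toy 1D dynamics s' = 0.9*s + a + noise and a Gaussian
# policy. Each transition is weighted by a simplified proxy for its relevance to the
# policy gradient (discount factor times the magnitude of the cumulated policy score),
# and a linear model is fit by weighted least squares. Both the weight and the task
# are stand-ins for GAMPS's derived weighting scheme.
import numpy as np

rng = np.random.default_rng(5)
gamma, horizon, n_traj = 0.95, 30, 50
theta, pol_std = 0.3, 0.2                      # Gaussian policy: a ~ N(theta, pol_std^2)

S, A, S_next, W = [], [], [], []
for _ in range(n_traj):
    s, cum_score = 0.5, 0.0
    for t in range(horizon):
        a = theta + pol_std * rng.normal()
        cum_score += (a - theta) / pol_std ** 2          # policy score accumulated so far
        s_next = 0.9 * s + a + 0.05 * rng.normal()
        S.append(s); A.append(a); S_next.append(s_next)
        W.append(gamma ** t * abs(cum_score))            # gradient-aware weight (simplified)
        s = s_next

X = np.column_stack([S, A])
w = np.sqrt(np.array(W))
# Weighted least squares fit of s' ~ k_s * s + k_a * a.
coef, *_ = np.linalg.lstsq(X * w[:, None], np.array(S_next) * w, rcond=None)
print("weighted model coefficients (k_s, k_a):", coef.round(3))
```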
A GPU-Accelerated Modern Fortran Version of the ECHO Code for Relativistic Magnetohydrodynamics
The numerical study of relativistic magnetohydrodynamics (MHD) plays a
crucial role in high-energy astrophysics, but unfortunately is computationally
demanding, given the complex physics involved (high Lorentz factor flows,
extreme magnetization, curved spacetimes near compact objects) and the large
variety of spatial scales needed to resolve turbulent motions. A great benefit
comes from the porting of existing codes running on standard processors to
GPU-based platforms. However, this usually requires a drastic rewriting of the
original code, the use of specific languages like CUDA, and a complex analysis
of data management and optimization of parallel processes. Here we describe the
porting of the ECHO code for special and general relativistic MHD to
accelerated devices, simply based on native Fortran language built-in
constructs, especially 'do concurrent' loops, a few OpenACC directives, and the
straightforward data management provided by the Unified Memory option of NVIDIA
compilers. Thanks to these very minor modifications to the original code, the
new version of ECHO runs at least 16 times faster on GPU platforms compared to
CPU-based ones. The chosen benchmark is the 3D propagation of a relativistic
MHD Alfvén wave, for which strong and weak scaling tests performed on the
LEONARDO pre-exascale supercomputer at CINECA are provided (using up to 256
nodes corresponding to 1024 GPUs, and over 14 billion cells). Finally, an
example of high-resolution relativistic MHD Alfvénic turbulence simulation is
shown, demonstrating the potential for astrophysical plasmas of the new
GPU-based version of ECHO.