Policy Evaluation in Distributional LQR
Distributional reinforcement learning (DRL) enhances the understanding of the
effects of the randomness in the environment by letting agents learn the
distribution of a random return, rather than its expected value as in standard
RL. A main challenge in DRL, however, is that policy evaluation typically
relies on the representation of the return distribution, which needs
to be carefully designed. In this paper, we address this challenge for a
special class of DRL problems that rely on linear quadratic regulator (LQR) for
control, advocating for a new distributional approach to LQR, which we call
\emph{distributional LQR}. Specifically, we provide a closed-form expression of
the distribution of the random return which, remarkably, is applicable to all
exogenous disturbances on the dynamics, as long as they are independent and
identically distributed (i.i.d.). While the proposed exact return distribution
consists of infinitely many random variables, we show that this distribution
can be approximated by a finite number of random variables, and the associated
approximation error can be analytically bounded under mild assumptions. Using
the approximate return distribution, we propose a zeroth-order policy gradient
algorithm for risk-averse LQR using the Conditional Value at Risk (CVaR) as a
measure of risk. Numerical experiments are provided to illustrate our
theoretical results.
Comment: 12 pages.
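To make the risk-averse objective described in this abstract concrete, the sketch below samples the random finite-horizon LQR cost by Monte-Carlo rollouts under i.i.d. Gaussian disturbances, evaluates its CVaR, and forms a two-point zeroth-order gradient estimate with respect to the feedback gain. This is a minimal illustration under assumed names, disturbance model, horizon, and step size; it is not the paper's algorithm or its analytic bounds.

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """CVaR_alpha of a cost sample: mean of the worst alpha-fraction of outcomes."""
    costs = np.sort(costs)
    k = max(1, int(np.ceil(alpha * len(costs))))
    return costs[-k:].mean()

def sampled_lqr_cost(K, A, B, Q, R, horizon=50, n_samples=200, rng=None):
    """Monte-Carlo samples of the random LQR cost under u = -K x,
    with i.i.d. standard-normal disturbances (illustrative noise model)."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[0]
    costs = np.empty(n_samples)
    for s in range(n_samples):
        x, c = rng.standard_normal(n), 0.0
        for _ in range(horizon):
            u = -K @ x
            c += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + rng.standard_normal(n)
        costs[s] = c
    return costs

def zeroth_order_grad(objective, K, sigma=0.05, n_dirs=20, rng=None):
    """Two-point zeroth-order estimate of the gradient of objective(K)."""
    rng = rng or np.random.default_rng(1)
    grad = np.zeros_like(K)
    for _ in range(n_dirs):
        U = rng.standard_normal(K.shape)
        delta = objective(K + sigma * U) - objective(K - sigma * U)
        grad += delta / (2.0 * sigma) * U
    return grad / n_dirs

# Example: one risk-averse policy-gradient step on a 2-state, 1-input system.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.zeros((1, 2))
objective = lambda K: cvar(sampled_lqr_cost(K, A, B, Q, R), alpha=0.1)
K -= 1e-3 * zeroth_order_grad(objective, K)
```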
Learning Risk-Aware Quadrupedal Locomotion using Distributional Reinforcement Learning
Deployment in hazardous environments requires robots to understand the risks
associated with their actions and movements to prevent accidents. Despite
their importance, these risks are not explicitly modeled by currently deployed
locomotion controllers for legged robots. In this work, we propose a
risk-sensitive locomotion training method employing distributional reinforcement
learning to consider safety explicitly. Instead of relying on a value
expectation, we estimate the complete value distribution to account for
uncertainty in the robot's interaction with the environment. The value
distribution is consumed by a risk metric to extract risk-sensitive value
estimates. These are integrated into Proximal Policy Optimization (PPO) to
derive our method, Distributional Proximal Policy Optimization (DPPO). The risk
preference, ranging from risk-averse to risk-seeking, can be controlled by a
single parameter, which makes it possible to adjust the robot's behavior
dynamically. Importantly, our approach removes the need for additional reward
function tuning to achieve risk sensitivity. We show emergent risk-sensitive
locomotion behavior in simulation and on the quadrupedal robot ANYmal.
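For intuition, a risk metric over a discrete (quantile-based) value distribution can be realized as a CVaR readout controlled by a single parameter. The sketch below is a generic illustration of that idea; the function name, quantile representation, and parameterization are assumptions for the example and not the exact metric or PPO integration used in DPPO.

```python
import torch

def risk_sensitive_value(quantiles: torch.Tensor, risk_param: float) -> torch.Tensor:
    """Collapse a quantile-based value distribution into a scalar value estimate.

    quantiles:  [batch, n_quantiles] critic outputs approximating the return distribution.
    risk_param: in (0, 1]; 1.0 averages all quantiles (risk-neutral), smaller values
                average only the worst quantiles (risk-averse). Averaging the best
                quantiles instead would give risk-seeking behavior.
    """
    sorted_q, _ = torch.sort(quantiles, dim=-1)
    k = max(1, int(risk_param * sorted_q.shape[-1]))
    return sorted_q[..., :k].mean(dim=-1)

# The resulting scalar can stand in for the usual critic value when computing
# advantages for a PPO-style update, e.g.:
values = risk_sensitive_value(torch.randn(8, 32), risk_param=0.25)  # shape [8]
```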
Normality-Guided Distributional Reinforcement Learning for Continuous Control
Learning a predictive model of the mean return, or value function, plays a
critical role in many reinforcement learning algorithms. Distributional
reinforcement learning (DRL) has been shown to improve performance by modeling
the value distribution, not just the mean. We study the value distribution in
several continuous control tasks and find that the learned value distribution
is empirically quite close to normal. We design a method that exploits this
property, employing variances predicted by a variance network, along with
returns, to analytically compute target quantile bars representing a normal for
our distributional value function. In addition, we propose a policy update
strategy based on the correctness as measured by structural characteristics of
the value distribution not present in the standard value function. The approach
we outline is compatible with many DRL structures. We use two representative
on-policy algorithms, PPO and TRPO, as testbeds. Our method yields
statistically significant improvements in 10 out of 16 continuous task
settings, while utilizing a reduced number of weights and achieving faster
training time compared to an ensemble-based method for quantifying value
distribution uncertainty.
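To illustrate the normality assumption, the sketch below turns a return estimate and a variance predicted by a separate variance network into analytic quantile targets of a Gaussian for a quantile-based critic. The quantile spacing, clamping, and names are illustrative choices, not the paper's exact construction.

```python
import torch
from torch.distributions import Normal

def normal_quantile_targets(ret: torch.Tensor, var: torch.Tensor, n_quantiles: int = 32):
    """Analytic quantile targets under a Gaussian assumption on the return.

    ret: [batch] return (or bootstrapped return) estimates.
    var: [batch] variances predicted by a variance network.
    Returns [batch, n_quantiles] targets at the quantile midpoints tau_i = (i + 0.5) / N.
    """
    taus = (torch.arange(n_quantiles, dtype=ret.dtype) + 0.5) / n_quantiles
    z = Normal(0.0, 1.0).icdf(taus)  # standard-normal quantiles
    return ret.unsqueeze(-1) + var.clamp_min(1e-8).sqrt().unsqueeze(-1) * z

# Example: targets for a batch of 4 transitions, used to train a quantile critic.
targets = normal_quantile_targets(torch.randn(4), torch.rand(4))  # shape [4, 32]
```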
Sample-based Uncertainty Quantification with a Single Deterministic Neural Network
Development of an accurate, flexible, and numerically efficient uncertainty
quantification (UQ) method is one of the fundamental challenges in machine
learning. Previously, a UQ method called DISCO Nets has been proposed
(Bouchacourt et al., 2016), which trains a neural network by minimizing the
energy score. In this method, a random noise vector is concatenated with the
original input vector in
order to produce a diverse ensemble forecast despite using a single neural
network. While this method has shown promising performance on a hand pose
estimation task in computer vision, it remained unexplored whether this method
works equally well for regression on tabular data, and how it competes with more
recent advanced UQ methods such as NGBoost. In this paper, we propose an
improved neural architecture of DISCO Nets that admits faster and more stable
training while only using a compact noise vector. We benchmark this approach on a variety of real-world tabular
datasets and confirm that it is competitive with or even superior to standard
UQ baselines. Moreover, we observe that it exhibits better point forecast
performance than a neural network of the same size trained with the
conventional mean squared error. As another advantage of the proposed method,
we show that local feature importance computation methods such as SHAP can be
easily applied to any subregion of the predictive distribution. A new
elementary proof for the validity of using the energy score to learn predictive
distributions is also provided.
Comment: 16 pages, 17 figures, 2 tables. Accepted by the 14th International
Conference on Neural Computation Theory and Applications (NCTA 2022), held as
part of IJCCI 2022, October 24-26, 2022, Valletta, Malta.
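As a rough illustration of DISCO-Nets-style training, the sketch below draws several forecasts per input by concatenating fresh noise vectors and minimizes a sample-based energy score. The noise dimension, sample count, model interface, and network size are assumptions for the example, not the improved architecture proposed in the paper.

```python
import torch
import torch.nn as nn

def energy_score_loss(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                      noise_dim: int = 8, n_samples: int = 8) -> torch.Tensor:
    """Sample-based energy score for a single deterministic network fed (input, noise).

    model: maps [*, in_dim + noise_dim] -> [*, out_dim]
    x: [batch, in_dim], y: [batch, out_dim]
    """
    batch = x.shape[0]
    xs = x.unsqueeze(1).expand(-1, n_samples, -1)   # repeat each input n_samples times
    eps = torch.randn(batch, n_samples, noise_dim, device=x.device)
    preds = model(torch.cat([xs, eps], dim=-1))     # [batch, n_samples, out_dim]

    # Energy score: E||Yhat - y|| - (1/2) E||Yhat - Yhat'|| over distinct sample pairs.
    term1 = (preds - y.unsqueeze(1)).norm(dim=-1).mean(dim=1)
    pdist = torch.cdist(preds, preds)               # [batch, n_samples, n_samples]
    term2 = pdist.sum(dim=(1, 2)) / (n_samples * (n_samples - 1))
    return (term1 - 0.5 * term2).mean()

# Example: a small MLP trained on random data with the energy score.
model = nn.Sequential(nn.Linear(5 + 8, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(32, 5), torch.randn(32, 1)
loss = energy_score_loss(model, x, y)
loss.backward()
```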