11 research outputs found

Policy Evaluation in Distributional LQR

Author: Abate Alessandro
Gao Yulong
Johansson Karl H.
Wang Siyi
Wang Zifan
Zavlanos Michael M.
Publication date: 23/03/2023

Distributional reinforcement learning (DRL) enhances the understanding of the effects of the randomness in the environment by letting agents learn the distribution of a random return, rather than its expected value as in standard RL. At the same time, a main challenge in DRL is that policy evaluation typically relies on the representation of the return distribution, which needs to be carefully designed. In this paper, we address this challenge for a special class of DRL problems that rely on the linear quadratic regulator (LQR) for control, advocating for a new distributional approach to LQR, which we call distributional LQR. Specifically, we provide a closed-form expression of the distribution of the random return which, remarkably, is applicable to all exogenous disturbances on the dynamics, as long as they are independent and identically distributed (i.i.d.). While the proposed exact return distribution consists of infinitely many random variables, we show that this distribution can be approximated by a finite number of random variables, and the associated approximation error can be analytically bounded under mild assumptions. Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR using the Conditional Value at Risk (CVaR) as a measure of risk. Numerical experiments are provided to illustrate our theoretical results. Comment: 12 pages
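The CVaR objective mentioned in this abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's closed-form return distribution or its zeroth-order algorithm: it assumes a hypothetical scalar system with Gaussian disturbances, a truncated horizon, and made-up parameter values (`a`, `b`, `q`, `r`, `gamma`), and it simply estimates CVaR as the mean of the worst sampled costs.

```python
import math
import random


def sample_return(k, x0=1.0, a=0.9, b=0.5, q=1.0, r=0.1,
                  gamma=0.9, horizon=50, rng=random):
    """One truncated discounted cost of a scalar LQR under u = -k * x.

    All system parameters are illustrative assumptions.
    """
    x, total = x0, 0.0
    for t in range(horizon):
        u = -k * x
        total += (gamma ** t) * (q * x * x + r * u * u)
        # i.i.d. Gaussian disturbance (an assumption made for this sketch)
        x = a * x + b * u + rng.gauss(0.0, 0.1)
    return total


def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of the sampled costs."""
    ordered = sorted(costs)
    tail = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    return sum(ordered[-tail:]) / tail


if __name__ == "__main__":
    rng = random.Random(0)
    costs = [sample_return(0.5, rng=rng) for _ in range(1000)]
    print("mean cost:", sum(costs) / len(costs))
    print("CVaR at level 0.9:", empirical_cvar(costs, 0.9))
```

Because the tail average is never smaller than the overall average, a risk-averse controller that minimizes this quantity puts more weight on unlucky disturbance sequences than one that minimizes the mean cost.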

arXiv.org e-Print Archive

Learning Risk-Aware Quadrupedal Locomotion using Distributional Reinforcement Learning

Author: Frey Jonas
Hutter Marco
Miki Takahiro
Schneider Lukas
Publication date: 25/09/2023

Deployment in hazardous environments requires robots to understand the risks associated with their actions and movements to prevent accidents. Despite their importance, these risks are not explicitly modeled by currently deployed locomotion controllers for legged robots. In this work, we propose a risk-sensitive locomotion training method employing distributional reinforcement learning to consider safety explicitly. Instead of relying on a value expectation, we estimate the complete value distribution to account for uncertainty in the robot's interaction with the environment. The value distribution is consumed by a risk metric to extract risk-sensitive value estimates. These are integrated into Proximal Policy Optimization (PPO) to derive our method, Distributional Proximal Policy Optimization (DPPO). The risk preference, ranging from risk-averse to risk-seeking, can be controlled by a single parameter, which makes it possible to adjust the robot's behavior dynamically. Importantly, our approach removes the need for additional reward function tuning to achieve risk sensitivity. We show emergent risk-sensitive locomotion behavior in simulation and on the quadrupedal robot ANYmal.

arXiv.org e-Print Archive

Normality-Guided Distributional Reinforcement Learning for Continuous Control

Author: Byun Ju-Seung
Perrault Andrew
Publication date: 17/01/2024

Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) has been shown to improve performance by modeling the value distribution, not just the mean. We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal. We design a method that exploits this property, employing variances predicted from a variance network, along with returns, to analytically compute target quantile bars representing a normal for our distributional value function. In addition, we propose a policy update strategy based on the correctness as measured by structural characteristics of the value distribution not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds. Our method yields statistically significant improvements in 10 out of 16 continuous task settings, while utilizing a reduced number of weights and achieving faster training time compared to an ensemble-based method for quantifying value distribution uncertainty.
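One way to read the "target quantile bars representing a normal" in this abstract is as quantiles of a normal distribution whose mean is the observed return and whose standard deviation comes from the variance network. The sketch below makes that assumption explicit. The function name, the midpoint quantile levels, and the example numbers are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist


def normal_quantile_targets(mean, std, num_quantiles):
    """Quantiles of N(mean, std^2) at midpoint levels (2i + 1) / (2N).

    `mean` stands in for an observed return and `std` for the square
    root of a predicted variance; both are assumed inputs.
    `std` must be strictly positive.
    """
    dist = NormalDist(mu=mean, sigma=std)
    taus = [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]
    return [dist.inv_cdf(tau) for tau in taus]


if __name__ == "__main__":
    for value in normal_quantile_targets(mean=1.0, std=2.0, num_quantiles=5):
        print(round(value, 3))
```

The targets are symmetric around the mean and spread out in proportion to the standard deviation, so a quantile-based value function trained against them is pushed toward a normal shape.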

arXiv.org e-Print Archive

Sample-based Uncertainty Quantification with a Single Deterministic Neural Network

Author: Gupta Chetan
Kanazawa Takuya
Publication date: 03/11/2022

Development of an accurate, flexible, and numerically efficient uncertainty quantification (UQ) method is one of the fundamental challenges in machine learning. Previously, a UQ method called DISCO Nets has been proposed (Bouchacourt et al., 2016), which trains a neural network by minimizing the energy score. In this method, a random noise vector in $\mathbb{R}^{10\text{--}100}$ is concatenated with the original input vector in order to produce a diverse ensemble forecast despite using a single neural network. While this method has shown promising performance on a hand pose estimation task in computer vision, it remained unexplored whether this method works as nicely for regression on tabular data, and how it competes with more recent advanced UQ methods such as NGBoost. In this paper, we propose an improved neural architecture of DISCO Nets that admits faster and more stable training while only using a compact noise vector of dimension $\sim \mathcal{O}(1)$. We benchmark this approach on miscellaneous real-world tabular datasets and confirm that it is competitive with or even superior to standard UQ baselines. Moreover, we observe that it exhibits better point forecast performance than a neural network of the same size trained with the conventional mean squared error. As another advantage of the proposed method, we show that local feature importance computation methods such as SHAP can be easily applied to any subregion of the predictive distribution. A new elementary proof for the validity of using the energy score to learn predictive distributions is also provided. Comment: 16 pages, 17 figures, 2 tables. Accepted by the 14th International Conference on Neural Computation Theory and Applications (NCTA 2022) held as part of IJCCI 2022, October 24-26, 2022, Valletta, Malta
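The energy score that this abstract refers to can be estimated from a finite set of forecast samples. The sketch below assumes the common form ES(P, y) = E||X - y|| - (1/2) E||X - X'|| with the Euclidean norm and a plain sample average over all pairs. The exact estimator and any norm exponent used in the paper may differ, and the function name is hypothetical.

```python
import math


def energy_score(samples, target):
    """Sample estimate of the energy score (lower is better).

    `samples` is a list of equal-length coordinate sequences drawn from
    the forecast distribution; `target` is the observed outcome.
    """
    m = len(samples)
    # Average distance from forecast samples to the observation.
    fit = sum(math.dist(x, target) for x in samples) / m
    # Half the average pairwise distance among forecast samples.
    spread = sum(math.dist(x, y) for x in samples for y in samples) / (2 * m * m)
    return fit - spread


if __name__ == "__main__":
    forecast = [(0.0,), (2.0,)]
    print(energy_score(forecast, (1.0,)))  # fit 1.0 minus spread 0.5
```

The first term rewards samples that land near the observation, while the second term rewards diversity among the samples, which is what discourages a network from collapsing its ensemble to a single point.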

arXiv.org e-Print Archive

A Novel Bayes' Theorem for Upper Probabilities

Author: Caprio Michele
Hüllermeier Eyke
Lee Insup
Sale Yusuf
Publication date: 01/01/2023

The University of Manchester - Institutional Repository