Gaussian Processes in Reinforcement Learning: Stability Analysis and Efficient Value Propagation
Control of nonlinear systems on continuous domains is a challenging task for various reasons.
For robust and accurate control of complex systems, a precise model of the system dynamics is
essential. Building such highly precise dynamics models from physical knowledge often requires
substantial manual effort and poses a great challenge in industrial applications. Acquiring a model
automatically from system measurements using regression techniques reduces the manual
effort and thus offers an attractive alternative to knowledge-based modeling. Based on
such a learned dynamics model, an approximately optimal controller can be inferred automatically.
Such approaches, which learn optimal control from interactions with the system, are the
subject of model-based reinforcement learning (RL). Especially when probabilistic dynamics
models such
as Gaussian processes are employed, model-based RL has been tremendously successful and has
attracted much attention from both the control and machine learning communities. However,
several problems need to be solved before model-based RL can be deployed widely for
learning control in real world scenarios. In this thesis, we address two such limitations,
both of which are indispensable prerequisites for applying model-based RL to real world
tasks.
In many real world applications, a poor controller can cause severe damage to the system or
even put the safety of humans at risk. Thus, it is essential to ensure that the controlled system
behaves as desired. While this question has been studied extensively in classical control, stability
of closed-loop control systems with dynamics given as a Gaussian process has not been considered
yet. We propose an automatic tool to compute regions of the state space where the desired behavior
of the system can be guaranteed. We consider dynamics given as the mean of a GP as well as
the full GP posterior distribution. In the first case, the proposed tool constructs regions of the
state space such that all trajectories starting in this region converge to the target state. From
this asymptotic result, we derive statements for finite time horizons and for stability in the
presence of disturbances.
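To illustrate the kind of statement made in the first case, the following minimal sketch checks numerically which start states lead to trajectories of the GP mean dynamics that converge to the target. It is a brute-force simulation stand-in, not the analytical tool developed in the thesis, and the interface (f_mean, a callable returning the next state under the GP posterior mean) is assumed for illustration:

```python
import numpy as np

def convergent_starts(f_mean, target, starts, horizon=1000, tol=1e-3):
    """Brute-force stand-in for a stability-region check: simulate the
    GP *mean* dynamics from each candidate start state and keep those
    whose trajectory ends up close to the target state."""
    region = []
    for x0 in starts:
        x = np.asarray(x0, dtype=float)
        for _ in range(horizon):
            x = f_mean(x)                      # one step of the mean dynamics
        if np.linalg.norm(x - np.asarray(target)) < tol:
            region.append(x0)                  # x0 appears to lie in the region
    return region
```

Unlike this sampling-based check, the tool proposed in the thesis constructs such regions with an actual guarantee of convergence.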
In the second case, the system dynamics are given as the full GP posterior distribution.
Thus, computation of multi-step-ahead predictions requires averaging over all plausible dynamics
models given the observations. As a consequence, these predictions become analytically
intractable. We propose an approximation based on numerical quadrature that can handle complex
state distributions, e.g., distributions with multiple modes, and that provides upper bounds for the approximation
error. Exploiting these error bounds, we present an automatic tool to compute stability regions. In
these regions of the state space, our tool guarantees that for a finite time horizon the system behaves
as desired with a given probability. Furthermore, we analyze the asymptotic behavior of
closed-loop control systems with dynamics given as a GP posterior distribution. In this case,
we show that for
some common choices of the prior, the system has a unique stationary distribution to which the
system state converges irrespective of the starting state.
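As a concrete illustration of the quadrature-based approximation, the sketch below propagates a one-dimensional Gaussian state distribution one step through a GP posterior using Gauss-Hermite quadrature. The interface (gp_mean, gp_var, giving the pointwise posterior mean and variance) is assumed, and the sketch moment-matches the result to a Gaussian for brevity, whereas the approximation in the thesis can represent more complex, e.g., multimodal, distributions and comes with error bounds:

```python
import numpy as np

def gh_step(mean, var, gp_mean, gp_var, n_nodes=20):
    """One-step prediction of a 1-D state x ~ N(mean, var) through a GP
    posterior, integrating over the input distribution with Gauss-Hermite
    quadrature (illustrative sketch)."""
    # Nodes/weights for the weight function exp(-x^2/2); dividing the
    # weights by sqrt(2*pi) turns them into weights for N(0, 1).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = weights / np.sqrt(2.0 * np.pi)
    xs = mean + np.sqrt(var) * nodes          # quadrature nodes in state space

    # First two moments of p(x'), averaging the GP's pointwise predictions
    # over the current state distribution.
    m1 = np.sum(w * gp_mean(xs))
    m2 = np.sum(w * (gp_var(xs) + gp_mean(xs) ** 2))
    return m1, m2 - m1 ** 2                   # mean and variance of x'
```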
Another major challenge of RL for real world control applications is to minimize the number
of system interactions required for learning. While RL approaches based on GP dynamics
models have demonstrated great data efficiency, the average number of required system
interactions can be reduced further. To achieve this goal, we propose to employ the numerical
quadrature-based approximation
approximation to propagate the value of a state. To show how this approximation can further
increase data efficiency, we employ it in the two main classes of model-based RL: policy search
and value iteration. In policy search, the state distribution must be computed to evaluate the
expected long-term reward of a policy. The proposed numerical quadrature-based approximation
substantially improves estimates of the expected long-term reward and its gradients. As a result,
data efficiency is significantly increased.
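The role of this approximation in policy search can be sketched as follows: the expected long-term reward of a fixed policy is accumulated while the state distribution is propagated step by step. The names step (one closed-loop prediction step, e.g., the quadrature-based step sketched above) and reward are assumed interfaces:

```python
import numpy as np

def gh_expect(f, mean, var, n_nodes=20):
    # E[f(x)] for x ~ N(mean, var), computed with Gauss-Hermite quadrature.
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    return np.sum(weights * f(mean + np.sqrt(var) * nodes)) / np.sqrt(2.0 * np.pi)

def expected_return(mean0, var0, step, reward, horizon):
    """Accumulate the expected immediate reward while propagating the
    (approximate) Gaussian state distribution (illustrative sketch)."""
    mean, var, J = float(mean0), float(var0), 0.0
    for _ in range(horizon):
        J += gh_expect(reward, mean, var)     # expected reward at this step
        mean, var = step(mean, var)           # approximate one-step prediction
    return J
```

Gradients of this quantity with respect to the policy parameters then drive the policy update, which is why more accurate estimates of these expectations translate directly into better updates and fewer required system interactions.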
For value-function-based approaches to policy learning, the value propagation step is
completely characterized by the Bellman equation. However, this equation is intractable for
nonlinear dynamics. In this case, we propose a projection-based value iteration approach. We
employ numerical quadrature to facilitate projection of the value function onto a linear feature
space. Suitable features for value function representation are learned online without manual effort.
This feature learning is constructed such that upper bounds for the projection error can be obtained.
The proposed value iteration approach learns globally optimal policies and significantly benefits
from the introduced highly accurate approximations.
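A minimal sketch of the projection step, under the simplifying assumption of a fixed feature map (the thesis learns suitable features online and bounds the projection error): each Bellman backup is evaluated on a set of support states, with the expectation over successor states again computable by numerical quadrature, and the backed-up values are projected onto the linear feature space by least squares. All names (phi, backup, etc.) are illustrative:

```python
import numpy as np

def projected_value_iteration(states, phi, backup, n_iter=50):
    """Value iteration with a linear value function V(x) = phi(x) @ w.

    states: (N, d) array of support states for the projection
    phi:    callable x -> feature vector (here fixed, not learned)
    backup: callable (V, x) -> max_u E[r(x, u) + gamma * V(x')], the
            Bellman backup with the expectation over x' approximated,
            e.g., by numerical quadrature (assumed interface)
    """
    Phi = np.stack([phi(x) for x in states])          # (N, F) design matrix
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        V = lambda x, w=w: phi(x) @ w                 # current value estimate
        targets = np.array([backup(V, x) for x in states])
        # L2 projection of the backed-up values onto span{phi}.
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return w
```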