Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

Deisenroth, MP

Peters, J

Rasmussen, CE

English

MPG.PuRe

Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions, in Proceedings of the 16th European Symposium on Artificial Neural Networks (ESANN 2008), pages 19–24, Bruges, Belgium, April 2008.

Model-Based Reinforcement Learning with Continuous States and Actions

Marc P. Deisenroth¹∗, Carl E. Rasmussen¹,², and Jan Peters²

1- University of Cambridge - Department of Engineering
Trumpington Street, Cambridge CB2 1PZ - UK
{mpd37|cer54}@cam.ac.uk

2- Max Planck Institute for Biological Cybernetics
Spemannstraße 38, 72070 Tübingen - Germany
jan.peters@tuebingen.mpg.de

Abstract. Finding an optimal policy in a reinforcement learning (RL) framework with continuous state and action spaces is challenging. Approximate solutions are often inevitable. GPDP is an approximate dynamic programming algorithm based on Gaussian process (GP) models for the value functions. In this paper, we extend GPDP to the case of unknown transition dynamics. After building a GP model for the transition dynamics, we apply GPDP to this model and determine a continuous-valued policy in the entire state space. We apply the resulting controller to the underpowered pendulum swing up. Moreover, we compare our results on this RL task to a nearly optimal discrete DP solution in a fully known environment.

1 Introduction

In reinforcement learning (RL) an agent must learn good decision policies based on observations or trial-and-error interactions with a dynamic environment. Dynamic programming (DP) is a common methodology for achieving this task by solving the Bellman equation, which characterizes properties of the value function. Usually, standard table-based algorithms for discrete setups do not straightforwardly apply to continuous state and action domains. Function approximators can generalize the value function to continuous-valued state spaces, but usually they are limited to discrete action domains [1].
In case of non-probabilistic, parametric function approximation, we are restricted to a fixed class of functions and might run into problems in case of noisy data. A state-of-the-art nonparametric Bayesian regression method is provided by the Gaussian process (GP) framework [2]. Model-based policy iteration in continuous state and action spaces based on value function evaluation using GPs is presented in [3]. In [4], model-free policy iteration is proposed within the GP framework to perform both policy evaluation and policy improvement. Gaussian process dynamic programming (GPDP) is a model-based dynamic programming algorithm for fully known dynamics in which the value functions are modeled by GPs [5].

In this paper, we extend GPDP to the case of unknown deterministic dynamics by using a learned system model. Moreover, we compare the performance of the resulting policy to the policy of a benchmark controller using exact dynamics.

∗ M. P. Deisenroth is supported by the German Research Foundation (DFG) through grant RA 1030/1.

2 Reinforcement Learning with Unknown Dynamics

Uncertainty is a key property of RL. Modeling uncertain functions properly is an extremely hard problem in general. GP regression is a powerful Bayesian method to model latent functions without being restricted to a specific parametric function class, such as polynomials. Thus, throughout this paper, we model latent functions by means of GPs that generalize latent functions from a small, finite training set to the entire continuous-valued space. Moreover, confidence intervals are provided.

We consider the undiscounted finite-horizon RL problem of finding a policy $\pi^*$ that minimizes the long-term loss $g_{\mathrm{term}}(x_N) + \sum_{k=0}^{N-1} g(x_k, u_k)$, where $k$ indexes discrete time. Here, $g$ is the immediate loss that depends on the state $x \in \mathbb{R}^{n_x}$ and a chosen control action $u = \pi(x) \in \mathbb{R}^{n_u}$. A state-dependent terminal loss is denoted by $g_{\mathrm{term}}$.

Example: Setup

Throughout this paper, we use the underpowered pendulum swing up as running example.
Initially the pendulum is hanging down. The goal is to swing the pendulum up and to balance it in the inverted position. This task has previously been considered a hard problem [6]. We assume pendulum dynamics following the ODE

\[ \ddot{\varphi}(t) = \frac{-\mu\,\dot{\varphi}(t) + m g l \sin(\varphi(t)) + u(t)}{m l^2} \tag{1} \]

where $\mu = 0.05\,\mathrm{kg\,m^2/s}$ is the coefficient of friction. The applied torque is restricted to $u \in [-5, 5]$ Nm. The characteristic period of the pendulum is approximately 2 s. Angle and angular velocity are denoted by $\varphi$ and $\dot{\varphi}$ and given in rad and rad/s, respectively. The system can be influenced by applying a torque $u$ every 200 ms. The pendulum dynamics (1) are discretized in time with 200 ms between two samples. The immediate loss is $g(x_k, u_k) = x_k^T \mathrm{diag}([1, 0.2])\, x_k + 0.1 u_k^2$, and the optimization horizon is 2 s.

2.1 Learning System Dynamics

In the considered RL setup we assume a priori unknown deterministic system dynamics. If possible, it seems worth estimating a dynamics model since, intuitively, model-based methods make better use of available information [1]. In the first step, we attempt to model the system based on observations of sampled trajectories. We consider discrete-time systems of the form $x_{k+1} = f(x_k, u_k)$. We use a Gaussian process model, the dynamics GP, to model the dynamics and write $f \sim \mathcal{GP}_f$. For each output dimension of $f$ the GP model is fully specified by its mean and covariance functions, which reveal prior beliefs about the latent function [2]. For any new input $(x_*, u_*)$ the predictive distribution of $f(x_*, u_*)$ conditioned on the training data is Gaussian with mean vector $\mu_*$ and covariance matrix $\Sigma_*$. The posterior GP reveals the remaining uncertainty about the underlying latent function $f$. In the limit of infinite data this uncertainty tends to zero and the GP model converges to the deterministic system, such that $\mathcal{GP}_f = f$.

Example: Learning the pendulum dynamics

In case of the underpowered pendulum swing up, the standard deviation of the model is smaller than 0.03 for 400 training examples. More training points increase the confidence. The absolute error between the underlying dynamics $f$ and the mean of the GP model $\mathcal{GP}_f$ is smaller than 0.04.

2.2 Application of Gaussian Process Dynamic Programming

Using the dynamics GP to describe the latent system dynamics $f$, we apply GPDP to derive optimal actions based on finite training sets $\mathcal{X}$ of states and $\mathcal{U}$ of actions. The elements of the training sets are randomly distributed within their domains. GPDP generalizes dynamic programming to continuous state and action domains. A sketch of GPDP is given in Algorithm 1. We model both latent functions $V^*$ and $Q$ by means of Gaussian processes. For each $x \in \mathcal{X}$ we use independent GP models for $Q(x, \cdot)$ rather than modeling $Q(\cdot, \cdot)$ in joint state-action space. This idea is largely based on two observations. First, a good model of $Q$ in joint state-action space requires substantially more training points and makes standard GP models computationally very expensive. Second, the Q-function can be discontinuous in both the $x$ and the $u$ direction. We eliminate one possible source of discontinuity by treating $Q(x_i, \cdot)$ and $Q(x_j, \cdot)$ as independent.

Algorithm 1 GPDP using system model $\mathcal{GP}_f$
 1: input: $\mathcal{GP}_f, \mathcal{X}, \mathcal{U}$
 2: $V_N^*(\mathcal{X}) = g_{\mathrm{term}}(\mathcal{X})$    ▷ terminal loss
 3: $V_N^*(\cdot) \sim \mathcal{GP}_v$    ▷ GP model for $V_N^*$
 4: for $k = N-1$ to $0$ do    ▷ DP recursion (in time)
 5:     for all $x_i \in \mathcal{X}$ do    ▷ for all states
 6:         $Q_k(x_i, \mathcal{U}) = g(x_i, \mathcal{U}) + \mathrm{E}[V_{k+1}^*(x_{k+1}) \mid x_i, \mathcal{U}, \mathcal{GP}_f]$
 7:         $Q_k(x_i, \cdot) \sim \mathcal{GP}_q$    ▷ GP model for $Q$
 8:         $\pi_k^*(x_i) \in \arg\min_u Q_k(x_i, u)$
 9:         $V_k^*(x_i) = Q_k(x_i, \pi_k^*(x_i))$
10:     end for
11:     $V_k^*(\cdot) \sim \mathcal{GP}_v$    ▷ GP model for $V_k^*$
12: end for
13: return $\pi_0^*(\mathcal{X})$    ▷ return set of optimal actions

To determine the Q-value in line 6 of the GPDP algorithm, we have to solve an integral of the form

\[ \mathrm{E}[V^*(f(x,u)) \mid x, u] = \iint V^*(f(x,u))\, p(f(x,u) \mid x, u)\, p(V^*)\, \mathrm{d}f\, \mathrm{d}V^* \tag{2} \]

where the system function $f$ and the value function $V^*$ are modeled by $\mathcal{GP}_f$ and $\mathcal{GP}_v$, respectively. Therefore, the value of the integral is a random variable.
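The recursion in Algorithm 1 can be sketched in a few lines. The following is a minimal illustration under strong assumptions, not the authors' implementation: it uses a hypothetical one-dimensional toy system (`toy_dynamics`) as a stand-in for the mean of a learned dynamics GP, a quadratic loss, fixed (not learned) SE length-scales, a small NumPy-only GP class (`GP1D`), and it ignores model uncertainty by evaluating the mean of the value GP at the predicted next state instead of solving the integral (2).

```python
import numpy as np

def se_kernel(a, b, ell):
    """Squared exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

class GP1D:
    """GP regression with constant mean and fixed, assumed hyperparameters."""
    def __init__(self, x, y, ell, noise=1e-4):
        self.x, self.ell, self.mu = x, ell, y.mean()
        K = se_kernel(x, x, ell) + noise * np.eye(len(x))
        self.alpha = np.linalg.solve(K, y - self.mu)

    def mean(self, xs):
        """Posterior mean at query inputs xs."""
        return self.mu + se_kernel(xs, self.x, self.ell) @ self.alpha

def toy_dynamics(x, u):
    """Hypothetical stand-in for the mean of a learned dynamics GP."""
    return np.clip(x + u, -3.0, 3.0)

def loss(x, u):
    """Hypothetical immediate loss for the toy system."""
    return x**2 + 0.1 * u**2

def gpdp(X, U, horizon):
    """Sketch of Algorithm 1 with model uncertainty ignored."""
    u_fine = np.linspace(U.min(), U.max(), 201)    # candidate actions
    gp_v = GP1D(X, X**2, ell=1.0)                  # lines 2-3, g_term(x) = x^2
    for _ in range(horizon):                       # line 4
        V, pi = np.empty_like(X), np.empty_like(X)
        for i, x in enumerate(X):                  # line 5
            q = loss(x, U) + gp_v.mean(toy_dynamics(x, U))  # line 6
            gp_q = GP1D(U, q, ell=0.5)             # line 7
            q_fine = gp_q.mean(u_fine)
            j = np.argmin(q_fine)                  # line 8
            pi[i], V[i] = u_fine[j], q_fine[j]     # line 9
        gp_v = GP1D(X, V, ell=1.0)                 # line 11
    return pi                                      # line 13, actions for k = 0

X = np.linspace(-3.0, 3.0, 13)   # training states (a grid here, for simplicity)
U = np.linspace(-1.0, 1.0, 9)    # training actions
policy = gpdp(X, U, horizon=3)
```

In this toy problem the returned actions push the state toward the origin, as expected for a quadratic loss. Minimizing the mean of the Q-model on a fine grid is one simple choice; a gradient-based optimizer would serve equally well.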
The GP model of $V^*$ with squared exponential (SE) covariance function and the Gaussian predictive distribution $p(f(x,u) \mid x, u)$, provided by the system model, allow us to determine a distribution of the uncertain integral value (2) by applying the Bayesian Monte Carlo method [7]. We have to integrate over both sources of uncertainty: the model uncertainty of $\mathcal{GP}_f$ and the uncertainty about $V^*$. However, if the model uncertainty of $\mathcal{GP}_f$ tends to zero (given enough data), the integrand of (2) tends to $V^*(f(x,u))\, p(V^*)$, and the distribution of (2) equals the distribution of $V^*$. In this case, $\mathrm{E}[V^*(f(x,u))] = m_v(m_f(x,u))$, where $m_v$ and $m_f$ denote the predictive means of $\mathcal{GP}_v$ and $\mathcal{GP}_f$.

2.3 Determination of a Closed-Loop Policy

We extend the policy from a finite training set to the entire state space as follows. The actions $\pi_0^*(\mathcal{X})$ of the training set $\mathcal{X}$ returned by GPDP are regarded as (uncertain) measurements of an underlying optimal policy. We attempt to model the latent optimal policy by means of a GP. Depending on the loss function and the system dynamics, the choice of an appropriate covariance function is extremely difficult. We approach the problem from a different direction: In applications of control algorithms to real underpowered robots, smoothness of actions is desired to protect the actuators. Thus, we assume that a close-to-optimal policy for an underpowered system is at least piecewise smooth. Possible discontinuities appear at boundaries of a manifold where the sign of the control signal changes.

With this assumption we attempt to model the latent policy $\pi^*$ with two Gaussian processes. One GP is trained only on the subset $\pi_+^*(\mathcal{X}) \subset \pi_0^*(\mathcal{X})$ of positive actions, the other GP on the remaining set denoted by $\pi_-^*(\mathcal{X})$. Note that we know the values $\pi_0^*(\mathcal{X})$ from the GPDP algorithm. Both GP models play the role of local experts in the region of their training sets. We assume that the latent policy is locally smooth. Thus, we use the SE covariance function in $\mathcal{GP}_{\pi_+}$ and $\mathcal{GP}_{\pi_-}$, respectively. It remains to predict the class for new query points.
This problem is solved by a binary classifier. We greedily choose the GP with higher posterior class probability to predict the optimal action for any state. The classifier plays a similar role as the gating network in a mixture-of-experts setting [8]. Finally, we obtain a combined GP model $\mathcal{GP}_\pi$ of an optimal policy in a continuous-valued part of the state space that models discontinuities along the boundaries of a manifold.

Example: Policy learning

The model of an optimal policy for the underpowered pendulum swing up is given in Figure 1. The black crosses and white circles mark the training sets $\pi_+^*(\mathcal{X})$ and $\pi_-^*(\mathcal{X})$, respectively. Discontinuities at the boundaries of the diagonal band (upper left to lower right corner) represent states where the maximum applicable torque is just not strong enough to bring the pendulum to the inverted position. The decision of the controller is to use torque in the opposite direction, to exploit the dynamics of the pendulum, and to bring it to the goal state from the other side.

Especially in real systems where actions have to be smooth to protect actuators, the suggested method seems reasonable and applicable. Moreover, it seems more general compared to directly learning the policy from training data with a single GP and a problem-specific covariance function. Correct problem-specific covariance functions may perform better. However, a lot of expert knowledge is needed in case of possibly nonsmooth predicted policies.

Fig. 1: Optimal policy modeled by switching between two GPs, each of which is trained on a different subset (white circles and black crosses) of the optimal actions $\pi_0^*(\mathcal{X})$ returned by GPDP. (Axes: angle in rad, angular velocity in rad/s.)

GPDP scales in $O(|\mathcal{X}||\mathcal{U}|^3 + |\mathcal{X}|^3)$ per time step: A GP model for $Q(x, \cdot)$ for all $x \in \mathcal{X}$ and one GP model for the V-function are used. Standard DP scales in $O(|\mathcal{X}_{DP}|^2 |\mathcal{U}_{DP}|)$ with substantially bigger sets $\mathcal{U}_{DP}, \mathcal{X}_{DP}$.
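To get a feeling for the two scaling terms, the leading-order operation counts can be evaluated for concrete set sizes. The sizes below are hypothetical and chosen for illustration only; they are not taken from the paper, and constant factors hidden by the O-notation are ignored.

```python
def gpdp_cost(n_states, n_actions):
    """Leading-order count per time step: |X| |U|^3 + |X|^3."""
    return n_states * n_actions**3 + n_states**3

def dp_cost(n_states, n_actions):
    """Leading-order count per time step: |X_DP|^2 |U_DP|."""
    return n_states**2 * n_actions

# Hypothetical set sizes (assumptions, not values from the paper).
gpdp_ops = gpdp_cost(n_states=400, n_actions=25)
dp_ops = dp_cost(n_states=10_000, n_actions=100)
print(f"GPDP ~ {gpdp_ops:.2e} operations, DP ~ {dp_ops:.2e} operations")
```

With these assumed sizes the cubic term $|\mathcal{X}|^3$ dominates the GPDP count, while the DP count is driven by the squared number of discretized states.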
A strength of GPDP is that, in case of stochastic dynamics, the corresponding RL problem can be solved with no additional computational and memory requirements. In general, the benchmark controller is then no longer applicable because of the enormous memory demand if a full transition matrix has to be stored. Another point worth mentioning is that the sets $\mathcal{X}$, $\mathcal{U}$ in GPDP can be time-variant. They just serve as training points for the GP models generalizing the value functions to continuous-valued domains.

Example: Performance

We simulate the time-discretized pendulum for 5 s. We consider a benchmark controller based on exact DP in discretized state and action spaces with $7.5 \times 10^7$ states in joint state-action space as almost optimal. Figure 2 shows the results of applying both controllers. The dashed blue lines represent the optimal solution of the DP controller with fully known dynamics. The solid green lines are the solution of the GPDP controller based on learned dynamics. The upper panels describe trajectories of angle $\varphi$ and angular velocity $\dot{\varphi}$, respectively. Applied control actions are given in the lower panel, where the error bars in the GPDP solution describe the confidence when applying the corresponding control action (twice the standard deviation). The state trajectories (upper panels) almost coincide, and the chosen control actions differ slightly. In the considered example, GPDP based on learned dynamics causes 1.66% more cumulative loss over 5 s.

Fig. 2: Trajectories of angle, angular velocity and applied actions for discretized DP and continuous GPDP. The trajectories almost coincide although GPDP uses learned dynamics. (Panels: angle, angular velocity, and control versus time in s, from 0 to 5 s; curves: GPDP and DP.)

3 Summary

In this paper, we extended a Gaussian process dynamic programming algorithm to the case of unknown deterministic dynamics. We assumed that the system can be modeled by means of a Gaussian process.
For this setting, we obtained a closed-loop policy for continuous-valued state and action domains. We showed that in the case of the underpowered pendulum swing up, the policy based on the learned system model performs almost as well as a computationally expensive optimal controller.

References

[1] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, USA, 1996.
[2] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[3] Carl E. Rasmussen and Malte Kuss. Gaussian Processes in Reinforcement Learning. In Advances in Neural Information Processing Systems 16, pp. 751–759, June 2004.
[4] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement Learning with Gaussian Processes. In 22nd International Conference on Machine Learning, pp. 201–208, August 2005.
[5] Marc P. Deisenroth, Jan Peters, and Carl E. Rasmussen. Approximate Dynamic Programming with Gaussian Processes. In 27th American Control Conference, June 2008.
[6] Christopher G. Atkeson and Stefan Schaal. Robot Learning from Demonstration. In 14th International Conference on Machine Learning, pp. 12–20, July 1997.
[7] Carl E. Rasmussen and Zoubin Ghahramani. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, pp. 489–496, 2003.
[8] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive Mixtures of Local Experts. Neural Computation, 3:79–87, 1991.

TUbiblio

Spiral - Imperial College Digital Repository

Errata

Model-Based Reinforcement Learning with Continuous States and Actions
(Deisenroth, Rasmussen, Peters, ESANN 2008)

April 15, 2008

1 Errata

• page 2: For the example system (underpowered pendulum) we used the immediate loss $g(x_k, u_k) = 1 - \exp(-x_k^T \mathrm{diag}([1, 0.2])\, x_k - 0.1 u_k^2)$ instead of $x_k^T \mathrm{diag}([1, 0.2])\, x_k + 0.1 u_k^2$ as claimed in the paper.
• page 5: The difference between the cumulative loss of the GP controller and the DP controller is 3.1%, not 1.66%.
• page 6: Figure 2 (page 6) in the paper should be replaced by the following one.

[Figure: replacement for Fig. 2. Panels: angle, angular velocity, and control versus time in s, from 0 to 5 s; curves: GPDP and DP.]
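The two immediate-loss functions named in the first erratum can be compared numerically. The sketch below assumes the state is ordered as $x = (\varphi, \dot{\varphi})$ with $\varphi = 0$ at the inverted goal position; this ordering and the function names are assumptions made for illustration.

```python
import numpy as np

W = np.diag([1.0, 0.2])  # state weights diag([1, 0.2])

def quadratic_loss(x, u):
    """Immediate loss as printed in the paper (unbounded)."""
    return float(x @ W @ x + 0.1 * u**2)

def saturating_loss(x, u):
    """Immediate loss according to the errata (bounded by 1)."""
    return float(1.0 - np.exp(-(x @ W @ x) - 0.1 * u**2))

goal = np.array([0.0, 0.0])      # inverted position, at rest (assumed convention)
down = np.array([np.pi, 0.0])    # hanging down, at rest (assumed convention)

for name, x in [("goal", goal), ("down", down)]:
    print(name, quadratic_loss(x, 0.0), saturating_loss(x, 0.0))
```

Both losses vanish at the goal state. Far from the goal the quadratic form keeps growing, whereas the saturating form approaches 1, so all distant states are penalized almost equally.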

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.218.5599
