Modelling transition dynamics in MDPs with RKHS embeddings
We propose a new, nonparametric approach to learning and representing
transition dynamics in Markov decision processes (MDPs), which can be combined
easily with dynamic programming methods for policy optimisation and value
estimation. This approach makes use of a recently developed representation of
conditional distributions as \emph{embeddings} in a reproducing kernel Hilbert
space (RKHS). Such representations bypass the need for estimating transition
probabilities or densities, and apply to any domain on which kernels can be
defined. This avoids the need to calculate intractable integrals, since
expectations are represented as RKHS inner products whose computation has
linear complexity in the number of points used to represent the embedding. We
provide guarantees for the proposed applications in MDPs: in the context of a
value iteration algorithm, we prove convergence to either the optimal policy,
or to the closest projection of the optimal policy in our model class (an
RKHS), under reasonable assumptions. In experiments, we investigate a learning
task in a typical classical control setting (the under-actuated pendulum), and
on a navigation problem where only images from a sensor are observed. For
policy optimisation we compare with least-squares policy iteration where a
Gaussian process is used for value function estimation. For value estimation we
also compare to the NPDP method. Our approach achieves better performance in
all experiments. (Comment: ICML201)
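The abstract's central computational claim — expectations represented as RKHS inner products, with cost linear in the number of points representing the embedding — can be illustrated with a minimal conditional mean embedding sketch. The RBF kernel, regulariser, and toy dynamics below are illustrative assumptions, not the paper's actual experimental setup:

```python
# Sketch of the conditional mean embedding (CME) idea: from transition
# samples (s_i, s'_i), the embedding of P(s'|s) is estimated so that
# E[V(s')|s] becomes a weighted sum of V at the observed next states,
# with no transition density ever estimated.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between row-stacked point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def cme_expected_value(S, S_next, V, s_query, lam=1e-3):
    """Approximate E[V(s') | s = s_query] as an RKHS inner product:
    a weighted sum of V at sampled next states (linear in sample size
    once the weights are computed)."""
    n = len(S)
    K = rbf_kernel(S, S)  # Gram matrix on input states
    alpha = np.linalg.solve(K + n * lam * np.eye(n),
                            rbf_kernel(S, s_query[None]))
    return float(alpha[:, 0] @ V(S_next))  # inner product <mu(s), V>

# Toy usage: noisy linear dynamics s' = 0.9 s + noise, V(s) = s^2.
rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(200, 1))
S_next = 0.9 * S + 0.05 * rng.standard_normal(S.shape)
est = cme_expected_value(S, S_next, lambda X: (X ** 2).sum(-1),
                         np.array([0.5]))
```

Because the expectation is a weighted sum over sampled next states, the same code applies unchanged to any domain on which a kernel can be defined.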
Learning of non-parametric control policies with high-dimensional state features
Learning complex control policies from high-dimensional sensory input is a challenge for
reinforcement learning algorithms. Kernel methods that approximate value functions
or transition models can address this problem. Yet, many current approaches rely on
unstable greedy maximization. In this paper, we develop a policy search algorithm that
integrates robust policy updates and kernel embeddings. Our method can learn nonparametric
control policies for infinite-horizon continuous MDPs with high-dimensional
sensory representations. We show that our method outperforms related approaches, and
that our algorithm can learn an underpowered swing-up task directly from high-dimensional
image data.
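The contrast the abstract draws — robust, soft policy updates versus unstable greedy maximization — can be sketched with a nonparametric policy built by kernel regression over sampled state-action pairs, with exponentiated-advantage sample weights in place of an argmax. The temperature, kernel width, and toy advantage signal are illustrative assumptions:

```python
# Minimal sketch: a nonparametric policy as a kernel-weighted average of
# sampled actions, with soft sample weights replacing a hard argmax so
# the policy update stays smooth.
import numpy as np

def soft_weights(advantages, eta=1.0):
    """Soft weighting: better samples count more, but no hard
    maximization; eta controls how greedy the update is."""
    a = advantages - advantages.max()  # numerical stability
    w = np.exp(a / eta)
    return w / w.sum()

def nonparametric_policy(S, A, w, s_query, gamma=5.0):
    """Mean action at s_query as a kernel-weighted average of sampled
    actions; works for any state representation a kernel accepts
    (including features extracted from images)."""
    k = np.exp(-gamma * ((S - s_query) ** 2).sum(-1))
    kw = k * w
    return (kw[:, None] * A).sum(0) / (kw.sum() + 1e-12)

rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(100, 1))
A = -S + 0.1 * rng.standard_normal((100, 1))  # samples near a = -s
adv = -(A + S) ** 2                           # toy advantage: prefer a = -s
w = soft_weights(adv.ravel(), eta=0.5)
a = nonparametric_policy(S, A, w, np.array([0.3]))
```

The queried action stays close to the locally best sampled actions (here roughly a = -0.3) without ever maximizing greedily over the action space.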
No-Regret Reinforcement Learning with Value Function Approximation: a Kernel Embedding Approach
We consider the regret minimization problem in reinforcement learning (RL) in
the episodic setting. In many real-world RL environments, the state and action
spaces are continuous or very large. Existing approaches establish regret
guarantees by either a low-dimensional representation of the stochastic
transition model or an approximation of the Q-functions. However, the
understanding of function approximation schemes for state-value functions
largely remains missing. In this paper, we propose an online model-based RL
algorithm, namely the CME-RL, that learns representations of transition
distributions as embeddings in a reproducing kernel Hilbert space while
carefully balancing the exploitation-exploration tradeoff. We demonstrate the
efficiency of our algorithm by proving a frequentist (worst-case) regret bound
that is of order $\tilde{O}(H\gamma_N\sqrt{N})$, where $H$ is the episode
length, $N$ is the total number of time steps, and $\gamma_N$ is an
information-theoretic quantity relating the effective dimension of the
state-action feature space. Our method bypasses the need for estimating
transition probabilities and applies to any domain on which kernels can be
defined. It also brings new insights into the general theory of kernel methods
for approximate inference and RL regret minimization.
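The exploitation-exploration balance mentioned above is typically achieved in kernel-based RL by adding an optimism bonus shaped like an RKHS posterior variance; the sketch below shows that standard construction, not the paper's exact bonus. The kernel, scale beta, and regulariser are illustrative assumptions:

```python
# Hedged sketch of a kernel-UCB style exploration bonus: large for
# state-action pairs unlike anything observed so far, near zero close
# to the data.
import numpy as np

def rbf(A, B, gamma=1.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def exploration_bonus(X, x, beta=1.0, lam=1e-3):
    """beta * sqrt(k(x,x) - k_x^T (K + lam I)^{-1} k_x), the RKHS
    'posterior variance' of x relative to observed points X."""
    K = rbf(X, X) + lam * np.eye(len(X))
    kx = rbf(X, x[None])[:, 0]
    var = rbf(x[None], x[None])[0, 0] - kx @ np.linalg.solve(K, kx)
    return beta * np.sqrt(max(var, 0.0))

X = np.array([[0.0], [0.1], [0.2]])            # observed feature points
near = exploration_bonus(X, np.array([0.05]))  # close to data: small bonus
far = exploration_bonus(X, np.array([3.0]))    # far from data: large bonus
```

Acting optimistically with respect to such a bonus is what drives the agent to visit underexplored regions while still exploiting the learned embedding.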
Data-Driven Stochastic Optimal Control Using Kernel Gradients
We present an empirical, gradient-based method for solving data-driven
stochastic optimal control problems using the theory of kernel embeddings of
distributions. By embedding the integral operator of a stochastic kernel in a
reproducing kernel Hilbert space, we can compute an empirical approximation of
stochastic optimal control problems, which can then be solved efficiently using
the properties of the RKHS. Existing approaches typically rely upon finite
control spaces or optimize over policies with finite support to enable
optimization. In contrast, our approach uses kernel-based gradients computed
using observed data to approximate the cost surface of the optimal control
problem, which can then be optimized using gradient descent. We apply our
technique to the area of data-driven stochastic optimal control, and
demonstrate our proposed approach on a linear regulation problem for comparison
and on a nonlinear target tracking problem.
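The key move described above — approximating the cost surface from observed data with kernels and then descending its gradient, rather than optimizing over a finite control space — can be sketched in one dimension. The kernel width, ridge parameter, step size, and toy cost are illustrative assumptions:

```python
# Sketch of the kernel-gradient idea: fit the empirical cost surface
# from sampled (control, cost) pairs with kernel ridge regression, then
# run plain gradient descent on the fitted surface's analytic gradient.
import numpy as np

def fit_cost(U, c, gamma=2.0, lam=1e-1):
    """Kernel ridge fit of the empirical cost surface J(u), returning
    both the fitted cost and its analytic gradient."""
    K = np.exp(-gamma * (U[:, None] - U[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(U)), c)

    def J(u):
        k = np.exp(-gamma * (U - u) ** 2)
        return k @ alpha

    def dJ(u):
        k = np.exp(-gamma * (U - u) ** 2)
        # d/du exp(-gamma (U_i - u)^2) = -2 gamma (u - U_i) k_i
        return (-2.0 * gamma * (u - U) * k) @ alpha

    return J, dJ

rng = np.random.default_rng(2)
U = rng.uniform(-2, 2, 100)                            # sampled controls
c = (U - 0.7) ** 2 + 0.05 * rng.standard_normal(100)   # noisy observed costs
J, dJ = fit_cost(U, c)
u = -1.5
for _ in range(200):   # gradient descent on the fitted cost surface
    u -= 0.1 * dJ(u)
```

Because the gradient is taken on the kernel approximation rather than the unknown true cost, the controls can range over a continuum instead of a finite candidate set.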
Improving the Practicality of Model-Based Reinforcement Learning: An Investigation into Scaling up Model-Based Methods in Online Settings
This thesis is a response to the current scarcity of practical model-based control algorithms in the reinforcement learning (RL) framework. As of yet there is no consensus on how best to integrate imperfect transition models into RL whilst mitigating policy improvement instabilities in online settings. Current state-of-the-art policy learning algorithms that surpass human performance often rely on model-free approaches that enjoy unmitigated sampling of transition data. Model-based RL (MBRL) instead attempts to distil experience into transition models that allow agents to plan new policies without needing to return to the environment and sample more data. The initial focus of this investigation is on kernel conditional mean embeddings (CMEs) (Song et al., 2009) deployed in an approximate policy iteration (API) algorithm (Grünewälder et al., 2012a). This existing MBRL algorithm boasts theoretically stable policy updates in continuous state and discrete action spaces. The Bellman operator’s value function and (transition) conditional expectation are modelled and embedded respectively as functions in a reproducing kernel Hilbert space (RKHS). The resulting finite-induced approximate pseudo-MDP (Yao et al., 2014a) can be solved exactly in a dynamic programming algorithm with policy improvement suboptimality guarantees. However, model construction and policy planning scale cubically and quadratically respectively with the training set size, rendering the CME impractical for sample-abundant tasks in online settings. Three variants of CME API are investigated to strike a balance between stable policy updates and reduced computational complexity. The first variant models the value function and state-action representation explicitly in a parametric CME (PCME) algorithm with favourable computational complexity. However, a soft conservative policy update technique is developed to mitigate policy learning oscillations in the planning process.
The second variant returns to the non-parametric embedding and contributes (along with external work) to the compressed CME (CCME): a sparse and computationally more favourable CME. The final variant is a fully end-to-end differentiable embedding trained with stochastic gradient updates. The value function remains modelled in an RKHS such that backprop is driven by a non-parametric RKHS loss function. The actively compressed CME (ACCME) satisfies the pseudo-MDP contraction constraint using a sparse softmax activation function. The size of the pseudo-MDP (i.e. the size of the embedding’s last layer) is controlled by sparsifying the last-layer weight matrix, extending the truncated gradient method (Langford et al., 2009) with group lasso updates in a novel ‘use it or lose it’ neuron pruning mechanism. Surprisingly, this technique does not require extensive fine-tuning between control tasks.
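The ‘use it or lose it’ pruning step described above combines group lasso with truncated-gradient updates; its core operation can be sketched as a group soft-threshold on the rows of a last-layer weight matrix, so a whole neuron's outgoing weights shrink together and vanish at once. The threshold value and weight matrix are illustrative assumptions:

```python
# Illustrative sketch of group-lasso neuron pruning: the proximal
# (soft-threshold) step shrinks each row of the weight matrix toward
# zero as a group, and rows whose joint norm falls below the threshold
# are pruned to exactly zero.
import numpy as np

def group_soft_threshold(W, thresh):
    """Proximal group-lasso update on rows of W: rows with norm
    <= thresh are zeroed (neuron pruned); other rows shrink by thresh."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - thresh / np.maximum(norms, 1e-12), 0.0)
    return W * scale

W = np.array([[0.5, 0.5],     # active neuron: shrunk, kept
              [0.01, 0.02],   # weak neuron: norm below threshold, pruned
              [1.0, -1.0]])   # strong neuron: shrunk, kept
W_pruned = group_soft_threshold(W, 0.1)
```

Interleaving this step with gradient updates removes entire neurons (here, the second row) rather than scattering zeros across the matrix, which is what shrinks the pseudo-MDP's effective size.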
A New Distribution-Free Concept for Representing, Comparing, and Propagating Uncertainty in Dynamical Systems with Kernel Probabilistic Programming
This work presents the concept of kernel mean embedding and kernel
probabilistic programming in the context of stochastic systems. We propose
formulations to represent, compare, and propagate uncertainties for fairly
general stochastic dynamics in a distribution-free manner. The new tools enjoy
sound theory rooted in functional analysis and wide applicability as
demonstrated in distinct numerical examples. The implication of this new
concept is a new mode of thinking about the statistical nature of uncertainty
in dynamical systems.
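The distribution-free comparison of uncertainties the abstract describes is typically done by measuring the distance between kernel mean embeddings of two sample sets (the maximum mean discrepancy, MMD). The kernel choice and toy sample laws below are illustrative assumptions:

```python
# Sketch: compare two uncertain quantities, each represented only by
# samples, via the squared MMD between their kernel mean embeddings.
# No density or parametric family is ever assumed.
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """(Biased) squared MMD between sample sets X and Y under an RBF
    kernel: ||mean embedding of X - mean embedding of Y||^2 in the RKHS."""
    def k(A, B):
        return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 1))         # samples of one uncertainty
Y = rng.standard_normal((300, 1))         # same law: MMD^2 near zero
Z = 2.0 + rng.standard_normal((300, 1))   # shifted law: MMD^2 large
same = mmd2(X, Y)
diff = mmd2(X, Z)
```

Propagating samples through the dynamics and re-embedding them gives the corresponding distribution-free way to track how uncertainty evolves over time.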