    Intelligent Control of a Sensor-Actuator System via Kernelized Least-Squares Policy Iteration

    This paper proposes a new framework, Compressive Kernelized Reinforcement Learning (CKRL), for computing near-optimal policies in sequential decision making under uncertainty. It combines non-adaptive, data-independent Random Projections with nonparametric Kernelized Least-Squares Policy Iteration (KLSPI). Random Projections are a fast, non-adaptive dimensionality reduction framework in which high-dimensional data is projected onto a random lower-dimensional subspace via spherically random rotation and coordinate sampling. KLSPI introduces the kernel trick into the LSPI framework for Reinforcement Learning, often achieving faster convergence and providing automatic feature selection via various kernel sparsification approaches. In this approach, policies are computed in a low-dimensional subspace generated by projecting the high-dimensional features onto a random basis. We first show how Random Projections constitute an efficient sparsification technique and how our method often converges faster than regular LSPI, at lower computational cost. The theoretical foundation underlying this approach is a fast approximation of Singular Value Decomposition (SVD). Finally, simulation results on benchmark MDP domains confirm gains both in computation time and in performance in large feature spaces.
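
    The idea of computing a policy evaluation step in a randomly projected feature space can be illustrated roughly as follows. This is a minimal sketch, not the CKRL algorithm itself: it assumes a plain Gaussian projection matrix and a standard LSTD-Q solve on synthetic features, and the names `random_projection` and `lstdq_weights` are hypothetical.

```python
import numpy as np

def random_projection(n_features, n_components, rng):
    """Gaussian projection matrix, scaled so squared norms are preserved in expectation."""
    return rng.normal(size=(n_features, n_components)) / np.sqrt(n_components)

def lstdq_weights(phi, phi_next, rewards, gamma=0.95, ridge=1e-6):
    """Solve the LSTD-Q fixed point A w = b on the given (projected) features."""
    A = phi.T @ (phi - gamma * phi_next) + ridge * np.eye(phi.shape[1])
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Synthetic stand-ins for state-action features (assumption: no real MDP here).
rng = np.random.default_rng(0)
n_samples, n_features, n_components = 500, 2000, 50
phi = rng.normal(size=(n_samples, n_features))       # features of (s, a)
phi_next = rng.normal(size=(n_samples, n_features))  # features of (s', pi(s'))
rewards = rng.normal(size=n_samples)

P = random_projection(n_features, n_components, rng)
w = lstdq_weights(phi @ P, phi_next @ P, rewards)    # 50 weights instead of 2000
```

    The linear solve now involves a 50 x 50 system rather than a 2000 x 2000 one, which is where the computational saving described in the abstract would come from.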

    Improving the Practicality of Model-Based Reinforcement Learning: An Investigation into Scaling up Model-Based Methods in Online Settings

    This thesis is a response to the current scarcity of practical model-based control algorithms in the reinforcement learning (RL) framework. As of yet there is no consensus on how best to integrate imperfect transition models into RL whilst mitigating policy improvement instabilities in online settings. Current state-of-the-art policy learning algorithms that surpass human performance often rely on model-free approaches that enjoy unmitigated sampling of transition data. Model-based RL (MBRL) instead attempts to distil experience into transition models that allow agents to plan new policies without needing to return to the environment and sample more data. The initial focus of this investigation is on kernel conditional mean embeddings (CMEs) (Song et al., 2009) deployed in an approximate policy iteration (API) algorithm (Grünewälder et al., 2012a). This existing MBRL algorithm boasts theoretically stable policy updates in continuous state and discrete action spaces. The Bellman operator’s value function and (transition) conditional expectation are modelled and embedded respectively as functions in a reproducing kernel Hilbert space (RKHS). The resulting finite-induced approximate pseudo-MDP (Yao et al., 2014a) can be solved exactly in a dynamic programming algorithm with policy improvement suboptimality guarantees. However, model construction and policy planning scale cubically and quadratically respectively with the training set size, rendering the CME impractical for sample-abundant tasks in online settings. Three variants of CME API are investigated to strike a balance between stable policy updates and reduced computational complexity. The first variant models the value function and state-action representation explicitly in a parametric CME (PCME) algorithm with favourable computational complexity. However, a soft conservative policy update technique is developed to mitigate policy learning oscillations in the planning process. 
The second variant returns to the non-parametric embedding and contributes (along with external work) to the compressed CME (CCME), a sparse and computationally more favourable CME. The final variant is a fully end-to-end differentiable embedding trained with stochastic gradient updates. The value function remains modelled in an RKHS such that backprop is driven by a non-parametric RKHS loss function. Actively compressed CME (ACCME) satisfies the pseudo-MDP contraction constraint using a sparse softmax activation function. The size of the pseudo-MDP (i.e. the size of the embedding’s last layer) is controlled by sparsifying the last layer weight matrix by extending the truncated gradient method (Langford et al., 2009) with group lasso updates in a novel ‘use it or lose it’ neuron pruning mechanism. Surprisingly, this technique does not require extensive fine-tuning between control tasks.

    On-line policy learning and adaptation for real-time personalization of an artificial pancreas

    The dynamic complexity of the glucose-insulin metabolism in diabetic patients is the main obstacle towards widespread use of an artificial pancreas. The significant level of subject-specific glycemic variability requires continuously adapting the control policy to successfully face daily changes in the patient's metabolism and lifestyle. This paper proposes an on-line selective reinforcement learning algorithm that enables real-time adaptation of a control policy, based on ongoing interactions with the patient, so as to tailor the artificial pancreas. Adaptation includes two on-line procedures: on-line sparsification and parameter updating of the Gaussian process used to approximate the control policy. With the proposed sparsification method, the support data dictionary for on-line learning is modified by checking whether the arriving data stream contains novel information to be added to the dictionary in order to personalize the policy. Results obtained in in silico experiments demonstrate that on-line policy learning is both safe and efficient for maintaining blood glucose variability within the normoglycemic range.
    Affiliation: de Paula, Mariano. Universidad Nacional del Centro de la Provincia de Buenos Aires. Facultad de Ingeniería Olavarria. Departamento de Electromecánica. Grupo INTELYMEC; Argentina. Universidad Nacional del Centro de la Pcia.de Bs.as.. Centro de Investigaciones En Fisica E Ingenieria del Centro de la Provincia de Buenos Aires. - Consejo Nacional de Investigaciones Cientificas y Tecnicas. Centro Cientifico Tecnologico Conicet - Tandil. Centro de Investigaciones En Fisica E Ingenieria del Centro de la Provincia de Buenos Aires. - Provincia de Buenos Aires. Gobernacion. Comision de Invest.cientificas. Centro de Investigaciones En Fisica E Ingenieria del Centro de la Provincia de Buenos Aires; Argentina
    Affiliation: Acosta, Gerardo Gabriel. Universidad Nacional del Centro de la Provincia de Buenos Aires. Facultad de Ingenieria Olavarria; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Affiliation: Martinez, Ernesto Carlos. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Desarrollo y Diseño. Universidad Tecnológica Nacional. Facultad Regional Santa Fe. Instituto de Desarrollo y Diseño; Argentin
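
    The abstract of this entry does not say which novelty criterion its on-line sparsification uses. As an illustration only, the sketch below assumes an approximate-linear-dependence (ALD) test with an RBF kernel: a new sample joins the dictionary only if its kernel feature is poorly represented by the features already stored. The threshold, lengthscale, and function names are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """RBF kernel matrix between two sets of row vectors."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def is_novel(dictionary, x, threshold=0.1, jitter=1e-8):
    """ALD test: residual of projecting k(x, .) onto the span of the dictionary."""
    if len(dictionary) == 0:
        return True
    D = np.asarray(dictionary)
    K = rbf(D, D) + jitter * np.eye(len(D))
    k = rbf(D, x)[:, 0]
    residual = rbf(x, x)[0, 0] - k @ np.linalg.solve(K, k)
    return residual > threshold

def update_dictionary(dictionary, x, **kwargs):
    """Append x only when it carries information the dictionary lacks."""
    if is_novel(dictionary, x, **kwargs):
        dictionary.append(np.asarray(x, dtype=float))
    return dictionary

# Hypothetical data stream: near-duplicates should be rejected.
dictionary = []
stream = [[0.0, 0.0], [0.01, 0.0], [3.0, 3.0], [0.0, 0.02], [3.0, 2.99]]
for x in stream:
    update_dictionary(dictionary, x)
```

    With these settings only the two well-separated points are kept, so the Gaussian process only has to be updated on a dictionary that grows with novelty rather than with the length of the stream.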

    Deep Bayesian Quadrature Policy Optimization

    We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample-efficient alternatives like Bayesian quadrature methods are less scalable due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with a significantly lower variance, (ii) a consistent improvement in the sample complexity and average return for several deep policy gradient algorithms, and (iii) the uncertainty in gradient estimation, which can be incorporated to further improve the performance.
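
    To make the contrast between Monte-Carlo estimation and Bayesian quadrature concrete, here is a one-dimensional toy sketch. It is not DBQPG: it assumes an RBF kernel, a standard normal integration measure (for which the kernel mean has a closed form), and a hypothetical integrand `f(x) = x**2` whose true integral is 1.

```python
import numpy as np

def rbf(a, b, lengthscale):
    """RBF kernel matrix between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def bayesian_quadrature(f, nodes, lengthscale=1.0, jitter=1e-10):
    """Posterior mean of E_{x ~ N(0,1)}[f(x)] under a zero-mean GP prior on f."""
    K = rbf(nodes, nodes, lengthscale) + jitter * np.eye(len(nodes))
    # Closed-form kernel mean: z_i = E_{x ~ N(0,1)}[k(x, x_i)].
    l2 = lengthscale**2
    z = np.sqrt(l2 / (l2 + 1.0)) * np.exp(-0.5 * nodes**2 / (l2 + 1.0))
    return z @ np.linalg.solve(K, f(nodes))

f = lambda x: x**2                    # hypothetical integrand, E[f] = 1
nodes = np.linspace(-4.0, 4.0, 17)    # evaluation points chosen by hand
bq_estimate = bayesian_quadrature(f, nodes)

rng = np.random.default_rng(0)
mc_estimate = np.mean(f(rng.normal(size=len(nodes))))  # same evaluation budget
```

    The linear solve is cubic in the number of evaluation points, which is the scalability issue the abstract refers to.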

    Machine Learning through Exploration for Perception-Driven Robotics

    The ability of robots to perform tasks in human environments has largely been limited to rather simple and specific tasks, such as lawn mowing and vacuum cleaning. As such, current robots are far away from the robot butlers, assistants, and housekeepers that are depicted in science fiction movies. Part of this gap can be explained by the fact that human environments are hugely varied, complex and unstructured. The homes that a domestic robot might end up in are one example. Since every home has a different layout with different objects and furniture, it is impossible for a human designer to anticipate all challenges a robot might face, and equip the robot a priori with all the necessary perceptual and manipulation skills. Instead, robots could be programmed in a way that allows them to adapt to any environment that they are in. In that case, the robot designer would not need to precisely anticipate such environments. The ability to adapt can be provided by robot learning techniques, which can be applied to learn skills for perception and manipulation. Many of the current robot learning techniques, however, rely on human supervisors to provide annotations or demonstrations, and to fine-tune the methods' parameters and heuristics. As such, it can require a significant amount of human time investment to make a robot perform a task in a novel environment, even if statistical learning techniques are used. In this thesis, I focus on another way of obtaining the data a robot needs to learn about the environment and how to successfully perform skills in it. By exploring the environment using its own sensors and actuators, rather than passively waiting for annotations or demonstrations, a robot can obtain this data by itself. I investigate multiple approaches that allow a robot to explore its environment autonomously, while trying to minimize the design effort required to deploy such algorithms in different situations. 
First, I consider an unsupervised robot with minimal prior knowledge about its environment. It can only learn through observed sensory feedback obtained through interactive exploration of its environment. In a bottom-up, probabilistic approach, the robot tries to segment the objects in its environment through clustering with minimal prior knowledge. This clustering is based on static visual scene features and observed movement. Information-theoretic principles are used to autonomously select actions that maximize the expected information gain, and thus learning speed. Our evaluations on a real robot system equipped with an on-board camera show that the proposed method handles noisy inputs better than previous methods, and that action selection according to the information gain criterion does increase the learning speed. Often, however, the goal of a robot is not just to learn the structure of the environment, but to learn how to perform a task encoded by a reward signal. In addition to the weak feedback provided by reward signals, the robot has access to rich sensory data that, even for simple tasks, is often non-linear and high-dimensional. Sensory data can be leveraged to learn a system model, but in high-dimensional sensory spaces this step often requires manually designing features. I propose a robot reinforcement learning algorithm with learned non-parametric models, value functions, and policies that can deal with high-dimensional state representations. As such, the proposed algorithm is well-suited to deal with high-dimensional signals such as camera images. To prevent the robot from converging prematurely to a sub-optimal solution, the information loss of policy updates is limited. This constraint makes sure the robot keeps exploring the effects of its behavior on the environment. 
The experiments show that the proposed non-parametric relative entropy policy search algorithm performs better than prior methods that either do not employ bounded updates, or that try to cover the state-space with general-purpose radial basis functions. Furthermore, the method is validated on a real-robot setup with high-dimensional camera image inputs. One problem with typical exploration strategies is that the behavior is perturbed independently in each time step, for example through selecting a random action or random policy parameters. As such, the resulting exploration behavior might be incoherent. Incoherence causes inefficient random walk behavior, makes the system less robust, and causes wear and tear on the robot. A typical solution is to perturb the policy parameters directly, and use the same perturbation for an entire episode. However, this strategy tends to increase the number of episodes needed, since only a single perturbation can be evaluated per episode. I introduce a strategy that can make a more balanced trade-off between the advantages of these two approaches. The experiments show that intermediate trade-offs, rather than independent or episode-based exploration, are beneficial across different tasks and learning algorithms. This thesis thus addresses how robots can learn autonomously by exploring the world through unsupervised learning and reinforcement learning. Throughout the thesis, new approaches and algorithms are introduced: a probabilistic interactive segmentation approach, the non-parametric relative entropy policy search algorithm, and a framework for generalized exploration. To allow the learning algorithms to be applied in different and unknown environments, the design effort and supervision required from human designers or users is minimized. These approaches and algorithms contribute towards the capability of robots to autonomously learn useful skills in human environments in a practical manner.
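
    The trade-off between per-step and per-episode perturbation described above can be pictured with a simple temporally correlated noise process. The sketch below is only an assumed stand-in, not the thesis's generalized exploration framework: it uses an AR(1) process on the parameter perturbation, where a hypothetical coefficient `beta` interpolates between independent noise (`beta = 0`) and a single perturbation held for the whole episode (`beta = 1`).

```python
import numpy as np

def correlated_perturbations(n_steps, n_params, beta, sigma=1.0, rng=None):
    """AR(1) perturbations whose marginal standard deviation stays at sigma."""
    rng = np.random.default_rng() if rng is None else rng
    eps = np.empty((n_steps, n_params))
    eps[0] = sigma * rng.normal(size=n_params)
    for t in range(1, n_steps):
        fresh = sigma * rng.normal(size=n_params)
        # Mix the previous perturbation with fresh noise; the square-root
        # factor keeps the variance constant over time.
        eps[t] = beta * eps[t - 1] + np.sqrt(1.0 - beta**2) * fresh
    return eps

rng = np.random.default_rng(0)
eps_step = correlated_perturbations(100, 3, beta=0.0, rng=rng)     # independent per step
eps_mixed = correlated_perturbations(100, 3, beta=0.9, rng=rng)    # intermediate
eps_episode = correlated_perturbations(100, 3, beta=1.0, rng=rng)  # fixed per episode
```

    Each row would be added to the policy parameters at the corresponding time step; intermediate values of `beta` give smoother trajectories than per-step noise while still trying more than one perturbation per episode.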

    Learning Terrain Dynamics: A Gaussian Process Modeling and Optimal Control Adaptation Framework Applied to Robotic Jumping

    The complex dynamics characterizing deformable terrain presents significant impediments toward the real-world viability of locomotive robotics, particularly for legged machines. We explore vertical, robotic jumping as a model task for legged locomotion on presumed-uncharacterized, nonrigid terrain. By integrating Gaussian process (GP)-based regression and evaluation to estimate ground reaction forces as a function of the state, a 1-D jumper acquires the capability to learn forcing profiles exerted by its environment in tandem with achieving its control objective. The GP-based dynamical model initially assumes a baseline rigid, noncompliant surface. As part of an iterative procedure, the optimizer employing this model generates an optimal control strategy to achieve a target jump height. Experiential data recovered from execution on the true surface model are applied to train the GP, in turn, providing the optimizer a more richly informed dynamical model of the environment. The iterative control-learning procedure was rigorously evaluated in experiment, over different surface types, whereby a robotic hopper was challenged to jump to several different target heights. Each task was achieved within ten attempts, over which the terrain's dynamics were learned. With each iteration, GP predictions of ground forcing became incrementally refined, rapidly matching experimental force measurements. The few-iteration convergence demonstrates a fundamental capacity to both estimate and adapt to unknown terrain dynamics in application-realistic time scales, all with control tools amenable to robotic legged locomotion
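
    The role of the Gaussian process in this loop can be sketched as a regression from state to ground reaction force, with the rigid-ground assumption as prior mean. Everything specific below is an assumption made for illustration: the two-dimensional state (penetration depth, velocity), the RBF kernel and its hyperparameters, and the synthetic spring-damper "measurements".

```python
import numpy as np

def rbf(a, b, lengthscale=0.5, variance=1.0):
    """RBF kernel matrix between two sets of row vectors."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def rigid_baseline(states):
    """Hypothetical prior mean: no correction to the rigid-ground model."""
    return np.zeros(len(states))

def gp_force_posterior(train_states, train_forces, query_states, noise=1e-4):
    """Posterior mean and variance of the force at the query states."""
    K = rbf(train_states, train_states) + noise * np.eye(len(train_states))
    K_star = rbf(query_states, train_states)
    alpha = np.linalg.solve(K, train_forces - rigid_baseline(train_states))
    mean = rigid_baseline(query_states) + K_star @ alpha
    cov = rbf(query_states, query_states) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# Synthetic data from one "jump attempt" on a 3 x 3 grid of states.
grid = np.linspace(0.0, 1.0, 3)
states = np.array([[d, v] for d in grid for v in grid])
forces = 3.0 * states[:, 0] + 0.5 * states[:, 1]   # made-up compliant terrain

mean_seen, var_seen = gp_force_posterior(states, forces, states)
mean_far, var_far = gp_force_posterior(states, forces, np.array([[10.0, 10.0]]))
```

    Near visited states the prediction follows the data with low variance, while far from them it falls back to the baseline with prior variance. In an iterative scheme, each new attempt would append its measurements to `states` and `forces` before the optimizer is run again.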

    Using Mean Embeddings for State Estimation and Reinforcement Learning

    To act in complex, high-dimensional environments, autonomous systems require versatile state estimation techniques and compact state representations. State estimation is crucial when the system only has access to stochastic measurements or partial observations. Furthermore, in combination with models of the system, such techniques make it possible to predict the future, which enables the system to assess the outcome of possible decisions. Compact state representations alleviate the curse of dimensionality by distilling the important information from high-dimensional observations. Due to noisy sensory information and imperfect models of the system, estimates of the state never reflect the true state perfectly but are always subject to errors. The natural choice to incorporate the uncertainty about the state estimate is to use a probability distribution as representation. This results in the so-called belief state. High-dimensional observations, for example images, often contain much less information than conveyed by their dimensionality. But even if all the information is necessary to describe the state of the system (for example, the state of a swarm with the positions of all agents), a less complex description might be a sufficient representation. In such situations, finding the generative distribution that explains the state would give a much more compact yet informative representation. Traditionally, parametric distributions have been used as state representations, most prevalently the Gaussian distribution. However, in many cases a unimodal distribution might not be sufficient to represent the belief state. Using multi-modal probability distributions instead requires more advanced approaches such as mixture models or particle-based Monte Carlo methods. Learning mixture models is, however, not straightforward and often results in locally optimal solutions. 
Similarly, maintaining a good population of particles during inference is a complicated and cumbersome process. A third approach is kernel density estimation, which is located at the intersection of mixture models and particle-based approaches. Still, performing inference with any of these approaches requires heuristics that lead to poor performance and limited scalability to higher-dimensional spaces. A recent technique that alleviates this problem is the embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS). Conditional distributions can be embedded as operators, based on which a framework for inference has been presented that allows applying the sum rule, the product rule and Bayes’ rule entirely in Hilbert space. Using sample-based estimators together with the kernel trick and the representer theorem allows these operations to be represented as vector-matrix manipulations. The contributions of this thesis are based on or inspired by the embeddings of distributions into reproducing kernel Hilbert spaces. In the first part of this thesis, I propose additions to the framework for nonparametric inference that allow the inference operators to scale more gracefully with the number of samples in the training set. The first contribution is an alternative approach to the conditional embedding operator formulated as a least-squares problem, which allows using only a subset of the data as representation while using the full data set to learn the conditional operator. I call this operator the subspace conditional embedding operator. Inspired by the least-squares derivations of the Kalman filter, I furthermore propose an alternative operator for Bayesian updates in Hilbert space, the kernel Kalman rule. This alternative approach is numerically more robust than the kernel Bayes rule presented in the framework for non-parametric inference and scales better with the number of samples. 
Based on the kernel Kalman rule, I derive the kernel Kalman filter and the kernel forward-backward smoother to perform state estimation, prediction and smoothing based on Hilbert space embeddings of the belief state. This representation is able to capture multi-modal distributions, and inference reduces, thanks to the kernel trick, to simple matrix manipulations. In the second part of this thesis, I propose a representation for large sets of homogeneous observations. Specifically, I consider the problem of learning a controller for object assembly and object manipulation with a robotic swarm. I assume a swarm of homogeneous robots that are controlled by a common input signal, e.g., the gradient of a light source or a magnetic field. Learning policies for swarms is a challenging problem since the state space grows with the number of agents and quickly becomes very high-dimensional. Furthermore, the exact number of agents and the order of the agents in the observation are not important for solving the task. To approach this issue, I propose the swarm kernel, which uses a Hilbert space embedding to represent the swarm. Instead of the exact positions of the agents in the swarm, the embedding estimates the generative distribution behind the swarm configuration. The specific agent positions are regarded as samples of this distribution. Since the swarm kernel compares the embeddings of distributions, it can compare swarm configurations with varying numbers of individuals and is invariant to the permutation of the agents. I present a hierarchical approach for solving the object manipulation task where I assume a high-level object assembly policy as given. To learn the low-level object pushing policy, I use the swarm kernel with an actor-critic policy search method. The policies which I learn in simulation can be directly transferred to a real robotic system. 
In the last part of this thesis, I investigate how the idea of kernel mean embeddings can be applied to deep reinforcement learning. As in the previous part, I consider a variable number of homogeneous observations, such as robot swarms where the number of agents can change. Another example is the representation of 3D structures as point clouds. The number of points in such clouds can vary strongly and the order of the points in a vectorized representation is arbitrary. The common architectures for neural networks have a fixed structure that requires the dimensionality of inputs and outputs to be known in advance. A variable number of inputs can only be processed by applying tricks. To approach this problem, I propose the deep M-embeddings, which are inspired by the kernel mean embeddings. The deep M-embeddings provide a network structure to compute a fixed-length representation from a variable number of inputs. Additionally, the deep M-embeddings exploit the homogeneous nature of the inputs to reduce the number of parameters in the network and, thus, make the learning easier. Similar to the swarm kernel, the policies learned with the deep M-embeddings can be transferred to different swarm sizes and different numbers of objects in the environment without further learning.
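
    The permutation and size invariance claimed for a kernel on swarm configurations follows directly from comparing empirical mean embeddings. The sketch below is an assumed construction, not necessarily the thesis's swarm kernel: it embeds each set of agent positions with an RBF kernel and applies a Gaussian kernel to the squared distance between the two embeddings (the squared MMD). The function names and bandwidths are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """RBF kernel matrix between two sets of row vectors."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def embedding_inner_product(A, B, lengthscale=1.0):
    """<mu_A, mu_B> for the empirical mean embeddings of point sets A and B."""
    return rbf(A, B, lengthscale).mean()

def set_kernel(A, B, lengthscale=1.0, bandwidth=1.0):
    """Gaussian kernel on the squared distance between the two embeddings."""
    mmd2 = (embedding_inner_product(A, A, lengthscale)
            - 2.0 * embedding_inner_product(A, B, lengthscale)
            + embedding_inner_product(B, B, lengthscale))
    return np.exp(-0.5 * max(mmd2, 0.0) / bandwidth**2)

rng = np.random.default_rng(0)
swarm_a = rng.normal(size=(5, 2))          # 5 agents in the plane
swarm_b = rng.normal(size=(8, 2)) + 1.0    # 8 agents, shifted
shuffled_a = swarm_a[rng.permutation(len(swarm_a))]

k_ab = set_kernel(swarm_a, swarm_b)
k_ab_shuffled = set_kernel(shuffled_a, swarm_b)   # same value up to rounding
```

    Because every quantity is an average over agents, reordering the rows or changing the number of agents does not change the shape of the computation.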

    Two-Phase Iteration for Value Function Approximation and Hyperparameter Optimization in Gaussian-Kernel-Based Adaptive Critic Design

    Adaptive Dynamic Programming (ADP) with critic-actor architecture is an effective way to perform online learning control. To avoid the subjectivity in the design of a neural network that serves as a critic network, kernel-based adaptive critic design (ACD) was developed recently. There are two essential issues for a static kernel-based model: how to determine proper hyperparameters in advance and how to select the right samples to describe the value function. Both rely on the assessment of sample values. Based on theoretical analysis, this paper presents a two-phase simultaneous learning method for a Gaussian-kernel-based critic network. It is able to estimate the values of samples without revisiting them infinitely often, and the hyperparameters of the kernel model are optimized simultaneously. Based on the estimated sample values, the sample set can be refined by adding alternatives or deleting redundant samples. Combining this critic design with an actor network, we present a Gaussian-kernel-based Adaptive Dynamic Programming (GK-ADP) approach. Simulations are used to verify its feasibility, particularly the necessity of two-phase learning, the convergence characteristics, and the improvement of the system performance by using a varying sample set.

    Modelling transition dynamics in MDPs with RKHS embeddings

    We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
    Comment: ICML201
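
    The statement that expectations become inner products can be illustrated with the standard sample-based estimator of a conditional mean embedding. The sketch below is a simplified assumption-laden example rather than the paper's implementation: it conditions on a one-dimensional state only (no actions), uses an RBF kernel with a hand-picked regulariser, and takes a hypothetical value function `V(s) = s**2` on made-up linear-Gaussian dynamics.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """RBF kernel matrix between two 1-D arrays of states."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def embedding_weights(train_inputs, query_inputs, reg=1e-3):
    """Weights alpha(s) = (K + n * reg * I)^{-1} k(s) of the embedding estimate."""
    n = len(train_inputs)
    K = rbf(train_inputs, train_inputs)
    k = rbf(train_inputs, query_inputs)
    return np.linalg.solve(K + n * reg * np.eye(n), k)

def expected_value(weights, values_at_next_states):
    """E[V(s') | s] as a weighted sum over observed next states (linear in n)."""
    return weights.T @ values_at_next_states

# Made-up transition samples: s' = 0.9 s + noise.
rng = np.random.default_rng(0)
n = 200
s = rng.uniform(-1.0, 1.0, size=n)
s_next = 0.9 * s + 0.05 * rng.normal(size=n)
value = lambda x: x**2   # hypothetical value function

alpha = embedding_weights(s, np.array([0.5]))
estimate = expected_value(alpha, value(s_next))[0]
reference = (0.9 * 0.5) ** 2 + 0.05**2   # analytic E[V(s') | s = 0.5]
```

    Once the weights are computed, evaluating the expectation of any new function of the next state only needs that function's values at the observed next states, with no density estimate or integral.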
