    Regularized approximate policy iteration using kernel for on-line reinforcement learning

    Using Reinforcement Learning (RL), an autonomous agent interacting with its environment can learn to take adequate actions in every situation in order to achieve its goal optimally. RL provides a general methodology for solving the uncertain and complex decision problems that arise in many real-world applications. RL problems are usually modeled as Markov Decision Processes (MDPs), which have been studied extensively in the literature. The distinguishing feature of an RL algorithm is that the agent is assumed to learn optimal policies from its experience, without knowing the parameters of the MDP. The key element in solving the MDP is learning a value function, which gives the expected total reward an agent can obtain by taking a given action in its current state. The optimal policy can then be derived from this value function. In this thesis we study the capacity of Support Vector Regression (SVR) with kernel methods to adapt to and solve complex RL problems in large or continuous state spaces. SVR can be studied through a geometrical interpretation in terms of the optimal margin, or it can be seen as a regularization problem posed in a Reproducing Kernel Hilbert Space (RKHS). SVR has good generalization properties and, because it is based on a convex optimization problem, it does not suffer from sub-optimality. SVR is non-parametric and adapts automatically to the complexity of the problem. Accordingly, applying SVR to approximate value functions appears to be a good approach. SVR can be solved in batch mode, when the whole set of training samples is available to the learning agent, or incrementally, which allows training samples to be added or removed very efficiently. Incremental SVR finds the appropriate KKT conditions for new or updated data by modifying their influence on the regression function, while keeping the KKT conditions consistent for the rest of the data used for learning.
In RL problems, an incremental SVR should be able to approximate the action-value function that leads to the optimal policy. Accordingly, the computational load should be lower, learning faster, and generalization more effective than with other existing methods. The overall contribution of our work is to develop, formalize, implement and study a new RL technique for generalization in discrete and continuous state spaces with finite actions. Our method uses the Approximate Policy Iteration (API) framework with the Bellman Residual Minimization (BRM) criterion, which allows the action-value function to be represented using SVR. To our knowledge, this is the first RL approach using SVR that is compatible with the agent-interacting-with-the-environment framework of RL. It demonstrates its power by solving a large number of benchmark problems, including very difficult ones such as the bicycle driving and riding control problem. In addition, unlike most RL approaches to generalization, we develop a proof that establishes theoretical bounds for the convergence of the method to the optimal solution under given conditions. Postprint (published version)
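As a rough illustration of the two ingredients named in the abstract above, the standard epsilon-insensitive SVR objective in an RKHS and the Bellman residual of an action-value function can be written as follows. The symbols used here (C, \varepsilon, \gamma, the transition tuple (s_i, a_i, r_i, s'_i) and the policy \pi) are assumptions chosen for illustration; the thesis's own formulation may differ.

```latex
% Standard SVR viewed as a regularization problem in an RKHS \mathcal{H}
\min_{f \in \mathcal{H}} \; \frac{1}{2}\,\|f\|_{\mathcal{H}}^{2}
  + C \sum_{i=1}^{n} \max\bigl(0,\; |y_i - f(x_i)| - \varepsilon\bigr)

% Bellman residual of Q on one observed transition under policy \pi
\delta_i = Q(s_i, a_i) - \bigl(r_i + \gamma\, Q(s'_i, \pi(s'_i))\bigr)
```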

    Maximum Moment Restriction for Instrumental Variable Regression

    We propose a simple framework for nonlinear instrumental variable (IV) regression based on a kernelized conditional moment restriction (CMR) known as a maximum moment restriction (MMR). The MMR is formulated by maximizing the interaction between the residual and the instruments belonging to a unit ball in a reproducing kernel Hilbert space (RKHS). The MMR allows us to reformulate IV regression as a single-step empirical risk minimization problem, where the risk depends on the reproducing kernel on the instrument and can be estimated by a U-statistic or V-statistic. This simplification not only eases the proofs of consistency and asymptotic normality in both parametric and non-parametric settings, but also results in easy-to-use algorithms with an efficient hyper-parameter selection procedure. We demonstrate the advantages of our framework over existing ones using experiments on both synthetic and real-world data. Comment: 34 pages
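The single-step risk minimization described above can be sketched for the simplest possible case. The example below assumes a scalar linear model f(x) = theta * x, a Gaussian kernel on the instrument, and a synthetic data-generating process with a hidden confounder u; all of these, along with the function names, are assumptions made for illustration and are not taken from the paper. For this model the V-statistic risk (1/n^2) * sum_ij (y_i - theta x_i) k(z_i, z_j) (y_j - theta x_j) has the closed-form minimizer theta = (x' K y) / (x' K x).

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalar instruments."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmr_iv_linear(x, y, z, bandwidth=1.0):
    """Closed-form minimizer of the V-statistic MMR risk for f(x) = theta * x."""
    n = len(x)
    num = 0.0  # accumulates x' K y
    den = 0.0  # accumulates x' K x
    for i in range(n):
        for j in range(n):
            k = gaussian_kernel(z[i], z[j], bandwidth)
            num += x[i] * k * y[j]
            den += x[i] * k * x[j]
    return num / den


def ols_through_origin(x, y):
    """Ordinary least squares without intercept, for comparison."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical synthetic data: u confounds x and y, z is a valid instrument.
rng = random.Random(0)
n = 400
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [zi + ui for zi, ui in zip(z, u)]
y = [2.0 * xi + 2.0 * ui for xi, ui in zip(x, u)]  # true effect is 2.0

theta_ols = ols_through_origin(x, y)  # biased upward by the confounder
theta_mmr = mmr_iv_linear(x, y, z)    # should land much closer to 2.0
print(theta_ols, theta_mmr)
```

In this toy setting ordinary least squares absorbs the confounder into its slope, while weighting residual products by the instrument kernel removes most of that bias. Nonlinear models would replace the closed form with a generic optimizer over the same risk.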

    Methods for Optimization and Regularization of Generative Models

    This thesis studies the problem of regularizing and optimizing generative models, often using insights and techniques from kernel methods. The work proceeds in three main themes. Conditional score estimation. We propose a method for estimating conditional densities based on a rich class of RKHS exponential family models. The algorithm works by solving a convex quadratic problem for fitting the gradient of the log density, the score, thus avoiding the need for estimating the normalizing constant. We show the resulting estimator to be consistent and provide convergence rates when the model is well-specified. Structuring and regularizing implicit generative models. In a first contribution, we introduce a method for learning Generative Adversarial Networks, a class of Implicit Generative Models, using a parametric family of Maximum Mean Discrepancies (MMD). We show that controlling the gradient of the critic function defining the MMD is vital for having a sensible loss function. Moreover, we devise a method to enforce exact, analytical gradient constraints. As a second contribution, we introduce and study a new generative model suited for data with low intrinsic dimension embedded in a high dimensional space. This model combines two components: an implicit model, which can learn the low-dimensional support of data, and an energy function, to refine the probability mass by importance sampling on the support of the implicit model. We further introduce algorithms for learning such a hybrid model and for efficient sampling. Optimizing implicit generative models. We first study the Wasserstein gradient flow of the Maximum Mean Discrepancy in a non-parametric setting and provide smoothness conditions on the trajectory of the flow to ensure global convergence. We identify cases when this condition does not hold and propose a new algorithm based on noise injection to mitigate this problem. 
In a second contribution, we consider the Wasserstein gradient flow of generic loss functionals in a parametric setting. This flow is invariant to the model's parameterization, just like the Fisher gradient flows in information geometry. It has the additional benefit of being well defined even for models with varying supports, which makes it particularly well suited to implicit generative models. We then introduce a general framework for approximating the Wasserstein natural gradient by leveraging a dual formulation of the Wasserstein pseudo-Riemannian metric that we restrict to a Reproducing Kernel Hilbert Space. The resulting estimator is scalable and provably consistent, as it relies on Nystrom methods.
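Since the Maximum Mean Discrepancy is central to several of the contributions above, a minimal sketch of its basic sample estimate may help. The code below computes the biased (V-statistic) estimate of the squared MMD between two one-dimensional samples with a fixed Gaussian kernel. The kernel choice, bandwidth, sample sizes and function names are assumptions for illustration; the thesis's parametric MMD families, gradient constraints and gradient flows are not reproduced here.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: E k(x,x') + E k(y,y') - 2 E k(x,y)."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


rng = random.Random(0)
p = [rng.gauss(0.0, 1.0) for _ in range(200)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(200)]     # same distribution as p
q_shifted = [rng.gauss(3.0, 1.0) for _ in range(200)]  # mean shifted by 3

mmd_same = mmd2_biased(p, q_same)
mmd_shifted = mmd2_biased(p, q_shifted)
print(mmd_same, mmd_shifted)
```

The estimate is close to zero when both samples come from the same distribution and clearly positive when they differ, which is what makes it usable as a training loss for implicit models.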

    Stochastic Optimization For Multi-Agent Statistical Learning And Control

    The goal of this thesis is to develop a mathematical framework for optimal, accurate, and affordable-complexity statistical learning among networks of autonomous agents. We begin by noting the connection between statistical inference and stochastic programming, and consider extensions of this setup to settings in which each agent in a network observes a local data stream and would like to make decisions that are good with respect to information aggregated across the entire network. There is an open-ended degree of freedom in this problem formulation, however: the selection of the estimator function class, which defines the feasible set of the stochastic program. Our central contribution is the design of stochastic optimization tools in reproducing kernel Hilbert spaces that yield optimal, accurate, and affordable-complexity statistical learning for a multi-agent network. To obtain this result, we first explore the relative merits and drawbacks of different function class selections. In Part I, we consider multi-agent expected risk minimization for the case in which each agent seeks to learn a common globally optimal generalized linear model (GLM), and we develop a stochastic variant of the Arrow-Hurwicz primal-dual method. We establish convergence to the primal-dual optimal pair when either consensus or "proximity" constraints encode the fact that we want all agents to agree, or nearby agents to make decisions that are close to one another. Empirically, we observe that these convergence results are substantiated but that convergence may not translate into statistical accuracy. More broadly, an estimator that is optimal within a given function class is not necessarily one that makes minimal inference errors. The optimality-accuracy tradeoff of GLMs motivates subsequent efforts to learn more sophisticated estimators based upon learned feature encodings of the data that is fed into the statistical model.
The specific tool we turn to in Part II is dictionary learning, where we optimize over both regression weights and an encoding of the data, which yields a non-convex problem. We investigate the use of stochastic methods for online task-driven dictionary learning, and obtain promising performance for the task of a ground robot learning to anticipate control uncertainty based on its past experience. Heartened by this implementation, we then consider extensions of this framework in which each agent of a multi-agent network learns globally optimal task-driven dictionaries based on stochastic primal-dual methods. However, it is here that the non-convexity of the optimization problem causes difficulties: stringent conditions on stochastic errors and the duality gap limit the applicability of the convergence guarantees, and impractically small learning rates are required for convergence in practice. Thus, we seek to learn nonlinear statistical models while preserving convexity, which is possible through kernel methods (Part III). However, the increased descriptive power of nonparametric estimation comes at the cost of infinite complexity. Thus, we develop a stochastic approximation algorithm in reproducing kernel Hilbert spaces (RKHS) that ameliorates this complexity issue while preserving optimality: we combine the functional generalization of the stochastic gradient method (FSGD) with greedily constructed low-dimensional subspace projections based on matching pursuit. We establish that the proposed method yields a controllable trade-off between optimality and memory, and yields highly accurate parsimonious statistical models in practice. Then, we develop a multi-agent extension of this method by proposing a new node-separable penalty function and applying FSGD together with low-dimensional subspace projections.
This extension allows a network of autonomous agents to learn a memory-efficient approximation to the globally optimal regression function based only on their local data streams and message passing with neighbors. In practice, we observe that agents are able to stably learn highly accurate and memory-efficient nonlinear statistical models from streaming data. From here, we shift focus to a more challenging class of problems, motivated by the fact that true learning is not just revising predictions based upon data but augmenting behavior over time based on temporal incentives. This goal may be described by Markov Decision Processes (MDPs): at each point, an agent is in some state of the world, takes an action, and then receives a reward while randomly transitioning to a new state. The goal of the agent is to select the action sequence that maximizes its long-term sum of rewards, but determining how to select this action sequence when both the state and action spaces are infinite has eluded researchers for decades. As a precursor to this feat, we consider the problem of policy evaluation in infinite MDPs, in which we seek to determine the long-term sum of rewards obtained from a given starting state when actions are chosen according to a fixed distribution called a policy. We reformulate this problem as an RKHS-valued compositional stochastic program, and we develop a functional extension of the stochastic quasi-gradient algorithm operating in tandem with the greedy subspace projections mentioned above. We prove convergence with probability 1 to the Bellman fixed point restricted to this function class, and we observe a state-of-the-art trade-off in memory versus Bellman error for the proposed method on the Mountain Car driving task. This bodes well for incorporating policy evaluation into more sophisticated, provably stable reinforcement learning techniques and, in time, for developing optimal collaborative multi-agent learning-based control systems.
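The complexity issue raised in Part III can be made concrete with a small sketch of plain functional stochastic gradient descent in an RKHS. The example assumes a squared loss with a ridge penalty, a Gaussian kernel, and a synthetic sine-regression stream; the class name, step size, regularization and bandwidth are all hypothetical choices for illustration. The matching-pursuit subspace projections described in the abstract are deliberately not implemented, so the model keeps one kernel center per sample.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=0.5):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


class KernelSGD:
    """Uncompressed functional SGD: f = sum_i w_i * k(c_i, .)."""

    def __init__(self, step=0.5, reg=1e-3, bandwidth=0.5):
        self.step = step
        self.reg = reg
        self.bandwidth = bandwidth
        self.centers = []
        self.weights = []

    def predict(self, x):
        return sum(w * gaussian_kernel(c, x, self.bandwidth)
                   for c, w in zip(self.centers, self.weights))

    def update(self, x, y):
        # Functional gradient of 0.5*(f(x)-y)^2 + 0.5*reg*||f||^2 is
        # (f(x)-y)*k(x, .) + reg*f, so the step shrinks old weights
        # and appends one new center at x.
        residual = self.predict(x) - y
        shrink = 1.0 - self.step * self.reg
        self.weights = [shrink * w for w in self.weights]
        self.centers.append(x)
        self.weights.append(-self.step * residual)


rng = random.Random(0)
model = KernelSGD()
num_samples = 500
for _ in range(num_samples):
    x = rng.uniform(-3.0, 3.0)
    model.update(x, math.sin(x))  # hypothetical noise-free target

grid = [-3.0 + 0.1 * i for i in range(61)]
mse = sum((model.predict(x) - math.sin(x)) ** 2 for x in grid) / len(grid)
print(len(model.centers), mse)
```

The number of stored centers equals the number of samples processed, which is the unbounded memory growth that the greedy subspace projections are meant to control.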

    Sparse Learning for Variable Selection with Structures and Nonlinearities

    In this thesis we discuss machine learning methods that perform automated variable selection for learning sparse predictive models. There are multiple reasons for promoting sparsity in predictive models. By relying on a limited set of input variables, the models naturally counteract the overfitting problem that is ubiquitous in learning from finite sets of training points. Sparse models are cheaper to use for prediction: they usually require fewer computational resources and, by relying on smaller sets of inputs, can reduce the costs of data collection and storage. Sparse models can also contribute to a better understanding of the investigated phenomena, as they are easier to interpret than full models. Comment: PhD thesis

