
    Geometric Insights into the Convergence of Nonlinear TD Learning

    While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence.
    Comment: ICLR 202
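
    As a concrete reference point, here is a minimal sketch of the semi-gradient TD(0) update with a small ReLU value network on a randomly generated Markov chain. The chain, network size, discount factor, and step size are illustrative choices not taken from the paper; the code only illustrates the update whose small-step-size limit gives the ODE studied.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative 5-state Markov chain with random expected rewards (not from the paper).
    n_states, gamma, alpha = 5, 0.9, 0.01
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # row-stochastic transition matrix
    r = rng.normal(size=n_states)                          # expected reward per state

    # Tiny ReLU network value approximator V(s; theta) on one-hot state features.
    h = 16
    W1 = rng.normal(scale=0.5, size=(h, n_states))
    w2 = rng.normal(scale=0.5, size=h)

    def value(s):
        x = np.eye(n_states)[s]
        return w2 @ np.maximum(W1 @ x, 0.0)

    def grad_value(s):
        x = np.eye(n_states)[s]
        z = W1 @ x
        mask = (z > 0).astype(float)
        gW1 = np.outer(w2 * mask, x)       # dV/dW1
        gw2 = np.maximum(z, 0.0)           # dV/dw2
        return gW1, gw2

    # Semi-gradient TD(0): theta += alpha * delta * grad V(s), delta = r + gamma*V(s') - V(s).
    s = 0
    for t in range(20000):
        s_next = rng.choice(n_states, p=P[s])
        delta = r[s] + gamma * value(s_next) - value(s)
        gW1, gw2 = grad_value(s)
        W1 += alpha * delta * gW1
        w2 += alpha * delta * gw2
        s = s_next

    print("learned values:", [round(value(s), 3) for s in range(n_states)])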

    Sparse Inverse Problems Over Measures: Equivalence of the Conditional Gradient and Exchange Methods

    We study an optimization program over nonnegative Borel measures that encourages sparsity in its solution. Efficient solvers for this program are in increasing demand, as it arises when learning from data generated by a 'continuum-of-subspaces' model, a recent trend with applications in signal processing, machine learning, and high-dimensional statistics. We prove that the conditional gradient method (CGM, also known as the Frank-Wolfe algorithm) applied to this infinite-dimensional program, as proposed recently in the literature, is equivalent to the exchange method (EM) applied to its Lagrangian dual, which is a semi-infinite program. In doing so, we formally connect such infinite-dimensional programs to the well-established field of semi-infinite programming. On the one hand, the equivalence established in this paper allows us to provide a rate of convergence for EM which is more general than those existing in the literature. On the other hand, this connection and the resulting geometric insights might in the future lead to the design of improved variants of CGM for infinite-dimensional programs, which has been an active research topic.
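
    As an illustration of the conditional gradient idea over measures, the sketch below greedily adds the atom most correlated with the residual and then refits nonnegative weights, on a toy 1-D spike-deconvolution problem with Gaussian atoms. The grid discretization, the clipped least-squares refit, and all problem parameters are simplifying assumptions, not the paper's formulation.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy sparse inverse problem over measures on [0, 1]: observe y = sum_k a_k * phi(t_k)
    # through Gaussian bumps phi. Everything here is an illustrative stand-in.
    m = 60                                  # number of measurements
    grid = np.linspace(0, 1, 400)           # candidate atom locations (discretized continuum)
    samples = np.linspace(0, 1, m)

    def atom(t):
        return np.exp(-0.5 * ((samples - t) / 0.05) ** 2)

    true_locs, true_amps = [0.2, 0.55, 0.8], [1.0, 0.7, 1.3]
    y = sum(a * atom(t) for t, a in zip(true_locs, true_amps))

    locs, amps = [], np.zeros(0)
    for it in range(10):
        if locs:
            A = np.stack([atom(t) for t in locs], axis=1)
            residual = y - A @ amps
        else:
            residual = y.copy()
        # Linear minimization oracle: pick the atom most correlated with the residual
        # over a fine grid standing in for the continuum of candidate locations.
        scores = np.array([atom(t) @ residual for t in grid])
        locs.append(float(grid[int(np.argmax(scores))]))
        # Fully corrective step: refit weights on the active atoms; clipping to zero is
        # a crude stand-in for the exact nonnegative least-squares subproblem.
        A = np.stack([atom(t) for t in locs], axis=1)
        amps = np.clip(np.linalg.lstsq(A, y, rcond=None)[0], 0.0, None)

    print("recovered atom locations:", np.round(sorted(locs), 3))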

    Stochastic Gradient Based Extreme Learning Machines For Online Learning of Advanced Combustion Engines

    In this article, a stochastic gradient based online learning algorithm for Extreme Learning Machines (ELM) is developed (SG-ELM). A stability criterion based on a Lyapunov approach is used to prove both asymptotic stability of the estimation error and stability of the estimated parameters, making the method suitable for identification of nonlinear dynamic systems. The developed algorithm not only guarantees stability, but also reduces the computational demand compared to the OS-ELM approach based on recursive least squares. In order to demonstrate the effectiveness of the algorithm in a real-world scenario, an advanced combustion engine identification problem is considered. The algorithm is applied to two case studies: an online regression learning task for system identification of a Homogeneous Charge Compression Ignition (HCCI) engine, and an online classification learning task (with class imbalance) for identifying the dynamic operating envelope of the HCCI engine. The results indicate that the accuracy of the proposed SG-ELM is comparable to that of the state of the art, while adding stability and a reduction in computational effort.
    Comment: This paper was written as an extract from my PhD thesis (July 2013), and so references may not be up to date as of this submission (Jan 2015). The article is in review and contains 10 figures and 35 references.
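
    A minimal sketch of the ELM structure the abstract refers to: a fixed random hidden layer with only the output weights trained, here by one stochastic-gradient step per sample. The synthetic data stream, network width, and learning rate are illustrative stand-ins for the engine identification setting.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic nonlinear regression stream standing in for engine data (illustrative only).
    d_in, n_hidden = 3, 50
    W = rng.normal(size=(n_hidden, d_in))    # fixed random input weights (ELM)
    b = rng.normal(size=n_hidden)            # fixed random biases
    beta = np.zeros(n_hidden)                # output weights: the only trained parameters

    def hidden(x):
        return np.tanh(W @ x + b)            # random-feature map

    def target(x):
        return np.sin(x[0]) + 0.5 * x[1] * x[2]

    # SG-ELM-style online loop: one stochastic-gradient step on the output layer per sample.
    lr = 0.05
    for t in range(20000):
        x = rng.uniform(-1, 1, size=d_in)
        y = target(x) + 0.01 * rng.normal()
        h = hidden(x)
        err = h @ beta - y
        beta -= lr * err * h                 # gradient of 0.5 * err^2 with respect to beta

    # Quick check on a few fresh samples.
    for x in rng.uniform(-1, 1, size=(5, d_in)):
        print(round(target(x), 3), round(hidden(x) @ beta, 3))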

    A Simulation-Based Approach to Stochastic Dynamic Programming

    In this paper we develop a simulation-based approach to stochastic dynamic programming. To solve the Bellman equation, we construct Monte Carlo estimates of Q-values. Our method is scalable to high dimensions and works in both continuous and discrete state and decision spaces, whilst avoiding the discretization errors that plague traditional methods. We provide a geometric convergence rate. We illustrate our methodology with a dynamic stochastic investment problem.
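
    A minimal sketch of the core idea, assuming a toy discrete control problem: each Bellman backup uses Monte Carlo estimates of Q-values obtained from a simulator. The dynamics, rewards, and sample sizes below are invented for illustration and are unrelated to the paper's investment example.

    import numpy as np

    rng = np.random.default_rng(3)

    # Toy stochastic control problem (illustrative): states 0..4, two actions, noisy rewards.
    n_s, n_a, gamma, n_mc = 5, 2, 0.95, 200

    def simulate(s, a):
        """One noisy transition: reward and next state drawn from a simple model."""
        reward = float(a) - 0.1 * s + 0.1 * rng.normal()
        s_next = min(n_s - 1, max(0, s + (1 if a == 1 else -1) + rng.integers(-1, 2)))
        return reward, s_next

    V = np.zeros(n_s)
    for sweep in range(50):                       # approximate value iteration
        Q = np.zeros((n_s, n_a))
        for s in range(n_s):
            for a in range(n_a):
                # Monte Carlo estimate of Q(s, a) = E[r + gamma * V(s')].
                est = 0.0
                for _ in range(n_mc):
                    reward, s_next = simulate(s, a)
                    est += reward + gamma * V[s_next]
                Q[s, a] = est / n_mc
        V = Q.max(axis=1)                         # Bellman backup with estimated Q-values

    print("estimated values:", np.round(V, 2))
    print("greedy policy:   ", Q.argmax(axis=1))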

    Adaptive FISTA for Non-convex Optimization

    In this paper we propose an adaptively extrapolated proximal gradient method, which is based on the accelerated proximal gradient method (also known as FISTA); however, we locally optimize the extrapolation parameter by carrying out an exact (or inexact) line search. It turns out that in some situations, the proposed algorithm is equivalent to a class of SR1 (identity minus rank 1) proximal quasi-Newton methods. Convergence is proved in a general non-convex setting, and hence, as a byproduct, we also obtain new convergence guarantees for proximal quasi-Newton methods. The efficiency of the new method is shown in numerical experiments on a sparsity regularized non-linear inverse problem.
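
    To make the idea concrete, the sketch below runs an extrapolated proximal gradient method on a toy l1-regularized least-squares problem and picks the extrapolation parameter by a small grid search over the objective after one step, as a crude surrogate for the exact or inexact line search described in the abstract. All problem data and constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(4)

    # Toy sparse least squares: min_x 0.5*||Ax - b||^2 + lam*||x||_1 (illustrative stand-in).
    m, n, lam = 40, 100, 0.1
    A = rng.normal(size=(m, n))
    x_true = np.zeros(n)
    x_true[rng.choice(n, 5, replace=False)] = rng.normal(size=5)
    b = A @ x_true + 0.01 * rng.normal(size=m)
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part

    def soft(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def F(x):
        return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

    x_prev = x = np.zeros(n)
    for k in range(300):
        d = x - x_prev
        # Adaptive extrapolation: choose beta from a small grid by minimizing the objective
        # after one proximal gradient step from the extrapolated point.
        best_val, best_x = np.inf, x
        for beta in np.linspace(0.0, 1.0, 11):
            y = x + beta * d
            x_new = soft(y - (A.T @ (A @ y - b)) / L, lam / L)
            val = F(x_new)
            if val < best_val:
                best_val, best_x = val, x_new
        x_prev, x = x, best_x

    print("objective:", round(F(x), 4), "nonzeros:", int(np.sum(np.abs(x) > 1e-6)))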

    Think globally, fit locally under the Manifold Setup: Asymptotic Analysis of Locally Linear Embedding

    Since its introduction in 2000, locally linear embedding (LLE) has been widely applied in data science. We provide an asymptotic analysis of LLE under the manifold setup. We show that for a general manifold, we may not obtain the Laplace-Beltrami operator asymptotically, and the result may depend on the non-uniform sampling, unless a correct regularization is chosen. We also derive the corresponding kernel function, which indicates that LLE is not a Markov process. A comparison with other commonly applied nonlinear algorithms, particularly the diffusion map, is provided, and the relationship between LLE and locally linear regression is also discussed.
    Comment: 78 pages, 4 figures. We add a short discussion about the relation between epsilon and the intrinsic geometry of the manifold. We add a new section about K nearest neighbors (KNN) and a new subsection about errors in variables. We provide more numerical examples.
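
    For reference, a minimal sketch of the standard LLE pipeline on a noisy circle, including the regularization term in the local barycentric-weight computation whose choice the asymptotic analysis shows matters. The data set, neighbourhood size, and regularization constant are illustrative and do not reproduce the paper's analysis.

    import numpy as np

    rng = np.random.default_rng(5)

    # Toy data: noisy circle in the plane (illustrative; not the paper's manifold examples).
    n, k, reg = 300, 10, 1e-3
    theta = rng.uniform(0, 2 * np.pi, n)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(n, 2))

    # Standard LLE: barycentric weights from k nearest neighbours, with the regularization
    # term whose choice the asymptotic analysis identifies as critical.
    W = np.zeros((n, n))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]
        Z = X[nbrs] - X[i]                       # local coordinates
        C = Z @ Z.T + reg * np.trace(Z @ Z.T) * np.eye(k)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()

    # Embedding: bottom nontrivial eigenvector of (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    embedding = vecs[:, 1:2]                     # 1-D embedding of the circle
    print("embedding range:", round(embedding.min(), 3), round(embedding.max(), 3))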

    Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

    We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = \ldots = \Theta_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the least squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $\Phi$ satisfies $u^{\top} \Phi u > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} \Theta_L \Theta_{L-1} \cdots \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, \ldots, \Theta_L$ so that they have the same singular values.
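
    A minimal sketch of the setting, assuming a small symmetric positive definite target: gradient descent from identity initialization on the population loss $\frac{1}{2}\|\Theta_L \cdots \Theta_1 - \Phi\|_F^2$, which is the quadratic loss under isotropic inputs. Dimensions, depth, step size, and the particular $\Phi$ are illustrative.

    import numpy as np

    rng = np.random.default_rng(6)

    # Symmetric positive definite target Phi (the regime where identity initialization
    # is shown to work); dimensions and step size are illustrative.
    d, L, lr = 5, 4, 0.05
    B = rng.normal(size=(d, d))
    Phi = np.eye(d) + 0.3 * (B @ B.T) / d

    Theta = [np.eye(d) for _ in range(L)]        # identity initialization

    def product(mats):
        P = np.eye(d)
        for M in mats:
            P = M @ P                            # applies matrices left-to-right in the list
        return P

    # Gradient descent on the population loss 0.5 * ||Theta_L ... Theta_1 - Phi||_F^2.
    for step in range(2000):
        E = product(Theta) - Phi
        grads = []
        for i in range(L):
            left = product(Theta[i + 1:])        # Theta_L ... Theta_{i+2}
            right = product(Theta[:i])           # Theta_i ... Theta_1
            grads.append(left.T @ E @ right.T)   # gradient with respect to Theta_{i+1}
        for i in range(L):
            Theta[i] -= lr * grads[i]

    print("final loss:", round(0.5 * np.linalg.norm(product(Theta) - Phi) ** 2, 6))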

    Optimization Methods for Large-Scale Machine Learning

    This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research: techniques that diminish noise in the stochastic directions, and methods that make use of second-order derivative approximations.
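
    For concreteness, here is the basic stochastic gradient iteration the review centers on, applied to a toy logistic-regression problem with a diminishing step size. The data, dimensions, and step-size schedule are illustrative stand-ins for the paper's text-classification case study.

    import numpy as np

    rng = np.random.default_rng(7)

    # Toy binary logistic regression standing in for a text-classification task.
    n, d = 5000, 20
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Stochastic gradient method: w_{k+1} = w_k - alpha_k * g(w_k; i_k), one sample per step.
    w = np.zeros(d)
    for k in range(1, 50001):
        i = rng.integers(n)
        g = (sigmoid(X[i] @ w) - y[i]) * X[i]    # gradient of the single-sample logistic loss
        w -= (1.0 / np.sqrt(k)) * g              # diminishing step size

    accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
    print("training accuracy:", round(accuracy, 3))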

    Visualizing the Effects of a Changing Distance on Data Using Continuous Embeddings

    Most Machine Learning (ML) methods, from clustering to classification, rely on a distance function to describe relationships between datapoints. For complex datasets it is hard to avoid making some arbitrary choices when defining a distance function: to compare images, one must choose a spatial scale; for signals, a temporal scale. The right scale is hard to pin down, and it is preferable when results do not depend too tightly on the exact value one picked. Topological data analysis seeks to address this issue by focusing on the notion of neighbourhood instead of distance. It is shown that in some cases a simpler solution is available: dimensionality reduction can be used to check how strongly distance relationships depend on a hyperparameter. A variant of dynamical multi-dimensional scaling (MDS), called continuous MDS (cMDS), is formulated, which embeds datapoints as curves. The resulting algorithm is based on the Concave-Convex Procedure (CCCP) and provides a simple and efficient way of visualizing changes and invariances in distance patterns as a hyperparameter is varied. A variant for analyzing the dependence on multiple hyperparameters is also presented. The cMDS algorithm is straightforward to implement, use, and extend. To illustrate its possibilities, cMDS is applied to several real-world data sets.
    Comment: This manuscript is accepted for publication in 'Computational Statistics and Data Analysis'.
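
    The sketch below is only a crude stand-in for the visualization goal: it embeds the data separately for each value of a distance hyperparameter (here, a smoothing scale on 1-D signals) with classical MDS and Procrustes-aligns successive embeddings so that each datapoint traces a curve. It is not the CCCP-based cMDS algorithm of the paper; all data and parameters are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(8)

    # Illustrative data: a handful of sinusoidal signals with different frequencies.
    n, length = 30, 100
    t = np.linspace(0, 1, length)
    signals = np.array([np.sin(2 * np.pi * (3 + i % 5) * t + rng.normal()) for i in range(n)])

    def blurred_distances(scale):
        # Distance matrix after Gaussian smoothing at the given scale (the hyperparameter).
        kernel = np.exp(-0.5 * (np.arange(-20, 21) / max(scale, 1e-6)) ** 2)
        kernel /= kernel.sum()
        smooth = np.array([np.convolve(s, kernel, mode="same") for s in signals])
        return np.linalg.norm(smooth[:, None, :] - smooth[None, :, :], axis=2)

    def classical_mds(D, dim=2):
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J
        vals, vecs = np.linalg.eigh(B)
        idx = np.argsort(vals)[::-1][:dim]
        return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

    curves, prev = [], None
    for scale in np.linspace(0.5, 8.0, 20):        # the hyperparameter being varied
        Y = classical_mds(blurred_distances(scale))
        if prev is not None:                       # Procrustes-align to the previous frame
            U, _, Vt = np.linalg.svd(Y.T @ prev)
            Y = Y @ (U @ Vt)
        curves.append(Y)
        prev = Y

    curves = np.stack(curves)                      # shape: (n_scales, n_points, 2)
    print("curve array shape:", curves.shape)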

    Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework

    We approach the continuous-time mean-variance (MV) portfolio selection problem with reinforcement learning (RL). The problem is to achieve the best tradeoff between exploration and exploitation, and is formulated as an entropy-regularized, relaxed stochastic control problem. We prove that the optimal feedback policy for this problem must be Gaussian, with time-decaying variance. We then establish connections between the entropy-regularized MV problem and the classical MV problem, including the solvability equivalence and the convergence as the exploration weighting parameter decays to zero. Finally, we prove a policy improvement theorem, based on which we devise an implementable RL algorithm. We find that our algorithm outperforms both an adaptive-control-based method and a deep neural network based algorithm by a large margin in our simulations.
    Comment: 39 pages, 5 figures
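
    A minimal sketch of what sampling from a Gaussian feedback policy with time-decaying variance looks like in a discretized wealth process. The drift, volatility, policy mean, and decay rate are placeholder choices and are not the closed-form solution derived in the paper.

    import numpy as np

    rng = np.random.default_rng(9)

    # Discretized wealth dynamics dX = [rX + u(mu - r)] dt + u*sigma dW, with the amount u
    # invested in the risky asset sampled from a Gaussian policy whose variance decays in t.
    T, n_steps, n_paths = 1.0, 100, 2000
    dt = T / n_steps
    mu, sigma, r = 0.08, 0.2, 0.02          # illustrative risky-asset drift/vol and risk-free rate
    target = 1.2                             # illustrative wealth target

    def policy_mean(t, x):
        return (mu - r) / sigma ** 2 * (target - x)     # placeholder feedback rule

    def policy_std(t):
        return 0.5 * np.exp(-2.0 * t)                    # exploration decays over time

    wealth = np.ones(n_paths)
    for k in range(n_steps):
        t = k * dt
        u = policy_mean(t, wealth) + policy_std(t) * rng.normal(size=n_paths)   # sampled actions
        dW = np.sqrt(dt) * rng.normal(size=n_paths)
        wealth += r * wealth * dt + u * ((mu - r) * dt + sigma * dW)

    print("terminal wealth mean/var:", round(wealth.mean(), 4), round(wealth.var(), 4))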