Geometric Insights into the Convergence of Nonlinear TD Learning
While there are convergence guarantees for temporal difference (TD) learning
when using linear function approximators, the situation for nonlinear models is
far less understood, and divergent examples are known. Here we take a first
step towards extending theoretical convergence guarantees to TD learning with
nonlinear function approximation. More precisely, we consider the expected
learning dynamics of the TD(0) algorithm for value estimation. As the step-size
converges to zero, these dynamics are defined by a nonlinear ODE which depends
on the geometry of the space of function approximators, the structure of the
underlying Markov chain, and their interaction. We find a set of function
approximators that includes ReLU networks and has geometry amenable to TD
learning regardless of environment, so that the solution performs about as well
as linear TD in the worst case. Then, we show how environments that are more
reversible induce dynamics that are better for TD learning and prove global
convergence to the true value function for well-conditioned function
approximators. Finally, we generalize a divergent counterexample to a family of
divergent problems to demonstrate how the interaction between approximator and
environment can go wrong and to motivate the assumptions needed to prove
convergence.
Comment: ICLR 202
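As background for the expected TD(0) dynamics the abstract studies, here is a minimal sketch of the underlying stochastic TD(0) update for value estimation. The two-state Markov reward process, rewards, and step-size schedule are illustrative assumptions, not taken from the paper; the one-hot tabular parameterization is the simplest linear approximator.

```python
import random

# Hypothetical 2-state Markov reward process, used purely for illustration:
# from either state, jump to each state with probability 1/2.
P = [[0.5, 0.5], [0.5, 0.5]]
r = [1.0, 0.0]        # reward for leaving each state
gamma = 0.9           # discount factor

def td0(num_steps=200000, seed=0):
    rng = random.Random(seed)
    V = [0.0, 0.0]    # tabular values = linear approximation, one-hot features
    s = 0
    for t in range(num_steps):
        s_next = 0 if rng.random() < P[s][0] else 1
        alpha = 50.0 / (t + 500.0)        # step size shrinking toward zero
        V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])   # TD(0) update
        s = s_next
    return V

V = td0()   # true values solve V = r + gamma * P V, here V = (5.5, 4.5)
```

As the step size decays toward zero, trajectories of this stochastic recursion track the expected ODE dynamics that the paper analyzes.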
Sparse Inverse Problems Over Measures: Equivalence of the Conditional Gradient and Exchange Methods
We study an optimization program over nonnegative Borel measures that
encourages sparsity in its solution. Efficient solvers for this program are in
increasing demand, as it arises when learning from data generated by a
`continuum-of-subspaces' model, a recent trend with applications in signal
processing, machine learning, and high-dimensional statistics. We prove that
the conditional gradient method (CGM) applied to this infinite-dimensional
program, as proposed recently in the literature, is equivalent to the exchange
method (EM) applied to its Lagrangian dual, which is a semi-infinite program.
In doing so, we formally connect such infinite-dimensional programs to the
well-established field of semi-infinite programming.
On the one hand, the equivalence established in this paper allows us to
provide a rate of convergence for EM which is more general than those existing
in the literature. On the other hand, this connection and the resulting
geometric insights might in the future lead to the design of improved variants
of CGM for infinite-dimensional programs, which has been an active research
topic. CGM is also known as the Frank-Wolfe algorithm.
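For orientation, a minimal finite-dimensional sketch of the CGM/Frank-Wolfe iteration: at each step, a linear minimization oracle returns an extreme point and the iterate moves toward it. The quadratic objective over the probability simplex and the 2/(t+2) step schedule are standard illustrative choices, a toy stand-in for the measure-valued program in the paper.

```python
# Minimize f(x) = ||x - b||^2 over the probability simplex; since b lies in
# the simplex, the optimum is x* = b.
b = [0.2, 0.3, 0.5]
x = [1.0, 0.0, 0.0]                      # start at a vertex (a single atom)

for t in range(20000):
    grad = [2.0 * (xi - bi) for xi, bi in zip(x, b)]
    # linear minimization oracle over the simplex: the best vertex
    i_best = min(range(3), key=lambda i: grad[i])
    step = 2.0 / (t + 2.0)               # standard open-loop CGM step size
    x = [(1.0 - step) * xi + (step if i == i_best else 0.0)
         for i, xi in enumerate(x)]
```

Each iterate is a convex combination of a few vertices, which is the finite-dimensional analogue of the sparse (few-atom) measures CGM produces in the infinite-dimensional setting.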
Stochastic Gradient Based Extreme Learning Machines For Online Learning of Advanced Combustion Engines
In this article, a stochastic gradient based online learning algorithm for
Extreme Learning Machines (ELM) is developed (SG-ELM). A stability criterion
based on Lyapunov approach is used to prove both asymptotic stability of
estimation error and stability in the estimated parameters suitable for
identification of nonlinear dynamic systems. The developed algorithm not only
guarantees stability, but also reduces the computational demand compared to the
OS-ELM approach based on recursive least squares. In order to demonstrate the
effectiveness of the algorithm on a real-world scenario, an advanced combustion
engine identification problem is considered. The algorithm is applied to two
case studies: An online regression learning for system identification of a
Homogeneous Charge Compression Ignition (HCCI) Engine and an online
classification learning (with class imbalance) for identifying the dynamic
operating envelope of the HCCI Engine. The results indicate that the accuracy
of the proposed SG-ELM is comparable to that of the state-of-the-art but adds
stability and a reduction in computational effort.
Comment: This paper was written as an extract from my PhD thesis (July 2013), and so references may not be up to date as of this submission (Jan 2015). The article is under review and contains 10 figures and 35 references.
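A minimal sketch of the SG-ELM idea described above: the random hidden layer of an ELM is drawn once and frozen, and only the output weights are updated with stochastic gradient steps on streaming data, instead of OS-ELM's recursive least squares. The target function, network size, and learning rate are illustrative assumptions.

```python
import math
import random

rng = random.Random(0)
H = 20                                         # hidden neurons
# ELM: input weights and biases are random and then frozen
win = [(rng.uniform(-2, 2), rng.uniform(-1, 1)) for _ in range(H)]
beta = [0.0] * H                               # output weights: the trained part

def hidden(x):
    return [math.tanh(w * x + b) for w, b in win]

def predict(x):
    return sum(bb * hh for bb, hh in zip(beta, hidden(x)))

# online stream: one sample at a time, target y = x^2 on [-1, 1]
lr = 0.05
for _ in range(50000):
    x = rng.uniform(-1.0, 1.0)
    y = x * x
    h = hidden(x)
    err = sum(bb * hh for bb, hh in zip(beta, h)) - y
    for i in range(H):
        beta[i] -= lr * err * h[i]             # SGD step, output layer only

mse = sum((predict(k / 50.0) - (k / 50.0) ** 2) ** 2
          for k in range(-50, 51)) / 101
```

Because the update touches only a vector of output weights, each step is O(H), compared with the O(H^2) covariance update of recursive least squares in OS-ELM.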
A Simulation-Based Approach to Stochastic Dynamic Programming
In this paper we develop a simulation-based approach to stochastic dynamic programming. To solve the Bellman equation we construct Monte Carlo estimates of Q-values. Our method is scalable to high dimensions and works in both continuous and discrete state and decision spaces whilst avoiding discretization errors that plague traditional methods. We provide a geometric convergence rate. We illustrate our methodology with a dynamic stochastic investment problem.
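The core step above, Monte Carlo estimation of Q-values from a simulator, can be sketched in a few lines. The toy one-step simulator, its payoff probabilities, and the sample count are illustrative assumptions, not the paper's investment problem.

```python
import random

rng = random.Random(0)

def simulate(state, action):
    """Hypothetical one-step simulator returning a sampled reward:
    action 0 pays 1 with probability 0.7, action 1 with probability 0.4."""
    p = 0.7 if action == 0 else 0.4
    return 1.0 if rng.random() < p else 0.0

def q_estimate(state, action, n=20000):
    # Monte Carlo estimate of Q(s, a): average of n simulated returns
    return sum(simulate(state, action) for _ in range(n)) / n

q = [q_estimate("s0", a) for a in (0, 1)]
greedy = max((0, 1), key=lambda a: q[a])   # greedy action from the Q-estimates
```

In the full method, such estimates replace the exact expectations in the Bellman equation, so no discretization of the state or decision space is needed.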
Adaptive FISTA for Non-convex Optimization
In this paper we propose an adaptively extrapolated proximal gradient method,
which is based on the accelerated proximal gradient method (also known as
FISTA), however we locally optimize the extrapolation parameter by carrying out
an exact (or inexact) line search. It turns out that in some situations, the
proposed algorithm is equivalent to a class of SR1 (identity minus rank 1)
proximal quasi-Newton methods. Convergence is proved in a general non-convex
setting, and hence, as a byproduct, we also obtain new convergence guarantees
for proximal quasi-Newton methods. The efficiency of the new method is shown in
numerical experiments on a sparsity regularized non-linear inverse problem.
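The extrapolation line search can be illustrated on a tiny smooth problem. This sketch uses an inexact grid search over the extrapolation parameter; the quadratic objective, grid, and step size are illustrative assumptions, and the proximal term is omitted by taking g = 0, so the prox step reduces to a plain gradient step.

```python
# Minimize the smooth quadratic f(x) = 0.5*(a1*x1^2 + a2*x2^2) - b.x,
# whose minimizer is x* = (b1/a1, b2/a2) = (1.0, 0.1).
a = [1.0, 10.0]
b = [1.0, 1.0]
L = max(a)                # Lipschitz constant of the gradient

def f(x):
    return sum(0.5 * ai * xi * xi - bi * xi for ai, bi, xi in zip(a, b, x))

def grad(x):
    return [ai * xi - bi for ai, bi, xi in zip(a, b, x)]

x_prev = [0.0, 0.0]
x = [0.0, 0.0]
for _ in range(200):
    d = [xi - pi for xi, pi in zip(x, x_prev)]
    best_x, best_val = None, float("inf")
    # inexact line search: try extrapolation parameters on a coarse grid
    for k in range(11):
        beta = k / 10.0
        y = [xi + beta * di for xi, di in zip(x, d)]
        cand = [yi - gi / L for yi, gi in zip(y, grad(y))]
        val = f(cand)
        if val < best_val:
            best_x, best_val = cand, val
    x_prev, x = x, best_x
```

Unlike FISTA's fixed extrapolation schedule, the parameter here is chosen locally at every iteration, which is the adaptivity the paper builds on.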
Think globally, fit locally under the Manifold Setup: Asymptotic Analysis of Locally Linear Embedding
Since its introduction in 2000, the locally linear embedding (LLE) has been
widely applied in data science. We provide an asymptotic analysis of the LLE
under the manifold setup. We show that for the general manifold, asymptotically
we may not obtain the Laplace-Beltrami operator, and the result may depend on
the non-uniform sampling, unless a correct regularization is chosen. We also
derive the corresponding kernel function, which indicates that the LLE is not a
Markov process. A comparison with the other commonly applied nonlinear
algorithms, particularly the diffusion map, is provided, and its relationship
with the locally linear regression is also discussed.
Comment: 78 pages, 4 figures. We add a short discussion about the relation between epsilon and the intrinsic geometry of the manifold. We add a new section about the K-nearest-neighbor (KNN) scheme and a new subsection about errors in variables. We provide more numerical examples.
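The first stage of LLE, and the place where the regularization discussed above enters, is solving for barycentric reconstruction weights of each point from its neighbours. A minimal sketch for a single 2-D point with two neighbours follows; the coordinates and the regularization constant are illustrative assumptions.

```python
# a point and its two nearest neighbours (hypothetical coordinates)
x = (0.0, 0.0)
nbrs = [(1.0, 0.1), (-1.0, 0.2)]

# local Gram matrix C_ij = (x - n_i) . (x - n_j)
z = [(x[0] - n[0], x[1] - n[1]) for n in nbrs]
C = [[zi[0] * zj[0] + zi[1] * zj[1] for zj in z] for zi in z]

# regularize: C += eps * trace(C) * I, needed when C is (near-)singular
eps = 1e-3
tr = C[0][0] + C[1][1]
C[0][0] += eps * tr
C[1][1] += eps * tr

# solve C w = 1 (2x2 Cramer's rule), then normalize so sum(w) = 1
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
w = [(C[1][1] - C[0][1]) / det, (C[0][0] - C[1][0]) / det]
s = w[0] + w[1]
w = [wi / s for wi in w]
```

The paper's point is that the choice of eps is not innocuous: asymptotically it determines whether the embedding recovers the Laplace-Beltrami operator or some other operator.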
Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
We analyze algorithms for approximating a function f(x) = Phi x mapping R^d to R^d using deep linear neural networks, i.e. that learn a function h parameterized by matrices Theta_1, ..., Theta_L and defined by h(x) = Theta_L Theta_{L-1} ... Theta_1 x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic.
We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix Phi, in the case where the initial hypothesis Theta_1 = ... = Theta_L = I has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for Phi whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help.
If Phi is symmetric positive definite, we show that an algorithm that initializes Theta_i = I learns an epsilon-approximation of f using a number of updates polynomial in L, the condition number of Phi, and log(d/epsilon). In contrast, we show that if the least squares matrix Phi is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge.
We analyze an algorithm for the case that Phi satisfies u^T Phi u > 0 for all u, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^T Theta_L Theta_{L-1} ... Theta_1 u > 0 for all u, and another that "balances" the Theta_i so that they have the same singular values.
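The dichotomy above can be seen already in a scalar (d = 1) sketch: gradient descent from identity initialization on a product of L scalar "layers" reaches a positive target but stalls at zero for a negative one. The depth, learning rate, and targets are illustrative assumptions; a negative scalar plays the role of a symmetric target with a negative eigenvalue.

```python
def train(phi, layers=3, lr=0.05, iters=4000):
    """Gradient descent on 0.5*(theta_L*...*theta_1 - phi)^2 from identity
    initialization, with each 'matrix' a scalar (d = 1)."""
    theta = [1.0] * layers                    # identity initialization
    for _ in range(iters):
        prod = 1.0
        for t in theta:
            prod *= t
        err = prod - phi
        # d(loss)/d(theta_i) = err * (product of the other layers)
        grads = [err * (prod / t) for t in theta]
        theta = [t - lr * g for t, g in zip(theta, grads)]
    prod = 1.0
    for t in theta:
        prod *= t
    return prod
```

For phi = 2 the product converges to the target; for phi = -1 each layer shrinks toward zero and the product gets stuck near 0, so the iterates never cross into negative territory, mirroring the negative-eigenvalue failure mode.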
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research on techniques that diminish noise in the stochastic
directions and methods that make use of second-order derivative approximations.
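A minimal sketch of the SG method the survey centers on: one randomly sampled example per iteration with a diminishing step size. The noiseless linear model, dataset size, and step-size schedule are illustrative assumptions.

```python
import random

rng = random.Random(0)
data = []
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + 1.0))        # synthetic targets: w* = 3, c* = 1

w, c = 0.0, 0.0
for t in range(20000):
    x, y = data[rng.randrange(len(data))]  # stochastic gradient: one example
    g = (w * x + c) - y                    # residual of 0.5*(w*x + c - y)^2
    lr = 0.5 / (1.0 + 0.01 * t)            # diminishing step-size sequence
    w -= lr * g * x
    c -= lr * g
```

Each step costs O(1) regardless of the dataset size, which is the property that makes SG the workhorse in the large-scale setting the paper describes; the noise-reduction and second-order streams of research it surveys modify exactly this update.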
Visualizing the Effects of a Changing Distance on Data Using Continuous Embeddings
Most Machine Learning (ML) methods, from clustering to classification, rely
on a distance function to describe relationships between datapoints. For
complex datasets it is hard to avoid making some arbitrary choices when
defining a distance function. To compare images, one must choose a spatial
scale, for signals, a temporal scale. The right scale is hard to pin down and
it is preferable when results do not depend too tightly on the exact value one
picked. Topological data analysis seeks to address this issue by focusing on
the notion of neighbourhood instead of distance. It is shown that in some cases a simpler solution is available: dimensionality reduction can be used to check how strongly distance relationships depend on a hyperparameter. A
variant of dynamical multi-dimensional scaling (MDS) is formulated, which
embeds datapoints as curves. The resulting algorithm is based on the
Concave-Convex Procedure (CCCP) and provides a simple and efficient way of
visualizing changes and invariances in distance patterns as a hyperparameter is
varied. A variant to analyze the dependence on multiple hyperparameters is also
presented. A cMDS algorithm that is straightforward to implement, use and
extend is provided. To illustrate the possibilities of cMDS, cMDS is applied to
several real-world data sets.
Comment: This manuscript is accepted for publication in 'Computational Statistics and Data Analysis'.
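The static building block that cMDS extends from points to curves is metric MDS. A minimal sketch fits a 1-D embedding by gradient descent on the stress; the target distances, initialization, and step size are illustrative assumptions, and this is plain MDS rather than the CCCP-based cMDS of the paper.

```python
# target pairwise distances among three points (embeddable on a line)
D = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}
pos = [0.0, 0.5, 1.0]                    # initial 1-D embedding

for _ in range(5000):
    grad = [0.0, 0.0, 0.0]
    for (i, j), d in D.items():
        diff = pos[i] - pos[j]
        dist = abs(diff) or 1e-12
        # gradient of (dist - d)^2 with respect to pos[i] (and -g for pos[j])
        g = 2.0 * (dist - d) * (diff / dist)
        grad[i] += g
        grad[j] -= g
    pos = [p - 0.01 * g for p, g in zip(pos, grad)]
```

In cMDS, each embedded point becomes a short curve indexed by the hyperparameter, so running this fit across a range of distance functions and tracing how the positions move gives the visualization the paper proposes.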
Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework
We approach the continuous-time mean-variance (MV) portfolio selection with
reinforcement learning (RL). The problem is to achieve the best tradeoff
between exploration and exploitation, and is formulated as an
entropy-regularized, relaxed stochastic control problem. We prove that the
optimal feedback policy for this problem must be Gaussian, with time-decaying
variance. We then establish connections between the entropy-regularized MV and
the classical MV, including the solvability equivalence and the convergence as
exploration weighting parameter decays to zero. Finally, we prove a policy
improvement theorem, based on which we devise an implementable RL algorithm. We
find that our algorithm outperforms both an adaptive control based method and a
deep neural networks based algorithm by a large margin in our simulations.
Comment: 39 pages, 5 figures.
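The role of the exploration weight can be seen in a one-shot analogue: maximizing expected reward plus lambda times policy entropy over Gaussian policies, where the optimal variance scales with lambda and vanishes as lambda decays to zero, mirroring the convergence result above. The quadratic reward and grid search are illustrative assumptions, not the paper's continuous-time control problem.

```python
import math

lam = 0.5   # exploration weight (entropy regularization strength)

def objective(sigma):
    # Gaussian policy N(0, sigma^2) on the reward r(a) = -a^2:
    # E[r] = -sigma^2, entropy = 0.5 * log(2*pi*e*sigma^2)
    return -sigma ** 2 + lam * 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# grid search for the entropy-optimal exploration level
grid = [0.01 * k for k in range(1, 300)]
best = max(grid, key=objective)
# closed form: sigma^2 = lam / 2, i.e. sigma = 0.5 when lam = 0.5
```

Setting the derivative to zero gives sigma^2 = lam/2, so the exploratory noise shrinks linearly with the exploration weight, the same qualitative behavior as the time-decaying Gaussian policy the paper proves optimal.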