5,963 research outputs found
Pattern Search Multidimensional Scaling
We present a novel view of nonlinear manifold learning using derivative-free
optimization techniques. Specifically, we propose an extension of the classical
multi-dimensional scaling (MDS) method, where instead of performing gradient
descent, we sample and evaluate possible "moves" in a sphere of fixed radius
for each point in the embedded space. A fixed-point convergence guarantee can
be shown by formulating the proposed algorithm as an instance of the General
Pattern Search (GPS) framework. Evaluation on both clean and noisy synthetic
datasets shows that pattern search MDS can accurately infer the intrinsic
geometry of manifolds embedded in high-dimensional spaces. Additionally,
experiments on real data, even under noisy conditions, demonstrate that the
proposed pattern search MDS yields state-of-the-art results. Comment: 36 pages, under review for JMLR.
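As an illustration of the move-sampling step described in this abstract, here is a minimal NumPy sketch (not the authors' implementation; the raw stress function, radius schedule, and parameter names are assumptions made for illustration):

    import numpy as np

    def stress(X, D):
        # Raw MDS stress: squared mismatch between embedded and target distances.
        diff = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) - D
        return np.sum(diff ** 2)

    def pattern_search_mds(D, dim=2, radius=1.0, shrink=0.5, n_moves=8, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        n = D.shape[0]
        X = rng.standard_normal((n, dim))
        while radius > tol:
            improved = False
            for i in range(n):
                best = stress(X, D)
                # Sample candidate "moves" on a sphere of fixed radius around point i.
                for _ in range(n_moves):
                    step = rng.standard_normal(dim)
                    step *= radius / np.linalg.norm(step)
                    X_try = X.copy()
                    X_try[i] += step
                    s = stress(X_try, D)
                    if s < best:
                        best, X = s, X_try
                        improved = True
            if not improved:
                radius *= shrink   # GPS-style refinement when no sampled move helps
        return X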
Automata Theory Meets Barrier Certificates: Temporal Logic Verification of Nonlinear Systems
We consider temporal logic verification of (possibly nonlinear) dynamical
systems evolving over continuous state spaces. Our approach combines
automata-based verification and the use of so-called barrier certificates.
automata-based verification and the use of so-called barrier certificates.
Automata-based verification allows the decomposition of the verification task into
a finite collection of simpler constraints over the continuous state space. The
satisfaction of these constraints in turn can be (potentially conservatively)
proved by appropriately constructed barrier certificates. As a result, our
approach, together with optimization-based search for barrier certificates,
allows computational verification of dynamical systems against temporal logic
properties while avoiding explicit abstractions of the dynamics as commonly
done in the literature.
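For reference, one common formulation of a safety barrier certificate B for dynamics $\dot{x} = f(x)$, initial set $X_0$, and unsafe set $X_u$ is the following (a simplified form; the constraints produced by the automaton decomposition in the paper are more general):

    B(x) \le 0 \quad \text{for all } x \in X_0, \qquad
    B(x) > 0 \quad \text{for all } x \in X_u, \qquad
    \nabla B(x) \cdot f(x) \le 0 \quad \text{for all } x.

Since B is non-increasing along trajectories, any trajectory starting in $X_0$ keeps $B \le 0$ and therefore never enters $X_u$.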
A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models
Beam search is a desirable choice of test-time decoding algorithm for neural
sequence models because it potentially avoids search errors made by simpler
greedy methods. However, typical cross entropy training procedures for these
models do not directly consider the behaviour of the final decoding method. As
a result, for cross-entropy trained models, beam decoding can sometimes yield
reduced test performance when compared with greedy decoding. In order to train
models that can more effectively make use of beam search, we propose a new
training procedure that focuses on the final loss metric (e.g. Hamming loss)
evaluated on the output of beam search. While well-defined, this "direct loss"
objective is itself discontinuous and thus difficult to optimize. Hence, in our
approach, we form a sub-differentiable surrogate objective by introducing a
novel continuous approximation of the beam search decoding procedure. In
experiments, we show that optimizing this new training objective yields
substantially better results on two sequence tasks (Named Entity Recognition
and CCG Supertagging) when compared with both cross entropy trained greedy
decoding and cross entropy trained beam decoding baselines. Comment: Updated for clarity and notational consistency.
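As a toy illustration of the kind of continuous relaxation involved (not the paper's exact surrogate), a hard argmax inside a decoding step can be replaced by a temperature-controlled soft-argmax so that gradients flow through the selection:

    import numpy as np

    def soft_argmax(scores, temperature=0.05):
        # As temperature -> 0 this approaches a one-hot argmax, but stays differentiable.
        z = scores / temperature
        z -= z.max()                  # numerical stability
        w = np.exp(z)
        return w / w.sum()

    scores = np.array([1.2, 3.4, 0.7])
    weights = soft_argmax(scores)
    soft_choice = weights @ np.arange(len(scores))   # "soft" index instead of a hard pick
    print(weights, soft_choice)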
A convergent hierarchy of non-linear eigenproblems to compute the joint spectral radius of nonnegative matrices
We show that the joint spectral radius of a finite collection of nonnegative
matrices can be bounded by the eigenvalue of a non-linear operator. This
eigenvalue coincides with the ergodic constant of a risk-sensitive control
problem, or of an entropy game, in which the state space consists of all
switching sequences of a given length. We show that, by increasing this length,
we arrive at a convergent approximation scheme to compute the joint spectral
radius. The complexity of this method is exponential in the length of the
switching sequences, but it is quite insensitive to the size of the matrices,
allowing us to solve very large scale instances (several matrices in dimensions
of order 1000 within a minute). The key idea of this method is to replace a
hierarchy of optimization problems, introduced by Ahmadi, Jungers, Parrilo and
Roozbehani, by a hierarchy of nonlinear eigenproblems. To solve the latter
eigenproblems, we introduce a projective version of Krasnoselskii-Mann
iteration. This method is of independent interest as it applies more generally
to the nonlinear eigenproblem for a monotone positively homogeneous map. Here,
this method allows for scalability by avoiding the recourse to linear or
semidefinite programming techniques. Comment: 18 pages.
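For intuition only, here is a sketch of a damped, renormalized (Krasnoselskii-Mann style) fixed-point iteration applied to the simplest monotone, positively homogeneous map T(x) = Ax for a single nonnegative matrix; the paper's operator, coming from the risk-sensitive control / entropy-game formulation, is considerably more general:

    import numpy as np

    def projective_km(T, x0, alpha=0.5, tol=1e-10, max_iter=10_000):
        x = x0 / np.sum(x0)
        for _ in range(max_iter):
            y = (1 - alpha) * x + alpha * T(x)   # Krasnoselskii-Mann averaging
            y /= np.sum(y)                       # projective step: renormalize
            if np.max(np.abs(y - x)) < tol:
                break
            x = y
        lam = np.sum(T(x)) / np.sum(x)           # eigenvalue estimate at the fixed point
        return lam, x

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    lam, v = projective_km(lambda x: A @ x, np.ones(2))
    print(lam)   # close to the Perron root of A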
Why gradient clipping accelerates training: A theoretical justification for adaptivity
We provide a theoretical explanation for the effectiveness of gradient
clipping in training deep neural networks. The key ingredient is a new
smoothness condition derived from practical neural network training examples.
We observe that gradient smoothness, a concept central to the analysis of
first-order optimization algorithms and often assumed to be a constant, in fact
varies significantly along the training trajectory of deep
neural networks. Further, this smoothness positively correlates with the
gradient norm, and contrary to standard assumptions in the literature, it can
grow with the norm of the gradient. These empirical observations limit the
applicability of existing theoretical analyses of algorithms that rely on a
fixed bound on smoothness. These observations motivate us to introduce a novel
relaxation of gradient smoothness that is weaker than the commonly used
Lipschitz smoothness assumption. Under the new condition, we prove that two
popular methods, namely, \emph{gradient clipping} and \emph{normalized
gradient}, converge arbitrarily faster than gradient descent with fixed
stepsize. We further explain why such adaptively scaled gradient methods can
accelerate empirical convergence and verify our results empirically in popular
neural network training settings.
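For concreteness, the two update rules compared with fixed-stepsize gradient descent can be sketched as follows on a toy objective (names, constants, and the objective are illustrative assumptions):

    import numpy as np

    def grad(x):                       # gradient of f(x) = 0.5 * ||x||^2
        return x

    def clipped_gd_step(x, lr=0.1, clip=1.0):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm > clip:                # gradient clipping: rescale when the norm is large
            g = g * (clip / norm)
        return x - lr * g

    def normalized_gd_step(x, lr=0.1, eps=1e-12):
        g = grad(x)
        return x - lr * g / (np.linalg.norm(g) + eps)   # normalized gradient: unit-length step

    x = np.array([10.0, -4.0])
    for _ in range(100):
        x = clipped_gd_step(x)
    print(x)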
Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes
We extend the neural Turing machine (NTM) model into a dynamic neural Turing
machine (D-NTM) by introducing a trainable memory addressing scheme. This
addressing scheme maintains two separate vectors for each memory cell: a content
vector and an address vector. This allows the D-NTM to learn a wide variety of
location-based addressing strategies including both linear and nonlinear ones.
We implement the D-NTM with both continuous, differentiable and discrete,
non-differentiable read/write mechanisms. We investigate the mechanisms and
effects of learning to read and write into a memory through experiments on
Facebook bAbI tasks using both a feedforward and a GRU controller. The D-NTM is
evaluated on a set of Facebook bAbI tasks and shown to outperform NTM and LSTM
baselines. We have performed an extensive analysis of our model and of different
NTM variations on the bAbI tasks. We also provide further experimental results on
sequential pMNIST, Stanford Natural Language Inference, associative recall and
copy tasks. Comment: 13 pages, 3 figures.
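A rough sketch of the addressing idea, maintaining a content vector and a trainable address vector per memory cell (the shapes and the scoring rule below are assumptions for illustration, not the paper's exact design):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    n_cells, d_content, d_address = 8, 16, 8
    content = np.random.randn(n_cells, d_content)   # content part of each memory cell
    address = np.random.randn(n_cells, d_address)   # trainable address part of each cell

    key = np.random.randn(d_address)                # key emitted by the controller
    weights = softmax(address @ key)                # soft (differentiable) addressing
    read_vector = weights @ content                 # soft read over all cells

    hard_read = content[np.argmax(weights)]         # discrete, non-differentiable read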
End-to-End Learning for Structured Prediction Energy Networks
Structured Prediction Energy Networks (SPENs) are a simple, yet expressive
family of structured prediction models (Belanger and McCallum, 2016). An energy
function over candidate structured outputs is given by a deep network, and
predictions are formed by gradient-based optimization. This paper presents
end-to-end learning for SPENs, where the energy function is discriminatively
trained by back-propagating through gradient-based prediction. In our
experience, the approach is substantially more accurate than the structured SVM
method of Belanger and McCallum (2016), as it allows us to use more
sophisticated non-convex energies. We provide a collection of techniques for
improving the speed, accuracy, and memory requirements of end-to-end SPENs, and
demonstrate the power of our method on 7-Scenes image denoising and CoNLL-2005
semantic role labeling tasks. In both, inexact minimization of non-convex SPEN
energies is superior to baseline methods that use simplistic energy functions
that can be minimized exactly. Comment: ICML 2017.
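A minimal sketch of the gradient-based prediction step in a SPEN, with a stand-in quadratic energy (end-to-end training would backpropagate a task loss through these unrolled steps, which is omitted here; the energy below is not the paper's architecture):

    import numpy as np

    def energy(x, y, W):
        # toy energy: quadratic coupling between input features and output variables
        return 0.5 * np.sum((y - W @ x) ** 2) + 0.1 * np.sum(y ** 2)

    def energy_grad_y(x, y, W):
        return (y - W @ x) + 0.2 * y

    def predict(x, W, steps=20, lr=0.5):
        y = np.zeros(W.shape[0])
        for _ in range(steps):           # unrolled gradient-based inference
            y = y - lr * energy_grad_y(x, y, W)
        return y

    W = np.random.randn(3, 5) * 0.1
    x = np.random.randn(5)
    print(predict(x, W))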
Trainable Time Warping: Aligning Time-Series in the Continuous-Time Domain
Dynamic time warping (DTW) calculates the similarity or alignment between two signals, subject to
temporal warping. However, its computational complexity grows exponentially
with the number of time-series. Although there have been algorithms developed
that are linear in the number of time-series, they are generally quadratic in
time-series length. The exception is generalized time warping (GTW), which has
linear computational cost. Yet, it can only identify simple time warping
functions. There is a need for a new fast, high-quality multisequence alignment
algorithm. We introduce trainable time warping (TTW), whose complexity is
linear in both the number and the length of time-series. TTW performs alignment
in the continuous-time domain using a sinc convolutional kernel and a
gradient-based optimization technique. We compare TTW and GTW on 85 UCR
datasets in time-series averaging and classification. TTW outperforms GTW on
67.1% of the datasets for the averaging tasks, and 61.2% of the datasets for
the classification tasks. Comment: ICASSP 2019.
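A small sketch of the sinc-interpolation primitive that evaluates a sampled signal at arbitrary (warped) continuous time points; in TTW the warping function itself would be the trainable part, which is omitted here, and the interface below is our assumption:

    import numpy as np

    def sinc_resample(signal, warped_times):
        n = np.arange(len(signal))
        # each warped time is a weighted sum of all samples with sinc weights
        weights = np.sinc(warped_times[:, None] - n[None, :])
        return weights @ signal

    t = np.arange(50, dtype=float)
    signal = np.sin(0.3 * t)
    warped = sinc_resample(signal, 0.9 * t + 1.7)   # evaluate at non-integer warped times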
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research on techniques that diminish noise in the stochastic
directions and methods that make use of second-order derivative approximations.
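For reference, the basic stochastic gradient (SG) update discussed in the survey, sketched on a toy least-squares objective (the problem instance and step size are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 20))
    w_true = rng.standard_normal(20)
    b = A @ w_true + 0.1 * rng.standard_normal(1000)

    def stochastic_grad(w, i):
        # gradient of the single-sample loss 0.5 * (a_i^T w - b_i)^2
        return (A[i] @ w - b[i]) * A[i]

    w = np.zeros(20)
    lr = 0.01
    for k in range(10_000):
        i = rng.integers(len(b))            # pick one example at random
        w -= lr * stochastic_grad(w, i)     # SG step

    print(np.linalg.norm(w - w_true))       # small, up to stochastic-gradient noise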
A gradient-type algorithm for constrained optimization with applications to multi-objective optimization of auxetic materials
An algorithm is devised for solving minimization problems with equality
constraints. The algorithm uses first-order derivatives of both the objective
function and the constraints. The step is computed as a sum between a
steepest-descent step (which minimizes the objective functional) and a
correction step related to the Newton method (which aims to solve the equality
constraints). The linear combination of these two steps involves
coefficients similar to Lagrange multipliers which are computed in a natural
way based on the Newton method. The algorithm uses no projection and thus the
iterates are not feasible; the constraints are satisfied only in the limit
(after convergence). This algorithm was proposed by one of the authors in a
previous paper. In the present paper, a local convergence result is proven for
a general non-linear setting, where both the objective functional and the
constraints are not necessarily convex functions. The algorithm is extended, by
means of an active set strategy, to account also for inequality constraints and
to address minimax problems. The method is then applied to the optimization of
periodic microstructures for obtaining homogenized elastic tensors having
negative Poisson ratio (so-called auxetic materials) using shape and/or
topology variations in the model hole. In previous works of the same authors,
anisotropic homogenized tensors have been obtained which exhibit negative
Poisson ratio in a prescribed direction of the plane. In the present work, a
new approach is proposed, that employs multi-objective optimization in order to
minimize the Poisson ratio of the (possibly anisotropic) homogenized elastic
tensor in several prescribed directions of the plane. Numerical examples are
presented. Comment: 32 pages, 7 figures.
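A hedged sketch of the kind of step described: a descent direction for the objective combined with a Newton-type correction toward the equality constraints, with multiplier-like coefficients obtained by least squares (this particular formula is a simplification for illustration, not necessarily the authors' rule):

    import numpy as np

    def step(x, gradJ, g, Jg, alpha=0.1):
        # gradJ: gradient of the objective at x
        # g:     vector of equality-constraint values at x
        # Jg:    Jacobian of the constraints at x (m x n)
        lam = np.linalg.lstsq(Jg.T, gradJ, rcond=None)[0]    # multiplier-like coefficients
        descent = -(gradJ - Jg.T @ lam)                      # descent in the constraint tangent
        newton = -np.linalg.lstsq(Jg, g, rcond=None)[0]      # Newton-type correction for g(x)=0
        return x + alpha * descent + newton

    # toy problem: minimize ||x||^2 subject to x0 + x1 - 1 = 0
    x = np.array([2.0, -1.0])
    for _ in range(50):
        gradJ = 2 * x
        g = np.array([x[0] + x[1] - 1.0])
        Jg = np.array([[1.0, 1.0]])
        x = step(x, gradJ, g, Jg)
    print(x)   # approaches (0.5, 0.5); intermediate iterates need not be feasible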