5,963 research outputs found
Pattern Search Multidimensional Scaling
We present a novel view of nonlinear manifold learning using derivative-free
optimization techniques. Specifically, we propose an extension of the classical
multi-dimensional scaling (MDS) method, where instead of performing gradient
descent, we sample and evaluate possible "moves" in a sphere of fixed radius
for each point in the embedded space. A fixed-point convergence guarantee can
be shown by formulating the proposed algorithm as an instance of the General
Pattern Search (GPS) framework. Evaluation on both clean and noisy synthetic
datasets shows that pattern search MDS can accurately infer the intrinsic
geometry of manifolds embedded in high-dimensional spaces. Additionally,
experiments on real data, even under noisy conditions, demonstrate that the
proposed pattern search MDS yields state-of-the-art results. Comment: 36 pages, under review for JMLR.
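As an illustration of the move-sampling step described in this abstract, here is a minimal NumPy sketch (not the authors' implementation; the raw stress function, radius schedule, and parameter names are assumptions made for illustration):

    import numpy as np

    def stress(X, D):
        # Raw MDS stress: squared mismatch between embedded and target distances.
        diff = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) - D
        return np.sum(diff ** 2)

    def pattern_search_mds(D, dim=2, radius=1.0, shrink=0.5, n_moves=8, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        n = D.shape[0]
        X = rng.standard_normal((n, dim))
        while radius > tol:
            improved = False
            for i in range(n):
                best = stress(X, D)
                # Sample candidate "moves" on a sphere of fixed radius around point i.
                for _ in range(n_moves):
                    step = rng.standard_normal(dim)
                    step *= radius / np.linalg.norm(step)
                    X_try = X.copy()
                    X_try[i] += step
                    s = stress(X_try, D)
                    if s < best:
                        best, X = s, X_try
                        improved = True
            if not improved:
                radius *= shrink   # GPS-style refinement when no sampled move helps
        return X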
Automata Theory Meets Barrier Certificates: Temporal Logic Verification of Nonlinear Systems
We consider temporal logic verification of (possibly nonlinear) dynamical
systems evolving over continuous state spaces. Our approach combines
automata-based verification and the use of so-called barrier certificates.
automata-based verification and the use of so-called barrier certificates.
Automata-based verification allows the decomposition of the verification task into
a finite collection of simpler constraints over the continuous state space. The
satisfaction of these constraints in turn can be (potentially conservatively)
proved by appropriately constructed barrier certificates. As a result, our
approach, together with optimization-based search for barrier certificates,
allows computational verification of dynamical systems against temporal logic
properties while avoiding explicit abstractions of the dynamics as commonly
done in the literature.
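For reference, one common formulation of a safety barrier certificate B for dynamics $\dot{x} = f(x)$, initial set $X_0$, and unsafe set $X_u$ is the following (a simplified form; the constraints produced by the automaton decomposition in the paper are more general):

    B(x) \le 0 \quad \text{for all } x \in X_0, \qquad
    B(x) > 0 \quad \text{for all } x \in X_u, \qquad
    \nabla B(x) \cdot f(x) \le 0 \quad \text{for all } x.

Since B is non-increasing along trajectories, any trajectory starting in $X_0$ keeps $B \le 0$ and therefore never enters $X_u$.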
A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models
Beam search is a desirable choice of test-time decoding algorithm for neural
sequence models because it potentially avoids search errors made by simpler
greedy methods. However, typical cross entropy training procedures for these
models do not directly consider the behaviour of the final decoding method. As
a result, for cross-entropy trained models, beam decoding can sometimes yield
reduced test performance when compared with greedy decoding. In order to train
models that can more effectively make use of beam search, we propose a new
training procedure that focuses on the final loss metric (e.g. Hamming loss)
evaluated on the output of beam search. While well-defined, this "direct loss"
objective is itself discontinuous and thus difficult to optimize. Hence, in our
approach, we form a sub-differentiable surrogate objective by introducing a
novel continuous approximation of the beam search decoding procedure. In
experiments, we show that optimizing this new training objective yields
substantially better results on two sequence tasks (Named Entity Recognition
and CCG Supertagging) when compared with both cross entropy trained greedy
decoding and cross entropy trained beam decoding baselines. Comment: Updated for clarity and notational consistency.
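As a toy illustration of the kind of continuous relaxation involved (not the paper's exact surrogate), a hard argmax inside a decoding step can be replaced by a temperature-controlled soft-argmax so that gradients flow through the selection:

    import numpy as np

    def soft_argmax(scores, temperature=0.05):
        # As temperature -> 0 this approaches a one-hot argmax, but stays differentiable.
        z = scores / temperature
        z -= z.max()                  # numerical stability
        w = np.exp(z)
        return w / w.sum()

    scores = np.array([1.2, 3.4, 0.7])
    weights = soft_argmax(scores)
    soft_choice = weights @ np.arange(len(scores))   # "soft" index instead of a hard pick
    print(weights, soft_choice)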
A convergent hierarchy of non-linear eigenproblems to compute the joint spectral radius of nonnegative matrices
We show that the joint spectral radius of a finite collection of nonnegative
matrices can be bounded by the eigenvalue of a non-linear operator. This
eigenvalue coincides with the ergodic constant of a risk-sensitive control
problem, or of an entropy game, in which the state space consists of all
switching sequences of a given length. We show that, by increasing this length,
we arrive at a convergent approximation scheme to compute the joint spectral
radius. The complexity of this method is exponential in the length of the
switching sequences, but it is quite insensitive to the size of the matrices,
allowing us to solve very large scale instances (several matrices in dimensions
of order 1000 within a minute). The key idea of this method is to replace a
hierarchy of optimization problems, introduced by Ahmadi, Jungers, Parrilo and
Roozbehani, by a hierarchy of nonlinear eigenproblems. To solve the latter
eigenproblems, we introduce a projective version of Krasnoselskii-Mann
iteration. This method is of independent interest as it applies more generally
to the nonlinear eigenproblem for a monotone positively homogeneous map. Here,
this method allows for scalability by avoiding the recourse to linear or
semidefinite programming techniques. Comment: 18 pages.
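For intuition only, here is a sketch of a damped, renormalized (Krasnoselskii-Mann style) fixed-point iteration applied to the simplest monotone, positively homogeneous map T(x) = Ax for a single nonnegative matrix; the paper's operator, coming from the risk-sensitive control / entropy-game formulation, is considerably more general:

    import numpy as np

    def projective_km(T, x0, alpha=0.5, tol=1e-10, max_iter=10_000):
        x = x0 / np.sum(x0)
        for _ in range(max_iter):
            y = (1 - alpha) * x + alpha * T(x)   # Krasnoselskii-Mann averaging
            y /= np.sum(y)                       # projective step: renormalize
            if np.max(np.abs(y - x)) < tol:
                break
            x = y
        lam = np.sum(T(x)) / np.sum(x)           # eigenvalue estimate at the fixed point
        return lam, x

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    lam, v = projective_km(lambda x: A @ x, np.ones(2))
    print(lam)   # close to the Perron root of A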
Why gradient clipping accelerates training: A theoretical justification for adaptivity
We provide a theoretical explanation for the effectiveness of gradient
clipping in training deep neural networks. The key ingredient is a new
smoothness condition derived from practical neural network training examples.
We observe that gradient smoothness, a concept central to the analysis of
first-order optimization algorithms and often assumed to be a constant, in fact
varies significantly along the training trajectory of deep
neural networks. Further, this smoothness positively correlates with the
gradient norm, and contrary to standard assumptions in the literature, it can
grow with the norm of the gradient. These empirical observations limit the
applicability of existing theoretical analyses of algorithms that rely on a
fixed bound on smoothness. These observations motivate us to introduce a novel
relaxation of gradient smoothness that is weaker than the commonly used
Lipschitz smoothness assumption. Under the new condition, we prove that two
popular methods, namely, \emph{gradient clipping} and \emph{normalized
gradient}, converge arbitrarily faster than gradient descent with fixed
stepsize. We further explain why such adaptively scaled gradient methods can
accelerate empirical convergence and verify our results empirically in popular
neural network training settings.
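For concreteness, the two update rules compared with fixed-stepsize gradient descent can be sketched as follows on a toy objective (names, constants, and the objective are illustrative assumptions):

    import numpy as np

    def grad(x):                       # gradient of f(x) = 0.5 * ||x||^2
        return x

    def clipped_gd_step(x, lr=0.1, clip=1.0):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm > clip:                # gradient clipping: rescale when the norm is large
            g = g * (clip / norm)
        return x - lr * g

    def normalized_gd_step(x, lr=0.1, eps=1e-12):
        g = grad(x)
        return x - lr * g / (np.linalg.norm(g) + eps)   # normalized gradient: unit-length step

    x = np.array([10.0, -4.0])
    for _ in range(100):
        x = clipped_gd_step(x)
    print(x)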
Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes
We extend the neural Turing machine (NTM) model into a dynamic neural Turing
machine (D-NTM) by introducing a trainable memory addressing scheme. This
addressing scheme maintains two separate vectors for each memory cell: a content
vector and an address vector. This allows the D-NTM to learn a wide variety of
location-based addressing strategies including both linear and nonlinear ones.
We implement the D-NTM with both continuous, differentiable and discrete,
non-differentiable read/write mechanisms. We investigate the mechanisms and
effects of learning to read and write into a memory through experiments on
Facebook bAbI tasks using both a feedforward and a GRU controller. The D-NTM is
evaluated on a set of Facebook bAbI tasks and shown to outperform NTM and LSTM
baselines. We have performed an extensive analysis of our model and of different
NTM variations on the bAbI tasks. We also provide further experimental results on
sequential pMNIST, Stanford Natural Language Inference, associative recall and
copy tasks. Comment: 13 pages, 3 figures.
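A rough sketch of the addressing idea, maintaining a content vector and a trainable address vector per memory cell (the shapes and the scoring rule below are assumptions for illustration, not the paper's exact design):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    n_cells, d_content, d_address = 8, 16, 8
    content = np.random.randn(n_cells, d_content)   # content part of each memory cell
    address = np.random.randn(n_cells, d_address)   # trainable address part of each cell

    key = np.random.randn(d_address)                # key emitted by the controller
    weights = softmax(address @ key)                # soft (differentiable) addressing
    read_vector = weights @ content                 # soft read over all cells

    hard_read = content[np.argmax(weights)]         # discrete, non-differentiable read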
End-to-End Learning for Structured Prediction Energy Networks
Structured Prediction Energy Networks (SPENs) are a simple, yet expressive
family of structured prediction models (Belanger and McCallum, 2016). An energy
function over candidate structured outputs is given by a deep network, and
predictions are formed by gradient-based optimization. This paper presents
end-to-end learning for SPENs, where the energy function is discriminatively
trained by back-propagating through gradient-based prediction. In our
experience, the approach is substantially more accurate than the structured SVM
method of Belanger and McCallum (2016), as it allows us to use more
sophisticated non-convex energies. We provide a collection of techniques for
improving the speed, accuracy, and memory requirements of end-to-end SPENs, and
demonstrate the power of our method on 7-Scenes image denoising and CoNLL-2005
semantic role labeling tasks. In both, inexact minimization of non-convex SPEN
energies is superior to baseline methods that use simplistic energy functions
that can be minimized exactly. Comment: ICML 2017.
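A minimal sketch of the gradient-based prediction step in a SPEN, with a stand-in quadratic energy (end-to-end training would backpropagate a task loss through these unrolled steps, which is omitted here; the energy below is not the paper's architecture):

    import numpy as np

    def energy(x, y, W):
        # toy energy: quadratic coupling between input features and output variables
        return 0.5 * np.sum((y - W @ x) ** 2) + 0.1 * np.sum(y ** 2)

    def energy_grad_y(x, y, W):
        return (y - W @ x) + 0.2 * y

    def predict(x, W, steps=20, lr=0.5):
        y = np.zeros(W.shape[0])
        for _ in range(steps):           # unrolled gradient-based inference
            y = y - lr * energy_grad_y(x, y, W)
        return y

    W = np.random.randn(3, 5) * 0.1
    x = np.random.randn(5)
    print(predict(x, W))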
Trainable Time Warping: Aligning Time-Series in the Continuous-Time Domain
Dynamic time warping (DTW) calculates the similarity or alignment between two signals, subject to
temporal warping. However, its computational complexity grows exponentially
with the number of time-series. Although there have been algorithms developed
that are linear in the number of time-series, they are generally quadratic in
time-series length. The exception is generalized time warping (GTW), which has
linear computational cost. Yet, it can only identify simple time warping
functions. There is a need for a new fast, high-quality multisequence alignment
algorithm. We introduce trainable time warping (TTW), whose complexity is
linear in both the number and the length of time-series. TTW performs alignment
in the continuous-time domain using a sinc convolutional kernel and a
gradient-based optimization technique. We compare TTW and GTW on 85 UCR
datasets in time-series averaging and classification. TTW outperforms GTW on
67.1% of the datasets for the averaging tasks, and 61.2% of the datasets for
the classification tasks. Comment: ICASSP 2019.
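A small sketch of the sinc-interpolation primitive that evaluates a sampled signal at arbitrary (warped) continuous time points; in TTW the warping function itself would be the trainable part, which is omitted here, and the interface below is our assumption:

    import numpy as np

    def sinc_resample(signal, warped_times):
        n = np.arange(len(signal))
        # each warped time is a weighted sum of all samples with sinc weights
        weights = np.sinc(warped_times[:, None] - n[None, :])
        return weights @ signal

    t = np.arange(50, dtype=float)
    signal = np.sin(0.3 * t)
    warped = sinc_resample(signal, 0.9 * t + 1.7)   # evaluate at non-integer warped times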
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research on techniques that diminish noise in the stochastic
directions and methods that make use of second-order derivative approximations.
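For reference, the basic stochastic gradient (SG) update discussed in the survey, sketched on a toy least-squares objective (the problem instance and step size are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 20))
    w_true = rng.standard_normal(20)
    b = A @ w_true + 0.1 * rng.standard_normal(1000)

    def stochastic_grad(w, i):
        # gradient of the single-sample loss 0.5 * (a_i^T w - b_i)^2
        return (A[i] @ w - b[i]) * A[i]

    w = np.zeros(20)
    lr = 0.01
    for k in range(10_000):
        i = rng.integers(len(b))            # pick one example at random
        w -= lr * stochastic_grad(w, i)     # SG step

    print(np.linalg.norm(w - w_true))       # small, up to stochastic-gradient noise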
A gradient-type algorithm for constrained optimization with applications to multi-objective optimization of auxetic materials
An algorithm is devised for solving minimization problems with equality
constraints. The algorithm uses first-order derivatives of both the objective
function and the constraints. The step is computed as a sum between a
steepest-descent step (which minimizes the objective functional) and a
correction step related to the Newton method (which aims to solve the equality
constraints). The linear combination of these two steps involves
coefficients similar to Lagrange multipliers which are computed in a natural
way based on the Newton method. The algorithm uses no projection and thus the
iterates are not feasible; the constraints are satisfied only in the limit
(after convergence). This algorithm was proposed by one of the authors in a
previous paper. In the present paper, a local convergence result is proven for
a general non-linear setting, where both the objective functional and the
constraints are not necessarily convex functions. The algorithm is extended, by
means of an active set strategy, to account also for inequality constraints and
to address minimax problems. The method is then applied to the optimization of
periodic microstructures for obtaining homogenized elastic tensors having
negative Poisson ratio (so-called auxetic materials) using shape and/or
topology variations in the model hole. In previous works of the same authors,
anisotropic homogenized tensors have been obtained which exhibit negative
Poisson ratio in a prescribed direction of the plane. In the present work, a
new approach is proposed, that employs multi-objective optimization in order to
minimize the Poisson ratio of the (possibly anisotropic) homogenized elastic
tensor in several prescribed directions of the plane. Numerical examples are
presented. Comment: 32 pages, 7 figures.
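A hedged sketch of the kind of step described: a descent direction for the objective combined with a Newton-type correction toward the equality constraints, with multiplier-like coefficients obtained by least squares (this particular formula is a simplification for illustration, not necessarily the authors' rule):

    import numpy as np

    def step(x, gradJ, g, Jg, alpha=0.1):
        # gradJ: gradient of the objective at x
        # g:     vector of equality-constraint values at x
        # Jg:    Jacobian of the constraints at x (m x n)
        lam = np.linalg.lstsq(Jg.T, gradJ, rcond=None)[0]    # multiplier-like coefficients
        descent = -(gradJ - Jg.T @ lam)                      # descent in the constraint tangent
        newton = -np.linalg.lstsq(Jg, g, rcond=None)[0]      # Newton-type correction for g(x)=0
        return x + alpha * descent + newton

    # toy problem: minimize ||x||^2 subject to x0 + x1 - 1 = 0
    x = np.array([2.0, -1.0])
    for _ in range(50):
        gradJ = 2 * x
        g = np.array([x[0] + x[1] - 1.0])
        Jg = np.array([[1.0, 1.0]])
        x = step(x, gradJ, g, Jg)
    print(x)   # approaches (0.5, 0.5); intermediate iterates need not be feasible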