2,090 research outputs found
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research: techniques that diminish noise in the stochastic
directions, and methods that make use of second-order derivative approximations.
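A minimal sketch of the plain SG iteration the abstract centers on, applied to a noiseless least-squares toy problem; the problem, step size, and schedule here are illustrative assumptions, not the paper's case studies:

```python
import numpy as np

def sgd(grad_fn, w0, X, y, lr=0.05, epochs=30, seed=0):
    """Plain stochastic gradient: one uniformly sampled example per step."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w -= lr * grad_fn(w, X[i], y[i])
    return w

# Toy least-squares risk: F(w) = (1/n) sum_i 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                        # noiseless, so SG can interpolate
grad = lambda w, x, yi: (x @ w - yi) * x
w_hat = sgd(grad, np.zeros(3), X, y)
```

On interpolating (noiseless) problems like this one, a constant step size suffices; the noise-reduction techniques the abstract mentions matter precisely when the stochastic directions do not vanish at the optimum.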
Deep Learning for Passive Synthetic Aperture Radar
We introduce a deep learning (DL) framework for inverse problems in imaging,
and demonstrate the advantages and applicability of this approach in passive
synthetic aperture radar (SAR) image reconstruction. We interpret image
reconstruction as a machine learning task and utilize deep networks as forward and
inverse solvers for imaging. Specifically, we design a recurrent neural network
(RNN) architecture as an inverse solver based on the iterations of proximal
gradient descent optimization methods. We further adapt the RNN architecture to
image reconstruction problems by transforming the network into a recurrent
auto-encoder, thereby allowing for unsupervised training. Our DL based inverse
solver is particularly suitable for a class of image formation problems in
which the forward model is only partially known. The ability to learn forward
models and hyperparameters, combined with the unsupervised training approach,
makes our recurrent auto-encoder suitable for real-world applications. We
demonstrate the performance of our method in passive SAR image reconstruction.
In this regime a source of opportunity, with unknown location and transmitted
waveform, is used to illuminate a scene of interest. We investigate recurrent
auto-encoder architectures based on the ℓ1- and ℓ0-constrained least-squares
problems. We present a projected stochastic gradient descent based training
scheme which incorporates constraints of the unknown model parameters. We
demonstrate through extensive numerical simulations that our DL based approach
outperforms conventional sparse coding methods in terms of computation and
reconstructed image quality, specifically when no information about the
transmitter is available.
Comment: Submitted to IEEE Journal of Selected Topics in Signal Processing
An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks
Deep learning is formulated as a discrete-time optimal control problem. This
allows one to characterize necessary conditions for optimality and develop
training algorithms that do not rely on gradients with respect to the trainable
parameters. In particular, we introduce the discrete-time method of successive
approximations (MSA), which is based on Pontryagin's maximum principle, for
training neural networks. A rigorous error estimate for the discrete MSA is
obtained, which sheds light on its dynamics and the means to stabilize the
algorithm. The developed methods are applied to train, in a rather principled
way, neural networks with weights that are constrained to take values in a
discrete set. We obtain competitive performance and, interestingly, very sparse
weights in the case of ternary networks, which may be useful for model
deployment on low-memory devices.
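One discrete-time MSA sweep can be illustrated on an invented scalar system with ternary controls: a forward pass for the states, a backward pass for the costates, then per-step maximization of the Hamiltonian over the discrete set. This toy system is an assumption for illustration only, and a naive sweep like this can oscillate, which is the instability the paper's error estimate is designed to diagnose and stabilize:

```python
# Toy system: x_{t+1} = x_t + theta_t, with theta_t in {-1, 0, 1}
# and terminal cost (x_T - target)^2.  The Hamiltonian is
# H_t(x, p, theta) = p * (x + theta).

def msa_sweep(theta, x0, target):
    T = len(theta)
    # Forward pass: propagate the state under the current controls.
    x = [x0]
    for t in range(T):
        x.append(x[t] + theta[t])
    # Backward pass: p_T = -dPhi/dx(x_T); here df/dx = 1, so the
    # costate is constant over time.
    p = -2.0 * (x[T] - target)
    # Maximize the Hamiltonian independently at each step --
    # no gradient with respect to theta is ever needed.
    new_theta = [max((-1, 0, 1), key=lambda u: p * (x[t] + u))
                 for t in range(T)]
    return new_theta, x[T]

theta, xT = msa_sweep([0] * 5, x0=0.0, target=3.0)
```

Starting from all-zero controls, the terminal state is 0, the costate is positive, and the sweep switches every control to +1; iterating naive sweeps on this toy overshoots the target, motivating the stabilized (augmented) variants the paper develops.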
Maximum Principle Based Algorithms for Deep Learning
The continuous dynamical system approach to deep learning is explored in
order to devise alternative frameworks for training algorithms. Training is
recast as a control problem and this allows us to formulate necessary
optimality conditions in continuous time using Pontryagin's maximum
principle (PMP). A modification of the method of successive approximations is
then used to solve the PMP, giving rise to an alternative training algorithm
for deep learning. This approach has the advantage that rigorous error
estimates and convergence results can be established. We also show that it may
avoid some pitfalls of gradient-based methods, such as slow convergence on flat
landscapes near saddle points. Furthermore, we demonstrate that it obtains a
favorable initial per-iteration convergence rate, provided Hamiltonian
maximization can be carried out efficiently - a step which is still in need of
improvement. Overall, the approach opens up new avenues to attack problems
associated with deep learning, such as trapping in slow manifolds and
inapplicability of gradient-based methods for discrete trainable variables.
Comment: Published version
Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap
In this paper, we study the problem of \textit{constrained} and
\textit{stochastic} continuous submodular maximization. Even though the
objective function is not concave (nor convex) and is defined in terms of an
expectation, we develop a variant of the conditional gradient method, called
\alg, which achieves a \textit{tight} approximation guarantee. More precisely,
for a monotone and continuous DR-submodular function and subject to a
\textit{general} convex body constraint, we prove that \alg achieves a
$[(1-1/e)\,\text{OPT} - \epsilon]$ guarantee (in expectation) with
$\mathcal{O}(1/\epsilon^3)$ stochastic gradient computations. This guarantee
matches the known hardness results and closes the gap between deterministic and
stochastic continuous submodular maximization. By using stochastic continuous
optimization as an interface, we also provide the first tight
approximation guarantee for maximizing a \textit{monotone but stochastic}
submodular \textit{set} function subject to a general matroid constraint.
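A sketch in the spirit of the method described (a stochastic Frank-Wolfe / continuous-greedy loop with an averaged gradient estimate) on a small monotone DR-submodular toy objective; the objective, constraint body, and averaging schedule below are illustrative assumptions, since the abstract does not spell out \alg:

```python
import numpy as np

# Monotone DR-submodular toy objective on [0,1]^n (coverage-style):
#   F(x) = 1 - prod_i (1 - x_i),   dF/dx_i = prod_{j != i} (1 - x_j).
# Convex body: {x in [0,1]^n : sum_i x_i <= k}.

def grad_F(x):
    return np.array([np.prod(np.delete(1.0 - x, i)) for i in range(len(x))])

def lmo(d, k):
    # Linear maximization oracle: unit mass on the k largest coords of d.
    v = np.zeros_like(d)
    v[np.argsort(-d)[:k]] = 1.0
    return v

def stochastic_continuous_greedy(n=4, k=2, T=200, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    d = np.zeros(n)  # running average of stochastic gradients
    for t in range(1, T + 1):
        g = grad_F(x) + noise * rng.normal(size=n)  # stochastic gradient
        rho = 2.0 / (t + 3) ** (2.0 / 3)            # averaging weight
        d = (1.0 - rho) * d + rho * g
        x = x + lmo(d, k) / T                       # Frank-Wolfe step, size 1/T
    return x, 1.0 - np.prod(1.0 - x)

x_hat, val = stochastic_continuous_greedy()
```

The gradient averaging is what tames the stochasticity: the raw gradients are noisy, but the Frank-Wolfe direction is computed against their decaying-weight average, so after T steps the iterate lies in the constraint body with objective value comfortably above the (1-1/e) benchmark on this toy.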
DynaNewton - Accelerating Newton's Method for Machine Learning
Newton's method is a fundamental technique in optimization with quadratic
convergence within a neighborhood around the optimum. However, reaching this
neighborhood is often slow and dominates the computational costs. We exploit
two properties specific to empirical risk minimization problems to accelerate
Newton's method, namely, subsampling training data and increasing strong
convexity through regularization. We propose a novel continuation method, where
we define a family of objectives over increasing sample sizes and with
decreasing regularization strength. Solutions on this path are tracked such
that the minimizer of the previous objective is guaranteed to be within the
quadratic convergence region of the next objective to be optimized. Thereby
every Newton iteration is guaranteed to achieve super-linear contractions with
regard to the chosen objective, which becomes a moving target. We provide a
theoretical analysis that motivates our algorithm, called DynaNewton, and
characterizes its speed of convergence. Experiments on a wide range of data
sets and problems consistently confirm the predicted computational savings.
Gaussian Robust Classification
Supervised learning is all about the ability to generalize knowledge.
Specifically, the goal of the learning is to train a classifier using training
data, in such a way that it will be capable of classifying new unseen data
correctly. In order to achieve this goal, it is important to carefully design
the learner so that it will not overfit the training data. The latter is
usually done by adding a regularization term. The statistical learning theory
explains the success of this method by claiming that it restricts the
complexity of the learned model. This explanation, however, is rather abstract
and does not have a geometric intuition. The generalization error of a
classifier may be thought of as correlated with its robustness to perturbations
of the data: a classifier that copes with disturbance is expected to generalize
well. Indeed, Xu et al. [2009] have shown that the SVM formulation is
equivalent to a robust optimization (RO) formulation, in which an adversary
displaces the training and testing points within a ball of pre-determined
radius. In this work we explore a different kind of robustness, namely changing
each data point with a Gaussian cloud centered at the sample. Loss is evaluated
as the expectation of an underlying loss function on the cloud. This setup fits
the fact that in many applications, the data is sampled along with noise. We
develop an RO framework, in which the adversary chooses the covariance of the
noise. In our algorithm named GURU, the tuning parameter is a spectral bound on
the noise, so it can be estimated from physical or application-specific
considerations. Our experiments show that this framework performs as well as
SVM and even slightly better in some cases. Generalizations for Mercer kernels
and for the multiclass case are presented as well. We also show that our
framework may be further generalized, using the technique of convex perspective
functions.
Comment: Master's dissertation of the first author, carried out under the
supervision of the second author
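For a hinge loss, isotropic clouds, and a spectral bound sigma^2 on the noise, the expectation over the cloud has a closed form (since the expected loss grows with the noise scale, the worst covariance under the bound saturates it along w). The training loop below is only an illustrative sketch under those assumptions, not the GURU implementation, and omits the bias term and kernel/multiclass extensions:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi(u):  # standard normal CDF
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def norm_pdf(u):  # standard normal density
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)

def smoothed_hinge(w, X, y, sigma):
    """Expected hinge loss with each x_i replaced by the cloud
    x_i + N(0, sigma^2 I).  With m_i = 1 - y_i <w, x_i> and
    s = sigma * ||w||:  E[max(0, m + s*Z)] = m*Phi(m/s) + s*pdf(m/s),
    whose derivatives are dE/dm = Phi(m/s) and dE/ds = pdf(m/s)."""
    s = sigma * np.linalg.norm(w) + 1e-12
    loss, grad = 0.0, np.zeros_like(w)
    for xi, yi in zip(X, y):
        m = 1.0 - yi * (w @ xi)
        u = m / s
        loss += m * Phi(u) + s * norm_pdf(u)
        grad += Phi(u) * (-yi * xi) + norm_pdf(u) * (sigma ** 2) * w / s
    return loss / len(y), grad / len(y)

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(size=(n, 2)) + 2.0, rng.normal(size=(n, 2)) - 2.0])
y = np.r_[np.ones(n), -np.ones(n)]
w = np.array([0.1, 0.1])
for _ in range(300):
    _, g = smoothed_hinge(w, X, y, sigma=0.5)
    w -= 0.5 * g
acc = np.mean(np.sign(X @ w) == y)
```

Note how the s-dependent term acts as a data-driven regularizer: as the margin grows, the remaining gradient points back along w, so the noise bound sigma plays the role the abstract assigns to the tuning parameter.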
Identifying global optimality for dictionary learning
Learning new representations of input observations in machine learning is
often tackled using a factorization of the data. For many such problems,
including sparse coding and matrix completion, learning these factorizations
can be difficult, both in terms of efficiency and of guaranteeing that the
solution is a global minimum. Recently, a general class of objectives has been
introduced, which we term induced dictionary learning models (DLMs), that has
an induced convex form enabling global optimization. Though attractive
theoretically, this induced form is impractical, particularly for large or
growing datasets. In this work, we investigate the use of practical alternating
minimization algorithms for induced DLMs, that ensure convergence to global
optima. We characterize the stationary points of these models, and, using these
insights, highlight practical choices for the objectives. We then provide
theoretical and empirical evidence that alternating minimization, from a random
initialization, converges to global minima for a large subclass of induced
DLMs. In particular, we take advantage of the existence of the (potentially
unknown) convex induced form, to identify when stationary points are global
minima for the dictionary learning objective. We then provide an empirical
investigation into practical optimization choices for using alternating
minimization for induced DLMs, for both batch and stochastic gradient descent.
Comment: Updates to the previous version include a small modification to
Proposition 2, to only use normed regularizers, and a modification to the
main theorem (previously Theorem 13) to focus on the overcomplete, full rank
setting and to better characterize non-differentiable induced regularizers.
The theory has been significantly modified since version
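A minimal alternating-minimization loop for the sparse-coding instance (ISTA for the codes with the dictionary fixed, then least squares plus column normalization for the dictionary); this is a generic sketch, not the induced-DLM objectives or the global-optimality certificates studied in the paper:

```python
import numpy as np

def soft(z, t):
    # Soft-thresholding: proximal operator of the l1 norm.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def alt_min_dictionary(Y, k, lam=0.05, outer=30, inner=50, seed=0):
    """Alternating minimization for 0.5*||Y - D C||_F^2 + lam*||C||_1:
    update codes C by ISTA with D fixed, then update D by a
    ridge-stabilized least squares with C fixed, renormalizing
    dictionary columns to unit norm."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.normal(size=(d, k))
    D /= np.linalg.norm(D, axis=0)
    C = np.zeros((k, n))
    for _ in range(outer):
        step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
        for _ in range(inner):
            C = soft(C - step * D.T @ (D @ C - Y), step * lam)
        D = np.linalg.solve(C @ C.T + 1e-8 * np.eye(k), C @ Y.T).T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
    return D, C

rng = np.random.default_rng(1)
D_true = rng.normal(size=(10, 5))
D_true /= np.linalg.norm(D_true, axis=0)
C_true = soft(rng.normal(size=(5, 200)), 1.0)  # sparse ground-truth codes
Y = D_true @ C_true
D, C = alt_min_dictionary(Y, k=5)
rel_err = np.linalg.norm(Y - D @ C) / np.linalg.norm(Y)
```

Each alternation decreases the (nonconvex) joint objective; the point of the paper is that for induced DLMs the hidden convex form lets one certify when such stationary points are in fact global minima.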
User-Centric Joint Access-Backhaul Design for Full-Duplex Self-Backhauled Wireless Networks
Full-duplex self-backhauling is promising to provide cost-effective and
flexible backhaul connectivity for ultra-dense wireless networks, but also
poses a great challenge to resource management between the access and backhaul
links. In this paper, we propose a user-centric joint access-backhaul
transmission framework for full-duplex self-backhauled wireless networks. In
the access link, user-centric clustering is adopted so that each user is
cooperatively served by multiple small base stations (SBSs). In the backhaul
link, user-centric multicast transmission is proposed so that each user's
message is treated as a common message and multicast to its serving SBS
cluster. We first formulate an optimization problem to maximize the network
weighted sum rate through joint access-backhaul beamforming and SBS clustering
when global channel state information (CSI) is available. This problem is
efficiently solved via the successive lower-bound maximization approach with a
novel approximate objective function and the iterative link removal technique.
We then extend the study to the stochastic joint access-backhaul beamforming
optimization with partial CSI. Simulation results demonstrate the effectiveness
of the proposed algorithms for both full CSI and partial CSI scenarios. They
also show that the transmission design with partial CSI can greatly reduce the
CSI overhead with little performance degradation.
Comment: to appear in IEEE Trans. on Communications
Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach
Support vector machines (SVMs) are an important tool in modern data analysis.
Traditionally, support vector machines have been fitted via quadratic
programming, either using purpose-built or off-the-shelf algorithms. We present
an alternative approach to SVM fitting via the majorization--minimization (MM)
paradigm. Algorithms derived via MM constructions can be shown to
monotonically decrease their objectives at each iteration, as well as to be
globally convergent to stationary points. We demonstrate the construction of
iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm,
for SVM risk minimization problems involving the hinge, least-squares,
squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net
penalizations. Successful implementations of our algorithms are presented via
some numerical examples.
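The MM/IRLS construction can be made concrete for the hinge loss with a 2-norm penalty: writing hinge(u) = (|1-u| + (1-u))/2 and applying the standard quadratic majorizer of |v|, each MM step becomes a weighted ridge least-squares solve. The data and parameter choices below are a toy illustration, not the paper's experiments:

```python
import numpy as np

def irls_svm(X, y, lam=0.01, iters=50, eps_min=1e-6):
    """IRLS via MM for  min_w  sum_i hinge(y_i w.x_i) + lam*||w||^2.
    With v_i = 1 - y_i w.x_i, majorizing |v| <= v^2/(2*|v_k|) + |v_k|/2
    turns each MM step into a weighted ridge regression, so the true
    objective is (numerically) non-increasing across iterations."""
    n, d = X.shape
    Z = X * y[:, None]                # z_i = y_i x_i
    w = np.zeros(d)
    obj = []
    for _ in range(iters):
        eps = np.maximum(np.abs(1.0 - Z @ w), eps_min)
        c = 1.0 / (2.0 * eps)         # IRLS weights
        t = 1.0 + eps                 # working responses
        A = (Z.T * c) @ Z + 2.0 * lam * np.eye(d)
        w = np.linalg.solve(A, Z.T @ (c * t))
        obj.append(np.sum(np.maximum(0.0, 1.0 - Z @ w)) + lam * w @ w)
    return w, obj

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(size=(n, 2)) + 1.5, rng.normal(size=(n, 2)) - 1.5])
y = np.r_[np.ones(n), -np.ones(n)]
w_hat, obj = irls_svm(X, y)
acc = np.mean(np.sign(X @ w_hat) == y)
```

The clamp `eps_min` keeps the weights finite when a point sits exactly on the margin; it relaxes the majorizer's tightness by at most eps_min/2 per sample, which is why the descent guarantee holds only up to a tiny tolerance.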