Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Due to their simplicity and excellent performance, parallel asynchronous
variants of stochastic gradient descent have become popular methods to solve a
wide range of large-scale optimization problems on multi-core architectures.
Yet, despite their practical success, support for nonsmooth objectives is still
lacking, making them unsuitable for many problems of interest in machine
learning, such as the Lasso, group Lasso or empirical risk minimization with
convex constraints.
In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse
method inspired by SAGA, a variance reduced incremental gradient algorithm. The
proposed method is easy to implement and significantly outperforms the state of
the art on several nonsmooth, large-scale problems. We prove that our method
achieves a theoretical linear speedup with respect to the sequential version
under assumptions on the sparsity of gradients and block-separability of the
proximal term. Empirical benchmarks on a multi-core architecture illustrate
practical speedups of up to 12x on a 20-core machine.
Comment: Appears in Advances in Neural Information Processing Systems 30 (NIPS 2017), 28 pages.
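For orientation, the sketch below shows the sequential proximal SAGA update
that the abstract builds on, in plain Python. The names and signatures are
ours, and the asynchronous, sparsity-aware machinery that gives ProxASAGA its
parallel speedup is deliberately omitted.

```python
import numpy as np

def prox_saga(grad_i, prox, x0, n, step, iters, seed=0):
    """Sequential proximal SAGA sketch (illustrative; not the authors' code).

    grad_i(i, x): gradient of the i-th component function at x.
    prox(v, t): proximal operator of the nonsmooth regularizer with step t.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    memory = np.array([grad_i(i, x0) for i in range(n)])  # per-sample gradient table
    avg = memory.mean(axis=0)                             # running average of the table
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_i(i, x)
        # Variance-reduced direction, then a proximal step on the regularizer.
        x = prox(x - step * (g - memory[i] + avg), step)
        avg += (g - memory[i]) / n
        memory[i] = g
    return x
```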
Stochastic Training of Neural Networks via Successive Convex Approximations
This paper proposes a new family of algorithms for training neural networks
(NNs). These are based on recent developments in the field of non-convex
optimization, going under the general name of successive convex approximation
(SCA) techniques. The basic idea is to iteratively replace the original
(non-convex, high-dimensional) learning problem with a sequence of (strongly
convex) approximations, which are both accurate and simple to optimize.
Unlike similar ideas (e.g., quasi-Newton algorithms), the
approximations can be constructed using only first-order information of the
neural network function, in a stochastic fashion, while exploiting the overall
structure of the learning problem for a faster convergence. We discuss several
use cases, based on different choices for the loss function (e.g., squared loss
and cross-entropy loss), and for the regularization of the NN's weights. We
experiment on several medium-sized benchmark problems, and on a large-scale
dataset involving simulated physical data. The results show how the algorithm
outperforms state-of-the-art techniques, providing faster convergence to a
better minimum. Additionally, we show how the algorithm can be easily
parallelized over multiple computational units without hindering its
performance. In particular, each computational unit can optimize a tailored
surrogate function defined on a randomly assigned subset of the input
variables, whose dimension can be selected depending entirely on the available
computational power.
Comment: Preprint submitted to IEEE Transactions on Neural Networks and Learning Systems.
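As a concrete instance of the idea, here is a minimal sketch of one SCA loop
using the simplest strongly convex surrogate (linearized loss plus a quadratic
proximity term). The function names, the surrogate choice, and the step-size
rule are our assumptions, not the paper's exact algorithm.

```python
import numpy as np

def sca_minimize(grad, x0, rho=1.0, iters=100, prox=lambda v, t: v):
    """Successive convex approximation sketch (illustrative assumptions).

    Surrogate at x: f(x) + grad(x).(u - x) + (rho/2)*||u - x||^2 + r(u);
    its minimizer is a prox-gradient point, and gamma is a diminishing step.
    """
    x = x0.copy()
    for k in range(iters):
        x_hat = prox(x - grad(x) / rho, 1.0 / rho)  # minimize the surrogate
        gamma = 1.0 / (k + 2)                       # diminishing step size
        x = x + gamma * (x_hat - x)                 # move toward the minimizer
    return x
```

Richer surrogates, e.g., keeping a convex part of the loss exact rather than
linearizing it, drop into the same template.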
Distributed Big-Data Optimization via Block-Iterative Convexification and Averaging
In this paper, we study distributed big-data nonconvex optimization in
multi-agent networks. We consider the (constrained) minimization of the sum of
a smooth (possibly) nonconvex function, i.e., the agents' sum-utility, plus a
convex (possibly) nonsmooth regularizer. Our interest is in big-data problems
wherein there is a large number of variables to optimize. If treated by means
of standard distributed optimization algorithms, these large-scale problems may
be intractable, due to the prohibitive local computation and communication
burden at each node. We propose a novel distributed solution method whereby at
each iteration agents optimize and then communicate (in an uncoordinated
fashion) only a subset of their decision variables. To deal with non-convexity
of the cost function, the novel scheme hinges on Successive Convex
Approximation (SCA) techniques coupled with i) a tracking mechanism
instrumental to locally estimate gradient averages; and ii) a novel block-wise
consensus-based protocol to perform local block-averaging operations and
gradient tracking. Asymptotic convergence to stationary solutions of the
nonconvex problem is established. Finally, numerical results show the
effectiveness of the proposed algorithm and highlight how the block dimension
affects the communication overhead and practical convergence speed.
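A rough sketch of one round combining the scheme's two ingredients, under our
own simplifications: a single coordinate stands in for a block, and the whole
tracker is mixed rather than one block at a time.

```python
import numpy as np

def block_sca_round(X, Y, grads, W, rho, gamma, rng):
    """One round of block-wise SCA with gradient tracking (illustrative).

    X[i]: agent i's local copy of the decision vector; Y[i]: its gradient
    tracker; grads[i](x): agent i's local gradient; W: doubly stochastic
    mixing matrix of the communication graph.
    """
    n, d = X.shape
    old = np.array([grads[i](X[i]) for i in range(n)])
    b = rng.integers(d)                        # randomly drawn block
    for i in range(n):
        x_hat = X[i, b] - Y[i, b] / rho        # minimizer of the local surrogate
        X[i, b] += gamma * (x_hat - X[i, b])   # local SCA step on block b only
    X[:, b] = W @ X[:, b]                      # block-wise consensus averaging
    new = np.array([grads[i](X[i]) for i in range(n)])
    Y = W @ Y + (new - old)                    # gradient-tracking correction
    return X, Y
```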
Adaptation and learning over networks for nonlinear system modeling
In this chapter, we analyze nonlinear filtering problems in distributed
environments, e.g., sensor networks or peer-to-peer protocols. In these
scenarios, the agents in the environment receive measurements in a streaming
fashion, and they are required to estimate a common (nonlinear) model by
alternating local computations and communications with their neighbors. We
focus on the important distinction between single-task problems, where the
underlying model is common to all agents, and multitask problems, where each
agent might converge to a different model due to, e.g., spatial dependencies or
other factors. Currently, most of the literature on distributed learning in the
nonlinear case has focused on the single-task case, which may be a strong
limitation in real-world scenarios. After introducing the problem and reviewing
the existing approaches, we describe a simple kernel-based algorithm tailored
for the multitask case. We evaluate the proposal on a simulated benchmark task,
and we conclude by detailing currently open problems and lines of research.
Comment: To be published as a chapter in 'Adaptive Learning Methods for Nonlinear System Modeling', Elsevier Publishing, Eds. D. Comminiello and J.C. Principe (2018).
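For flavor, here is a standard single-task adapt-then-combine diffusion step
with a fixed nonlinear feature map, one common way to make kernel methods
distributable; the chapter's multitask algorithm modifies the combine stage,
and everything named here is our illustration rather than its exact form.

```python
import numpy as np

def diffusion_kernel_step(thetas, samples, feats, W, mu):
    """One adapt-then-combine step for distributed nonlinear filtering.

    thetas: (n, D) per-agent weights on a fixed feature map feats (e.g.,
    random Fourier features approximating a kernel); samples[i] = (x, d) is
    agent i's streaming input/desired-output pair; W: (n, n) combination
    matrix over the network. Single-task sketch, illustrative only.
    """
    psi = thetas.copy()
    for i, (x, d) in enumerate(samples):
        phi = feats(x)                       # nonlinear features of the input
        err = d - thetas[i] @ phi            # local prediction error
        psi[i] = thetas[i] + mu * err * phi  # adapt: kernel-LMS update
    return W @ psi                           # combine: mix with neighbors
```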
An Accelerated Doubly Stochastic Gradient Method with Faster Explicit Model Identification
Sparsity regularized loss minimization problems play an important role in
various fields including machine learning, data mining, and modern statistics.
The proximal gradient method and the coordinate descent method are the most
popular approaches to solving such problems. Although existing methods achieve
implicit model identification, i.e., support-set identification, in a finite
number of iterations, they still incur large computational and memory costs in
high-dimensional scenarios. The reason is that the support-set identification
in these methods is implicit and therefore cannot expose the low-complexity
structure in practice; that is, they cannot discard the coefficients of
inactive features to accelerate the algorithm via dimension reduction. To
address this challenge, we propose a novel accelerated doubly stochastic
gradient descent (ADSGD) method for sparsity regularized loss minimization
problems, which reduces the number of block iterations by eliminating inactive
coefficients during optimization, yielding faster explicit model
identification and improved efficiency. Theoretically, we first
prove that ADSGD can achieve a linear convergence rate and lower overall
computational complexity. More importantly, we prove that ADSGD can achieve a
linear rate of explicit model identification. Numerically, experimental results
on benchmark datasets confirm the efficiency of our proposed method.
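To make the explicit-identification idea concrete, the sketch below prunes
coordinates of a lasso-style problem as soon as soft-thresholding zeroes
them, shrinking every later pass. This greedy rule is our illustration and
lacks the guarantees the paper proves for ADSGD.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def prox_cd_screening(grad_j, x0, lam, step, epochs, seed=0):
    """Proximal coordinate descent with explicit coordinate pruning (sketch).

    grad_j(j, x): partial derivative of the smooth loss at coordinate j;
    lam: l1 regularization weight. Not the ADSGD algorithm itself.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    active = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(active)
        for j in list(active):
            x[j] = soft_threshold(x[j] - step * grad_j(j, x), step * lam)
            if x[j] == 0.0:
                active.remove(j)   # heuristic: drop the feature for good
    return x
```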
Hybrid-Parallel Parameter Estimation for Frequentist and Bayesian Models
Distributed algorithms in machine learning follow two main flavors: horizontal partitioning, where the data is distributed across multiple slaves, and vertical partitioning, where the model parameters are partitioned across multiple machines. The main drawback of the former strategy is that the model parameters need to be replicated on every machine, which is problematic when the number of parameters is very large and cannot fit on a single machine. The drawback of the latter strategy is that the data needs to be replicated on each machine, which fails to scale to massive datasets.

The goal of this thesis is to achieve the best of both worlds by partitioning both the data and the model parameters, thus enabling the training of more sophisticated models on massive datasets. To do so, we exploit a structure observed in several machine learning models, which we term Double-Separability: the objective function of the model decomposes into sub-functions that can be evaluated independently. For distributed machine learning, this implies that both data and model parameters can be partitioned across machines, and stochastic parameter updates can be carried out independently and without any locking. Furthermore, double-separability naturally lends itself to efficient asynchronous algorithms in which computation and communication happen in parallel, offering further speedup.

Some machine learning models, such as Matrix Factorization, directly exhibit double-separability in their objective function, but the majority do not. My work explores techniques to reformulate the objective function of such models to cast them into double-separable form; often this involves introducing additional auxiliary variables that have natural interpretations. In this direction, I have developed Hybrid Parallel algorithms for machine learning tasks that include Latent Collaborative Retrieval, Multinomial Logistic Regression, Variational Inference for Mixtures of Exponential Families, and Factorization Machines. The software resulting from this work is available for public use under an open-source license.
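Matrix factorization, cited above as directly double-separable, makes the
property easy to see: each loss term touches one row of U and one row of V,
so terms over disjoint rows and columns can be updated without locking. A
minimal sequential sketch, in our own notation:

```python
import numpy as np

def mf_sgd_epoch(U, V, ratings, lr, reg, rng):
    """One SGD epoch for matrix factorization, the canonical doubly
    separable objective: sum over observed (i, j) of (r_ij - u_i.v_j)^2
    plus regularization. ratings: list of (i, j, r) triples; rng: a
    np.random.Generator. Illustrative sketch; updates run sequentially,
    though disjoint (i, j) pairs could run in parallel without locking.
    """
    for k in rng.permutation(len(ratings)):
        i, j, r = ratings[k]
        err = r - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])  # touches only row i of U
        V[j] += lr * (err * U[i] - reg * V[j])  # touches only row j of V
    return U, V
```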