20 research outputs found

    Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss

    We consider distributed convex optimization problems originating from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, constructed with i.i.d. data sampled from a common distribution. We propose a communication-efficient distributed algorithm to minimize the overall empirical loss, which is the average of the local empirical losses. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for distributed ridge regression, logistic regression and binary classification with a smoothed hinge loss. In a standard setting for supervised learning, the required number of communication rounds of the algorithm does not increase with the sample size, and only grows slowly with the number of machines.
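The inexact damped Newton step mentioned in this abstract can be illustrated with a minimal single-machine sketch. It assumes a ridge regression loss, plain (unpreconditioned) conjugate gradient, and the textbook damped step `w - v / (1 + delta)` with `delta = sqrt(v^T H v)`; the function names are hypothetical and this is not the paper's distributed algorithm, where the Hessian-vector products would be aggregated across machines.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gradient(X, y, w, lam):
    """Gradient of (1/2n)||Xw - y||^2 + (lam/2)||w||^2."""
    n = len(X)
    g = [lam * wj for wj in w]
    for row, yi in zip(X, y):
        r = dot(row, w) - yi
        for j, xij in enumerate(row):
            g[j] += xij * r / n
    return g

def hess_vec(X, v, lam):
    """Hessian-vector product (1/n) X^T (X v) + lam * v, without forming H."""
    n = len(X)
    out = [lam * vj for vj in v]
    for row in X:
        s = dot(row, v)
        for j, xij in enumerate(row):
            out[j] += xij * s / n
    return out

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=50):
    """Approximately solve matvec(x) = b for a positive definite operator."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) <= tol:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def damped_newton(X, y, lam, iters=30):
    """Damped Newton iterations with the Newton system solved by CG."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(X, y, w, lam)
        mv = lambda v: hess_vec(X, v, lam)
        v = conjugate_gradient(mv, g)           # inexact Newton direction
        delta = math.sqrt(max(dot(v, mv(v)), 0.0))  # approximate Newton decrement
        w = [wj - vj / (1.0 + delta) for wj, vj in zip(w, v)]
    return w
```

The damping factor `1 / (1 + delta)` shrinks the step when the Newton decrement is large and approaches a full Newton step near the minimizer.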

    Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?

    In this paper we study how the way the data is partitioned affects distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features and show its efficiency in both theory and practice.

    Distributed Inexact Damped Newton Method: Data Partitioning and Load-Balancing

    In this paper we study the inexact damped Newton method implemented in a distributed environment. We start with the original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015]. We show that this algorithm may not scale well and propose algorithmic modifications that lead to less communication, better load-balancing and more efficient computation. We perform numerical experiments with a regularized empirical loss minimization instance described by a 273GB dataset.

    Federated Optimization: Distributed Optimization Beyond the Datacenter

    We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of utmost importance. A motivating example for federated optimization arises when we keep the training data locally on users' mobile devices rather than logging it to a data center for training. Instead, the mobile devices are used as nodes performing computation on their local data in order to update a global model. We suppose that we have an extremely large number of devices in our network, each of which has only a tiny fraction of the total data available; in particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, we assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results. This work also sets a path for future research needed in the context of federated optimization. Comment: NIPS workshop version

    An Accelerated Communication-Efficient Primal-Dual Optimization Framework for Structured Machine Learning

    Distributed optimization algorithms are essential for training machine learning models on very large-scale datasets. However, they often suffer from communication bottlenecks. Confronting this issue, a communication-efficient primal-dual coordinate ascent framework (CoCoA) and its improved variant CoCoA+ have been proposed, achieving a convergence rate of $\mathcal{O}(1/t)$ for solving empirical risk minimization problems with Lipschitz continuous losses. In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of $\mathcal{O}(1/t^2)$ in terms of reducing suboptimality. The analysis of this rate is also notable in that the convergence rate bounds involve constants that, except in extreme cases, are significantly reduced compared to those previously provided for CoCoA+. The results of numerical experiments are provided to show that acceleration can lead to significant performance gains.

    Scaling Up Quasi-Newton Algorithms: Communication Efficient Distributed SR1

    In this paper, we present a scalable distributed implementation of the Sampled Limited-memory Symmetric Rank-1 (S-LSR1) algorithm. First, we show that a naive distributed implementation of S-LSR1 requires multiple rounds of expensive communications at every iteration and thus is inefficient. We then propose DS-LSR1, a communication-efficient variant that: (i) drastically reduces the amount of data communicated at every iteration, (ii) has favorable work-load balancing across nodes, and (iii) is matrix-free and inverse-free. The proposed method scales well in terms of both the dimension of the problem and the number of data points. Finally, we illustrate the empirical performance of DS-LSR1 on a standard neural network training task. Comment: 24 pages, 14 figures, 6 tables
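For readers unfamiliar with the symmetric rank-1 idea behind this abstract, here is a minimal sketch of the classical dense SR1 update of a Hessian approximation `B` from a step `s` and gradient difference `y`. It is only the textbook formula with the usual skip safeguard; the sampled, limited-memory and distributed machinery of S-LSR1 and DS-LSR1 is not shown, and the function name and threshold `r` are illustrative assumptions.

```python
def sr1_update(B, s, y, r=1e-8):
    """Return B + u u^T / (u^T s) with u = y - B s, or B unchanged if unsafe."""
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    u = [yi - bi for yi, bi in zip(y, Bs)]
    denom = sum(ui * si for ui, si in zip(u, s))
    norm_s = sum(si * si for si in s) ** 0.5
    norm_u = sum(ui * ui for ui in u) ** 0.5
    # Skip the update when the denominator is zero or tiny relative to the vectors.
    if denom == 0.0 or abs(denom) < r * norm_s * norm_u:
        return [row[:] for row in B]
    return [[B[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]
```

After an accepted update the new matrix satisfies the secant equation `B_new s = y`, which is the defining property of the method.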

    Randomized block proximal damped Newton method for composite self-concordant minimization

    In this paper we consider the composite self-concordant (CSC) minimization problem, which minimizes the sum of a self-concordant function $f$ and a (possibly nonsmooth) proper closed convex function $g$. The CSC minimization is the cornerstone of the path-following interior point methods for solving a broad class of convex optimization problems. It has also found numerous applications in machine learning. Proximal damped Newton (PDN) methods for this problem have been well studied in the literature and enjoy a favorable iteration complexity. Given that at each iteration these methods typically require evaluating or accessing the Hessian of $f$ and also need to solve a proximal Newton subproblem, the cost per iteration can be prohibitively high when applied to large-scale problems. Inspired by the recent success of block coordinate descent methods, we propose a randomized block proximal damped Newton (RBPDN) method for solving the CSC minimization. Compared to the PDN methods, the computational cost per iteration of RBPDN is usually significantly lower. Computational experiments on a class of regularized logistic regression problems demonstrate that RBPDN is indeed promising in solving large-scale CSC minimization problems. The convergence of RBPDN is also analyzed in the paper. In particular, we show that RBPDN is globally convergent when $g$ is Lipschitz continuous. It is also shown that RBPDN enjoys a local linear convergence. Moreover, we show that for a class of $g$ including the case where $g$ is Lipschitz differentiable, RBPDN enjoys a global linear convergence. As a striking consequence, it shows that the classical damped Newton methods [22,40] and the PDN [31] for such $g$ are globally linearly convergent, which was previously unknown in the literature. Moreover, this result can be used to sharpen the existing iteration complexity of these methods. Comment: 29 pages

    Communication Complexity of Distributed Convex Learning and Optimization

    We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered. We identify cases where existing algorithms are already worst-case optimal, as well as cases where room for further improvement is still possible. Among other things, our results indicate that without similarity between the local objective functions (due to statistical data similarity or otherwise) many communication rounds may be required, even if the machines have unbounded computational power.

    Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

    We study distributed optimization algorithms for minimizing the average of convex functions. The applications include empirical risk minimization problems in statistical machine learning where the datasets are large and have to be stored on different machines. We design a distributed stochastic variance reduced gradient algorithm that, under certain conditions on the condition number, simultaneously achieves the optimal parallel runtime, amount of communication and rounds of communication among all distributed first-order methods up to constant factors. Our method and its accelerated extension also outperform existing distributed algorithms in terms of the rounds of communication as long as the condition number is not too large compared to the size of data in each machine. We also prove a lower bound for the number of rounds of communication for a broad class of distributed first-order methods including the proposed algorithms in this paper. We show that our accelerated distributed stochastic variance reduced gradient algorithm achieves this lower bound so that it uses the fewest rounds of communication among all distributed first-order algorithms. Comment: significant addition to both theory and experimental results
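The variance-reduced gradient step that this abstract builds on can be sketched on a single machine as follows. This assumes the standard SVRG update (a snapshot point, its full gradient, and a corrected stochastic gradient) applied to a small ridge regression problem; it is not the distributed algorithm or its accelerated extension, and the function names, step size and loop lengths are illustrative choices.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def grad_i(x, y, w, lam):
    """Gradient of the single-sample loss 0.5*(x.w - y)^2 + (lam/2)||w||^2."""
    r = dot(x, w) - y
    return [r * xj + lam * wj for xj, wj in zip(x, w)]

def full_grad(X, Y, w, lam):
    """Average of the per-sample gradients."""
    n = len(X)
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        gi = grad_i(x, y, w, lam)
        g = [a + b / n for a, b in zip(g, gi)]
    return g

def svrg(X, Y, lam, step, epochs, inner_steps, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        snapshot = list(w)                      # reference point for this epoch
        mu = full_grad(X, Y, snapshot, lam)     # one full pass over the data
        for _ in range(inner_steps):
            i = rng.randrange(len(X))
            g_w = grad_i(X[i], Y[i], w, lam)
            g_s = grad_i(X[i], Y[i], snapshot, lam)
            # Variance-reduced direction: unbiased, and its noise vanishes
            # as both w and the snapshot approach the minimizer.
            w = [wj - step * (a - b + m)
                 for wj, a, b, m in zip(w, g_w, g_s, mu)]
    return w
```

In a distributed setting the full-gradient pass is the natural place for a communication round, which is why the number of snapshot epochs is the quantity of interest.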

    Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization

    Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled \emph{with} replacement. In practice, however, sampling \emph{without} replacement is very common, easier to implement in many cases, and often performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling, under various scenarios, for three types of algorithms: any algorithm with online regret guarantees, stochastic gradient descent, and SVRG. A useful application of our SVRG analysis is a nearly-optimal algorithm for regularized least squares in a distributed setting, in terms of both communication complexity and runtime complexity, when the data is randomly partitioned and the condition number can be as large as the data size per machine (up to logarithmic factors). Our proof techniques combine ideas from stochastic optimization, adversarial online learning, and transductive learning theory, and can potentially be applied to other stochastic optimization and learning problems. Comment: Fixed a few minor typos, and slightly tightened Corollary
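The distinction between the two sampling schemes in this abstract can be made concrete with a small sketch of how one pass of sample indices might be generated in each case. The function names are hypothetical and the code only illustrates the index order, not any of the paper's analysis.

```python
import random

def with_replacement_order(n, rng):
    """n independent uniform draws: indices may repeat or be missed."""
    return [rng.randrange(n) for _ in range(n)]

def without_replacement_order(n, rng):
    """A random permutation: every index appears exactly once per pass."""
    order = list(range(n))
    rng.shuffle(order)
    return order
```

A stochastic gradient loop would iterate over one of these lists per pass; the without-replacement version is the usual shuffle-then-sweep pattern.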