Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss
We consider distributed convex optimization problems originating from sample
average approximation of stochastic optimization, or empirical risk
minimization in machine learning. We assume that each machine in the
distributed computing system has access to a local empirical loss function,
constructed with i.i.d. data sampled from a common distribution. We propose a
communication-efficient distributed algorithm to minimize the overall empirical
loss, which is the average of the local empirical losses. The algorithm is
based on an inexact damped Newton method, where the inexact Newton steps are
computed by a distributed preconditioned conjugate gradient method. We analyze
its iteration complexity and communication efficiency for minimizing
self-concordant empirical loss functions, and discuss the results for
distributed ridge regression, logistic regression and binary classification
with a smoothed hinge loss. In a standard setting for supervised learning, the
required number of communication rounds of the algorithm does not increase with
the sample size, and only grows slowly with the number of machines.
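For concreteness, here is a minimal single-machine sketch of an inexact damped Newton step of the kind described above, with the Newton direction computed by conjugate gradient. The ridge-regularized logistic regression objective, function names, and loop counts are illustrative assumptions, not taken from the paper, and the distributed preconditioning is omitted.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def inexact_damped_newton_step(w, grad_fn, hess_vec_fn):
    """One inexact damped Newton step: solve H v = g approximately by CG,
    then damp the step by 1 / (1 + approximate Newton decrement)."""
    g = grad_fn(w)
    H = LinearOperator((w.size, w.size), matvec=lambda v: hess_vec_fn(w, v))
    v, _ = cg(H, g, maxiter=50)                 # inexact Newton direction
    delta = np.sqrt(max(g @ v, 0.0))            # approximate Newton decrement
    return w - v / (1.0 + delta), delta

# Illustrative use on ridge-regularized logistic regression with toy data.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 10)), rng.choice([-1.0, 1.0], 200)
lam = 0.1

def grad_fn(w):
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -(X.T @ (y * p)) / len(y) + lam * w

def hess_vec_fn(w, v):
    s = 1.0 / (1.0 + np.exp(-X @ w))
    d = s * (1 - s)
    return X.T @ (d * (X @ v)) / len(y) + lam * v

w = np.zeros(10)
for _ in range(10):
    w, decrement = inexact_damped_newton_step(w, grad_fn, hess_vec_fn)
```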
Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?
In this paper we study the effect of the way that the data is partitioned in
distributed optimization. The original DiSCO algorithm [Communication-Efficient
Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and
Lin Xiao, 2015] partitions the input data based on samples. We describe how the
original algorithm has to be modified to allow partitioning on features and
show its efficiency both in theory and in practice.
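A toy illustration of the two partitioning schemes being contrasted; the array shapes and machine counts are arbitrary.

```python
import numpy as np

X = np.arange(24, dtype=float).reshape(6, 4)   # 6 samples, 4 features

# Partition by samples (rows): each machine holds the full feature vector
# for a subset of the data points.
sample_shards = np.array_split(X, 3, axis=0)   # 3 machines, 2 samples each

# Partition by features (columns): each machine holds a subset of the
# coordinates of every data point.
feature_shards = np.array_split(X, 2, axis=1)  # 2 machines, 2 features each
```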
Distributed Inexact Damped Newton Method: Data Partitioning and Load-Balancing
In this paper we study an inexact damped Newton method implemented in a
distributed environment. We start with an original DiSCO algorithm
[Communication-Efficient Distributed Optimization of Self-Concordant Empirical
Loss, Yuchen Zhang and Lin Xiao, 2015]. We will show that this algorithm may
not scale well and propose algorithmic modifications that lead to fewer
communications, better load-balancing, and more efficient computation. We
perform numerical experiments on a regularized empirical loss minimization
instance described by a 273 GB dataset.
Federated Optimization: Distributed Optimization Beyond the Datacenter
We introduce a new and increasingly relevant setting for distributed
optimization in machine learning, where the data defining the optimization are
distributed (unevenly) over an extremely large number of nodes, but the goal
remains to train a high-quality centralized model. We refer to this setting as
Federated Optimization. In this setting, communication efficiency is of utmost
importance.
A motivating example for federated optimization arises when we keep the
training data locally on users' mobile devices rather than logging it to a data
center for training. Instead, the mobile devices are used as nodes performing
computation on their local data in order to update a global model. We suppose
that we have an extremely large number of devices in our network, each of which
has only a tiny fraction of the total data available; in particular, we expect
the number of data points available locally to be much smaller than the number
of devices. Additionally, since different users generate data with different
patterns, we assume that no device has a representative sample of the overall
distribution.
We show that existing algorithms are not suitable for this setting, and
propose a new algorithm which shows encouraging experimental results. This work
also sets a path for future research needed in the context of federated
optimization.
Comment: NIPS workshop version.
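As a rough illustration of the setting (not of the specific algorithm proposed in the paper), the sketch below shows a generic local-update-then-average communication round on a least-squares problem; all names, the step size, and the data generation are assumptions for the example.

```python
import numpy as np

def local_update_round(w_global, node_data, lr=0.05, local_steps=5):
    """One communication round of a generic federated-style scheme: each node
    runs a few local gradient steps from the current global model, then the
    server averages the resulting models. A simplified illustration of the
    setting, not the algorithm from the paper."""
    updated = []
    for X, y in node_data:                       # loop stands in for parallel nodes
        w = w_global.copy()
        for _ in range(local_steps):
            grad = X.T @ (X @ w - y) / len(y)    # least-squares gradient on local data
            w -= lr * grad
        updated.append(w)
    return np.mean(updated, axis=0)              # server aggregates by averaging

# Toy network: many nodes, each with only a few, non-identically distributed samples.
rng = np.random.default_rng(1)
w_true = rng.standard_normal(5)
nodes = []
for _ in range(100):
    X = rng.standard_normal((3, 5)) + rng.standard_normal(5)  # node-specific shift
    nodes.append((X, X @ w_true + 0.01 * rng.standard_normal(3)))

w = np.zeros(5)
for _ in range(50):
    w = local_update_round(w, nodes)
```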
An Accelerated Communication-Efficient Primal-Dual Optimization Framework for Structured Machine Learning
Distributed optimization algorithms are essential for training machine
learning models on very large-scale datasets. However, they often suffer from
communication bottlenecks. Confronting this issue, a communication-efficient
primal-dual coordinate ascent framework (CoCoA) and its improved variant CoCoA+
have been proposed, achieving a convergence rate of O(1/t) for
solving empirical risk minimization problems with Lipschitz continuous losses.
In this paper, an accelerated variant of CoCoA+ is proposed and shown to
possess a convergence rate of O(1/t^2) in terms of reducing
suboptimality. The analysis of this rate is also notable in that the
convergence rate bounds involve constants that, except in extreme cases, are
significantly reduced compared to those previously provided for CoCoA+. The
results of numerical experiments are provided to show that acceleration can
lead to significant performance gains.
Scaling Up Quasi-Newton Algorithms: Communication Efficient Distributed SR1
In this paper, we present a scalable distributed implementation of the
Sampled Limited-memory Symmetric Rank-1 (S-LSR1) algorithm. First, we show that
a naive distributed implementation of S-LSR1 requires multiple rounds of
expensive communications at every iteration and thus is inefficient. We then
propose DS-LSR1, a communication-efficient variant that: (i) drastically
reduces the amount of data communicated at every iteration, (ii) has favorable
work-load balancing across nodes, and (iii) is matrix-free and inverse-free.
The proposed method scales well in terms of both the dimension of the problem
and the number of data points. Finally, we illustrate the empirical performance
of DS-LSR1 on a standard neural network training task.
Comment: 24 pages, 14 figures, 6 tables.
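For background, the symmetric rank-1 (SR1) quasi-Newton update that S-LSR1 builds on can be sketched as follows; this shows only the classical dense update with the standard skipping safeguard, not the sampled, limited-memory, or distributed machinery of DS-LSR1.

```python
import numpy as np

def sr1_update(B, s, y, tol=1e-8):
    """Symmetric rank-1 update of a Hessian approximation B from a step s and
    gradient difference y; the update is skipped when the denominator is tiny,
    the usual safeguard against instability."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) < tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B                                  # skip the update
    return B + np.outer(r, r) / denom
```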
Randomized block proximal damped Newton method for composite self-concordant minimization
In this paper we consider the composite self-concordant (CSC) minimization
problem, which minimizes the sum of a self-concordant function f and a
(possibly nonsmooth) proper closed convex function g. The CSC minimization is
the cornerstone of the path-following interior point methods for solving a
broad class of convex optimization problems. It has also found numerous
applications in machine learning. Proximal damped Newton (PDN) methods, which
enjoy a nice iteration complexity, have been well studied in the literature for
solving this problem. Given that at each iteration these methods typically
require evaluating or accessing the Hessian of f and also need to solve a
proximal Newton subproblem, the cost per iteration can be prohibitively high
when applied to large-scale problems. Inspired by the recent success of block
coordinate descent methods, we propose a randomized block proximal damped
Newton (RBPDN) method for solving the CSC minimization. Compared to the PDN
methods, the computational cost per iteration of RBPDN is usually significantly
lower. Computational experiments on a class of regularized logistic
regression problems demonstrate that RBPDN is indeed promising in solving
large-scale CSC minimization problems. The convergence of RBPDN is also
analyzed in the paper. In particular, we show that RBPDN is globally convergent
when g is Lipschitz continuous. It is also shown that RBPDN enjoys local
linear convergence. Moreover, we show that for a class of g, including the
case where g is Lipschitz differentiable, RBPDN enjoys global linear
convergence. As a striking consequence, this shows that the classical damped
Newton methods [22,40] and the PDN [31] for such g are globally linearly
convergent, which was previously unknown in the literature. Moreover, this
result can be used to sharpen the existing iteration complexity of these
methods.
Comment: 29 pages.
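As a rough sketch of the randomized block damped Newton pattern for the purely smooth case (i.e. with the nonsmooth term g absent), one block step might look like the following; the function names and the damping by the block Newton decrement are illustrative, and the paper's proximal subproblem for general g is not handled here.

```python
import numpy as np

def block_damped_newton_step(w, grad_fn, hess_fn, block_size, rng):
    """Pick a random coordinate block, solve the Newton system restricted to
    it, and damp the block step by 1 / (1 + block Newton decrement). A
    simplified illustration of the pattern, not the paper's full method."""
    idx = rng.choice(w.size, size=block_size, replace=False)
    g = grad_fn(w)[idx]
    H = hess_fn(w)[np.ix_(idx, idx)]
    d = np.linalg.solve(H, g)
    delta = np.sqrt(g @ d)                  # block Newton decrement
    w_new = w.copy()
    w_new[idx] -= d / (1.0 + delta)
    return w_new

# Toy usage on a strongly convex quadratic 0.5 w^T A w - b^T w.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)
b = rng.standard_normal(20)
w = np.zeros(20)
for _ in range(100):
    w = block_damped_newton_step(w, lambda x: A @ x - b, lambda x: A,
                                 block_size=5, rng=rng)
```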
Communication Complexity of Distributed Convex Learning and Optimization
We study the fundamental limits to communication-efficient distributed
methods for convex learning and optimization, under different assumptions on
the information available to individual machines, and the types of functions
considered. We identify cases where existing algorithms are already worst-case
optimal, as well as cases where room for further improvement is still possible.
Among other things, our results indicate that without similarity between the
local objective functions (due to statistical data similarity or otherwise)
many communication rounds may be required, even if the machines have unbounded
computational power.
Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity
We study distributed optimization algorithms for minimizing the average of
convex functions. The applications include empirical risk minimization problems
in statistical machine learning where the datasets are large and have to be
stored on different machines. We design a distributed stochastic variance
reduced gradient algorithm that, under certain conditions on the condition
number, simultaneously achieves the optimal parallel runtime, amount of
communication and rounds of communication among all distributed first-order
methods up to constant factors. Our method and its accelerated extension also
outperform existing distributed algorithms in terms of the rounds of
communication as long as the condition number is not too large compared to the
size of data in each machine. We also prove a lower bound for the number of
rounds of communication for a broad class of distributed first-order methods
including the proposed algorithms in this paper. We show that our accelerated
distributed stochastic variance reduced gradient algorithm achieves this lower
bound so that it uses the fewest rounds of communication among all distributed
first-order algorithms.
Comment: significant addition to both theory and experimental results.
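For reference, plain single-machine SVRG, the building block that the distributed algorithm extends, can be sketched as below; grad_i, the step size, and the loop counts are placeholders, and neither the distribution across machines nor the accelerated extension is shown.

```python
import numpy as np

def svrg(w0, grad_i, n, lr=0.05, outer_iters=10, inner_iters=None, seed=0):
    """Plain SVRG: at each outer iteration take a snapshot, compute its full
    gradient, then run an inner loop of variance-reduced stochastic steps.
    grad_i(w, i) returns the gradient of the i-th component function at w."""
    rng = np.random.default_rng(seed)
    inner_iters = inner_iters or 2 * n
    w = w0.copy()
    for _ in range(outer_iters):
        w_snap = w.copy()
        mu = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)  # full gradient
        for _ in range(inner_iters):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + mu   # variance-reduced gradient
            w -= lr * g
    return w
```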
Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization
Stochastic gradient methods for machine learning and optimization problems
are usually analyzed assuming data points are sampled with replacement.
In practice, however, sampling without replacement is very common,
easier to implement in many cases, and often performs better. In this paper, we
provide competitive convergence guarantees for without-replacement sampling,
under various scenarios, for three types of algorithms: Any algorithm with
online regret guarantees, stochastic gradient descent, and SVRG. A useful
application of our SVRG analysis is a nearly-optimal algorithm for regularized
least squares in a distributed setting, in terms of both communication
complexity and runtime complexity, when the data is randomly partitioned and
the condition number can be as large as the data size per machine (up to
logarithmic factors). Our proof techniques combine ideas from stochastic
optimization, adversarial online learning, and transductive learning theory,
and can potentially be applied to other stochastic optimization and learning
problems.
Comment: Fixed a few minor typos, and slightly tightened a corollary.
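The difference between the two sampling schemes being analyzed is simply how the index sequence for the stochastic gradient steps is drawn; a minimal sketch is below, where grad_i and the step size are placeholders.

```python
import numpy as np

def sgd_epoch(w, grad_i, n, lr, rng, replacement=False):
    """One pass of SGD over n data points, sampling indices either with
    replacement (i.i.d. uniform draws, the usual analysis) or without
    replacement (a random permutation, i.e. random reshuffling)."""
    if replacement:
        order = rng.integers(n, size=n)       # i.i.d. draws, duplicates possible
    else:
        order = rng.permutation(n)            # each point visited exactly once
    for i in order:
        w -= lr * grad_i(w, i)
    return w
```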