12 research outputs found
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
We consider decentralized stochastic optimization with the objective function
(e.g. data samples for a machine learning task) being distributed over n
machines that can only communicate with their neighbors on a fixed communication
graph. To reduce the communication bottleneck, the nodes compress (e.g.
quantize or sparsify) their model updates. We cover both unbiased and biased
compression operators with quality denoted by ω ≤ 1 (ω = 1
meaning no compression). We (i) propose a novel gossip-based stochastic
gradient descent algorithm, CHOCO-SGD, that converges at rate
O(1/(nT) + 1/(Tδ^2ω)^2) for strongly convex
objectives, where T denotes the number of iterations and δ the
eigengap of the connectivity matrix. Although compression quality and network
connectivity affect the higher-order terms, the first term in the rate,
O(1/(nT)), is the same as for the centralized baseline with exact
communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the
average consensus problem that converges in time O(1/(δ^2ω) log(1/ϵ))
for accuracy ϵ > 0. This is (to our knowledge) the first gossip algorithm that supports
arbitrary compressed messages for ω > 0 and still exhibits linear
convergence. We (iii) show in experiments that both of our algorithms
outperform the respective state-of-the-art baselines and that CHOCO-SGD can reduce
communication by at least two orders of magnitude
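The idea of gossip averaging with compressed messages can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the top-k sparsifier, the ring mixing matrix with weights 1/3, the step size `gamma`, and all function names are hypothetical choices made for this example. Each node sends only a compressed difference between its local vector and a publicly known estimate of it, then averages over the estimates.

```python
import random

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest (a biased compressor)."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out

def ring_weights(n):
    """Symmetric, doubly stochastic mixing matrix for a ring (assumes n >= 3)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W

def compressed_gossip(x, W, k, gamma, steps):
    """x is a list of n local vectors; returns the local vectors after `steps` rounds."""
    n, d = len(x), len(x[0])
    x = [row[:] for row in x]              # do not modify the caller's data
    x_hat = [[0.0] * d for _ in range(n)]  # public estimates of every x_i
    for _ in range(steps):
        # Each node transmits only a compressed difference to its public estimate.
        q = [top_k([x[i][c] - x_hat[i][c] for c in range(d)], k) for i in range(n)]
        for i in range(n):
            for c in range(d):
                x_hat[i][c] += q[i][c]
        # Averaging step on the public estimates, damped by the step size gamma.
        new_x = []
        for i in range(n):
            row = x[i][:]
            for j in range(n):
                if W[i][j]:
                    for c in range(d):
                        row[c] += gamma * W[i][j] * (x_hat[j][c] - x_hat[i][c])
            new_x.append(row)
        x = new_x
    return x
```

Because the mixing matrix is doubly stochastic, every round leaves the network-wide average of the local vectors unchanged, whatever the compressor does. How fast the nodes agree depends on `gamma`, the compressor, and the graph, and this sketch makes no claim about that.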
Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders
Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing
neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being
popular choices for cycling through random or single permutations of the
training data. However, the convergence properties of these algorithms in the
non-convex case are not fully understood. Existing results suggest that, in
realistic training scenarios where the number of epochs is smaller than the
training set size, RR may perform worse than SGD.
In this paper, we analyze a general SGD algorithm that allows for arbitrary
data orderings and show improved convergence rates for non-convex functions.
Specifically, our analysis reveals that SGD with random and single shuffling is
always faster or at least as good as classical SGD with replacement, regardless
of the number of iterations. Overall, our study highlights the benefits of
using SGD with random/single shuffling and provides new insights into its
convergence properties for non-convex optimization
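The three data orders compared in this abstract differ only in how the sample index is chosen at each step. The sketch below makes that concrete. The function names and the scalar `sgd` loop are hypothetical helpers for illustration and do not come from the paper.

```python
import random

def with_replacement(n, epochs, rng):
    """Classical SGD: every step draws an index independently and uniformly."""
    return [[rng.randrange(n) for _ in range(n)] for _ in range(epochs)]

def random_reshuffling(n, epochs, rng):
    """RR: a fresh permutation of the data at the start of every epoch."""
    return [rng.sample(range(n), n) for _ in range(epochs)]

def single_shuffle(n, epochs, rng):
    """SS: one permutation drawn once and reused in every epoch."""
    perm = rng.sample(range(n), n)
    return [perm[:] for _ in range(epochs)]

def sgd(grad_i, x0, orders, lr):
    """Run SGD on a scalar iterate, visiting samples in the given order."""
    x = x0
    for epoch in orders:
        for i in epoch:
            x -= lr * grad_i(x, i)
    return x
```

For example, with `data = [0, 1, 2, 3, 4]` and `grad_i = lambda x, i: x - data[i]`, all three orderings drive `x` toward the mean of the data. Under RR and SS every sample is visited exactly once per epoch, which sampling with replacement does not guarantee.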
Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy
We study gradient descent under linearly correlated noise. Our work is
motivated by recent practical methods for optimization with differential
privacy (DP), such as DP-FTRL, which achieve strong performance in settings
where privacy amplification techniques are infeasible (such as in federated
learning). These methods inject privacy noise through a matrix factorization
mechanism, making the noise linearly correlated over iterations. We propose a
simplified setting that distills key facets of these methods and isolates the
impact of linearly correlated noise. We analyze the behavior of gradient
descent in this setting, for both convex and non-convex functions. Our analysis
is demonstrably tighter than prior work and recovers multiple important special
cases exactly (including anticorrelated perturbed gradient descent). We use our
results to develop new, effective matrix factorizations for differentially
private optimization, and highlight the benefits of these factorizations
theoretically and empirically
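A small sketch can show what "linearly correlated over iterations" means. It assumes, for illustration only, that the noise added at step t is row t of a matrix `B` applied to i.i.d. draws `z`. The names `B`, `z`, `correlated_noise`, and `noisy_gd` are hypothetical, and the anticorrelated case is modeled as noise z_t - z_{t-1}.

```python
import random

def correlated_noise(B, z):
    """Row t of B mixes the i.i.d. draws z into the noise injected at step t."""
    return [sum(B[t][s] * z[s] for s in range(len(z))) for t in range(len(B))]

def identity(T):
    """Independent noise: step t receives exactly z_t."""
    return [[1.0 if s == t else 0.0 for s in range(T)] for t in range(T)]

def anticorrelated(T):
    """Noise z_t - z_{t-1}: 1 on the diagonal, -1 on the subdiagonal."""
    return [[1.0 if s == t else -1.0 if s == t - 1 else 0.0 for s in range(T)]
            for t in range(T)]

def noisy_gd(grad, x0, lr, B, z):
    """Gradient descent on a scalar where step t uses grad(x_t) plus correlated noise."""
    xi = correlated_noise(B, z)
    x = x0
    for t in range(len(B)):
        x -= lr * (grad(x) + xi[t])
    return x
```

With the anticorrelated matrix the injected noise telescopes: summed over all steps, only the last draw remains. With the identity matrix all draws accumulate. The choice of `B` therefore changes how much noise builds up in the iterates.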
Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning
We study the asynchronous stochastic gradient descent algorithm for distributed training over n workers which have varying computation and communication frequency over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return those to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay τ_{max} and show that an ϵ-stationary point is reached after O(σ^2ϵ^{−2}+τ_{max}ϵ^{−1}) iterations, where σ^2 denotes the variance of stochastic gradients.
In this work (i) we obtain a tighter convergence rate of O(σ^2ϵ^{−2}+\sqrt{τ_{max}τ_{avg}}ϵ^{−1}) without any change in the algorithm, where τ_{avg} is the average delay, which can be significantly smaller than τ_{max}. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of O(σ^2ϵ^{−2}+τ_{avg}ϵ^{−1}) and requires neither extra hyperparameter tuning nor extra communication. Our result allows us to show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker
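To see how much the delay term can shrink, the three quantities named in the abstract (maximum delay, geometric mean of maximum and average delay, average delay) can be computed for a list of observed delays. This is a worked arithmetic sketch: the helper names and the example delays are hypothetical, and constants hidden by the O-notation are ignored.

```python
import math

def delay_terms(delays):
    """Return (tau_max, sqrt(tau_max * tau_avg), tau_avg) for observed gradient delays."""
    tau_max = max(delays)
    tau_avg = sum(delays) / len(delays)
    return tau_max, math.sqrt(tau_max * tau_avg), tau_avg

def iteration_bound(sigma2, eps, delay_term):
    """Evaluate sigma^2 * eps^-2 + delay_term * eps^-1, ignoring constants."""
    return sigma2 / eps**2 + delay_term / eps
```

For instance, if 99 gradients arrive with delay 1 and a single straggler has delay 100, then `delay_terms([1] * 99 + [100])` gives a maximum of 100, an average of 1.99, and a geometric mean of about 14.1. A single slow worker dominates the maximum-delay bound but barely moves the average.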
Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees
Gradient clipping is a popular modification to standard (stochastic) gradient descent, at every iteration limiting the gradient norm to a certain value c > 0. It is widely used, for example, for stabilizing the training of deep learning models (Goodfellow et al., 2016), or for enforcing differential privacy (Abadi et al., 2016). Despite the popularity and simplicity of the clipping mechanism, its convergence guarantees often require specific values of c and strong noise assumptions. In this paper, we give convergence guarantees that show precise dependence on arbitrary clipping thresholds c and show that our guarantees are tight with both deterministic and stochastic gradients. In particular, we show that (i) for deterministic gradient descent, the clipping threshold only affects the higher-order terms of convergence, (ii) in the stochastic setting convergence to the true optimum cannot be guaranteed under the standard noise assumption, even under arbitrarily small step-sizes. We give matching upper and lower bounds for convergence of the gradient norm when running clipped SGD, and illustrate these results with experiments
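The clipping operation and the stochastic bias it can introduce are easy to illustrate. The sketch below calls the clipping threshold `c` and uses a hypothetical two-point noise distribution chosen for this example, not taken from the paper. The noise has zero mean, so the unclipped stochastic gradient is unbiased at the optimum, yet its clipped version has a nonzero expectation.

```python
import math
from fractions import Fraction

def clip(g, c):
    """Rescale g so its Euclidean norm is at most c; leave smaller gradients unchanged."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= c:
        return list(g)
    return [v * c / norm for v in g]

def expected_clipped_gradient(outcomes, c):
    """Exact expectation of a clipped scalar stochastic gradient.

    outcomes is a list of (probability, gradient value) pairs.
    """
    return sum(p * max(-c, min(c, g)) for p, g in outcomes)

# Hypothetical noise at the optimum of f(x) = x^2 / 2, where the true gradient is 0:
# the stochastic gradient is -1 with probability 2/3 and +2 with probability 1/3.
outcomes = [(Fraction(2, 3), Fraction(-1)), (Fraction(1, 3), Fraction(2))]
```

With a threshold large enough to never clip, `expected_clipped_gradient(outcomes, Fraction(100))` is 0. With `c = 1` the value +2 is cut to +1 and the expectation becomes -1/3, so clipped SGD is pushed away from the optimum even though the unclipped noise has zero mean.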
Decentralized Local Stochastic Extra-Gradient for Variational Inequalities
We consider distributed stochastic variational inequalities (VIs) on unbounded domains with problem data that is heterogeneous (non-IID) and distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and centralized topologies commonly used in Federated Learning. Moreover, multiple local updates on the workers can be made to reduce the communication frequency between workers. We extend the stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly monotone, monotone, and non-monotone settings when a Minty solution exists. The provided rates explicitly exhibit the dependence on network characteristics (e.g., mixing time), iteration counter, data heterogeneity, variance, number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPP), e.g., to training Deep Generative Adversarial Networks (GANs), for which decentralized training has been reported to be extremely challenging. In experiments on decentralized training of GANs, we demonstrate the effectiveness of our proposed approach
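For readers unfamiliar with the extragradient method that this work extends, the sketch below shows the basic deterministic, single-machine update on the bilinear saddle-point problem min_x max_y x*y. It deliberately omits everything specific to the paper (stochastic gradients, local steps, decentralized communication). The problem choice and function names are assumptions made for illustration.

```python
def operator(z):
    """F(x, y) = (df/dx, -df/dy) for the bilinear saddle point f(x, y) = x * y."""
    x, y = z
    return (y, -x)

def gda_step(z, lr):
    """Plain gradient descent-ascent: one operator evaluation at the current point."""
    g = operator(z)
    return (z[0] - lr * g[0], z[1] - lr * g[1])

def extragradient_step(z, lr):
    """Extragradient: extrapolate to a look-ahead point, then update z with F there."""
    g = operator(z)
    half = (z[0] - lr * g[0], z[1] - lr * g[1])
    g_half = operator(half)
    return (z[0] - lr * g_half[0], z[1] - lr * g_half[1])

def run(step, z0, lr, steps):
    """Apply the given step function repeatedly, starting from z0."""
    z = z0
    for _ in range(steps):
        z = step(z, lr)
    return z
```

On this problem each gradient descent-ascent step multiplies the squared distance to the solution (0, 0) by 1 + lr^2, so the iterates spiral outward. Each extragradient step multiplies it by 1 - lr^2 + lr^4, which is below 1 for any step size 0 < lr < 1, so the iterates contract.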