Stochastic, distributed and federated optimization for machine learning
We study optimization algorithms for the finite sum problems frequently arising in machine
learning applications. First, we propose novel variants of stochastic gradient descent with
a variance reduction property that enables linear convergence for strongly convex objectives.
Second, we study distributed setting, in which the data describing the optimization problem
does not fit into a single computing node. In this case, traditional methods are inefficient,
as the communication costs inherent in distributed optimization become the bottleneck. We
propose a communication-efficient framework which iteratively forms local subproblems that can
be solved with arbitrary local optimization algorithms. Finally, we introduce the concept of
Federated Optimization/Learning, where we try to solve the machine learning problems without
having data stored in any centralized manner. The main motivation comes from industry settings
that handle user-generated data. The currently prevalent practice is for companies to collect vast
amounts of user data and store them in datacenters. The alternative we propose is not to collect
the data in the first place, and instead to occasionally use the computational power of users'
devices to solve the very same optimization problems, while alleviating privacy concerns at the
same time. In such a setting, minimizing the number of communication rounds is the primary goal,
and we demonstrate that solving the optimization problems under these circumstances is
conceptually tractable.
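
For illustration, the following is a minimal sketch of an SVRG-style variance-reduced stochastic gradient step of the kind the first part of the abstract refers to; the function names and the control-variate form are assumptions for exposition, not necessarily the exact variant proposed in the thesis.

```python
import numpy as np

def svrg(grad_i, x0, n, step, epochs, inner_iters, rng=None):
    """SVRG-style variance-reduced SGD for (1/n) * sum_i f_i(x) (sketch).

    grad_i(x, i) returns the gradient of the i-th summand f_i at x.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot, recomputed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_iters):
            i = rng.integers(n)
            # Control-variate correction: the stochastic gradient is recentred
            # by the snapshot gradient, so its variance shrinks as the iterate
            # approaches the snapshot, allowing a constant step-size and, for
            # strongly convex objectives, linear convergence.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= step * g
    return x
```

A hypothetical least-squares usage would pass `grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]` for given data `A`, `b`.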
Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)
As models for natural language processing (NLP), computer vision (CV) and
recommendation systems (RS) require surging computation, large numbers of
GPUs/TPUs are run in parallel with a large batch (LB) to improve training throughput.
However, training such LB tasks often suffers from a large generalization gap and
degraded final accuracy, which limits further increases in batch size. In this
work, we develop the variance reduced gradient descent technique (VRGD) based
on the gradient signal to noise ratio (GSNR) and apply it to popular
optimizers such as SGD/Adam/LARS/LAMB. We carry out a theoretical analysis of
the convergence rate to explain its fast training dynamics, and a generalization
analysis to demonstrate its smaller generalization gap on LB training.
Comprehensive experiments demonstrate that VRGD can accelerate training (), narrow the
generalization gap and improve final accuracy. We push the batch size limit of BERT
pretraining up to 128k/64k and DLRM to 512k without noticeable accuracy loss. We improve
ImageNet Top-1 accuracy at 96k by  than LARS. The generalization gap of BERT and ImageNet
training is significantly reduced by over .
Comment: 25 pages, 5 figures
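
The abstract does not spell out the VRGD update rule, so the following is only a hedged sketch of how a per-parameter gradient signal-to-noise ratio could be computed and used to damp noisy coordinates of a base SGD step; the scaling rule here is an assumption for illustration, not the paper's method.

```python
import numpy as np

def gsnr(per_sample_grads, eps=1e-12):
    """Per-parameter gradient signal-to-noise ratio.

    per_sample_grads: array of shape (batch, dim), one gradient per sample.
    A common definition is mean(g)^2 / var(g), taken coordinate-wise.
    """
    mean = per_sample_grads.mean(axis=0)
    var = per_sample_grads.var(axis=0)
    return mean ** 2 / (var + eps)

def gsnr_damped_sgd_step(params, per_sample_grads, lr):
    """Illustrative SGD step that shrinks low-GSNR (noisy) coordinates.

    Not the paper's VRGD update; it only shows one way a GSNR statistic
    could modulate a base optimizer such as SGD.
    """
    g = per_sample_grads.mean(axis=0)
    snr = gsnr(per_sample_grads)
    scale = snr / (1.0 + snr)  # in (0, 1): small for noisy coordinates
    return params - lr * scale * g
```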
Fixed Point Iterations for Finite Sum Monotone Inclusions
This thesis studies two families of methods for finding zeros of finite sums of monotone operators, the first being variance-reduced stochastic gradient (VRSG) methods. This is a large family of algorithms that use random sampling to improve the convergence rate compared to more traditional approaches. We examine the optimal sampling distributions and their interaction with the epoch length. Specifically, we show that in methods like SAGA, where the epoch length is directly tied to the random sampling, the optimal sampling becomes more complex than in, for instance, L-SVRG, where the epoch length can be chosen independently. We also show that biased VRSG estimates in the style of SAG are sensitive to the problem setting. More precisely, a significantly larger step-size can be used when the monotone operators are cocoercive gradients than when they are merely cocoercive. This is noteworthy since standard gradient descent is not affected by this change, and since the sensitivity to the problem assumption vanishes when the estimates are unbiased.

The second set of methods we examine are deterministic operator splitting methods, and we focus on frameworks for constructing and analyzing such methods. One such framework is based on what we call nonlinear resolvents, and we present a novel way of ensuring convergence of iterations of nonlinear resolvents by means of a momentum term. In many cases this approach leads to a cheaper per-iteration cost than a previously established projection approach. The framework covers many existing methods, and we provide a new primal-dual method that uses an extra resolvent step, as well as a general approach for adding momentum to any special case of our nonlinear resolvent method. We use a concept similar to the nonlinear resolvent to derive a representation of the entire class of frugal splitting operators, which are splitting operators that use exactly one direct or resolvent evaluation of each operator of the monotone inclusion problem. The representation reveals several new results regarding lifting numbers, existence of solution maps, and parallelizability of the forward/backward evaluations. We show that the minimal lifting is n − 1 − f, where n is the number of monotone operators and f is the number of direct evaluations in the splitting. A new convergent and parallelizable frugal splitting operator with minimal lifting is also presented.
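
As a concrete reference point for the first family, the following is a minimal sketch of a loopless-SVRG (L-SVRG) style iteration for finding a zero of (1/n) * sum_i A_i when each A_i is cocoercive; uniform sampling and a fixed step-size are assumptions for illustration, whereas the thesis analyzes optimal (non-uniform) sampling and its interaction with the expected epoch length 1/p.

```python
import numpy as np

def l_svrg_zero(ops, x0, step, p, iters, rng=None):
    """Loopless-SVRG-style iteration for a zero of (1/n) * sum_i A_i(x).

    ops: list of callables A_i (assumed cocoercive operators).
    p: probability of refreshing the snapshot, so the expected epoch
       length is 1/p and can be chosen independently of the sampling.
    """
    rng = rng or np.random.default_rng(0)
    n = len(ops)
    x = np.array(x0, dtype=float)
    w = x.copy()                                    # snapshot point
    full = np.mean([A(w) for A in ops], axis=0)     # full operator at snapshot
    for _ in range(iters):
        i = rng.integers(n)
        # Unbiased variance-reduced estimate of the full operator at x.
        v = ops[i](x) - ops[i](w) + full
        x = x - step * v
        if rng.random() < p:                        # geometric "epoch" length
            w = x.copy()
            full = np.mean([A(w) for A in ops], axis=0)
    return x
```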