Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality
Stochastic Gradient Descent (SGD) is one of the simplest and most popular
algorithms in modern statistical and machine learning due to its computational
and memory efficiency. Various averaging schemes have been proposed to
accelerate the convergence of SGD in different settings. In this paper, we
explore a general averaging scheme for SGD. Specifically, we establish the
asymptotic normality of a broad range of weighted averaged SGD solutions and
provide asymptotically valid online inference approaches. Furthermore, we
propose an adaptive averaging scheme that exhibits both an optimal statistical
rate and favorable non-asymptotic convergence, drawing insights from the
optimal weight for the linear model in terms of non-asymptotic mean squared
error (MSE).
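As a rough illustration of iterate averaging on top of plain SGD, the sketch below runs SGD on a least-squares problem and maintains an online weighted average of the iterates. The polynomial weights w_t proportional to t**alpha, the step-size schedule, and the name weighted_averaged_sgd are illustrative assumptions; the paper's adaptive weighting scheme and its online inference procedures are not reproduced here.

import numpy as np

def weighted_averaged_sgd(X, y, lr=0.01, alpha=1.0, seed=0):
    # Plain SGD on 0.5 * (x'theta - y)^2 with a running weighted average of the iterates.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)        # current SGD iterate
    theta_bar = np.zeros(d)    # weighted average of all iterates so far
    weight_sum = 0.0
    for t in range(1, n + 1):
        i = rng.integers(n)                       # draw one data point
        grad = (X[i] @ theta - y[i]) * X[i]       # stochastic gradient
        theta = theta - (lr / np.sqrt(t)) * grad  # decaying step size (illustrative choice)
        w = t ** alpha                            # polynomial weight (illustrative choice)
        weight_sum += w
        theta_bar += (w / weight_sum) * (theta - theta_bar)  # online weighted mean
    return theta, theta_bar

# Toy usage on a noisy linear model.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 5))
theta_star = np.arange(1.0, 6.0)
y = X @ theta_star + 0.1 * rng.normal(size=5000)
last_iterate, averaged = weighted_averaged_sgd(X, y)
print(np.linalg.norm(last_iterate - theta_star), np.linalg.norm(averaged - theta_star))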
On variants of stochastic gradient descent
Stochastic Gradient Descent (SGD) has played a crucial role in the success of modern machine learning methods. The popularity of SGD arises due to its ease of implementation, low memory and computational requirements, and applicability to a wide variety of optimization problems. However, SGD suffers from numerous issues; chief amongst them are high variance, slow rate of convergence, poor generalization, non-robustness to outliers, and poor performance for imbalanced classification. In this thesis, we propose variants of stochastic gradient descent, to tackle one or more of these issues for different problem settings.
In the first chapter, we analyze the trade-off between variance and complexity to improve the convergence rate of SGD. A common alternative in the literature to SGD is Stochastic Variance Reduced Gradient (SVRG), which achieves linear convergence. However, SVRG involves the computation of a full gradient every few epochs, which is often intractable. We propose the Cheap Stochastic Variance Reduced Gradient (CheapSVRG) algorithm that attains linear convergence up to a neighborhood around the optimum without requiring a full gradient computation step.
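The sketch below illustrates, under stated assumptions, the variance-reduction idea of replacing SVRG's full-gradient snapshot with a gradient computed on a random subset; it is not the exact CheapSVRG algorithm from the thesis, and the hyperparameters (epochs, inner, lr, snapshot_size) are placeholders.

import numpy as np

def cheap_svrg(X, y, epochs=20, inner=200, lr=0.01, snapshot_size=200, seed=0):
    # Variance-reduced SGD for least squares in which the SVRG snapshot gradient
    # is estimated on a random subset instead of the full data set.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.choice(n, size=min(snapshot_size, n), replace=False)
        anchor = theta.copy()
        anchor_grad = ((X[idx] @ anchor - y[idx]) @ X[idx]) / len(idx)  # cheap snapshot gradient
        for _ in range(inner):
            i = rng.integers(n)
            g_cur = (X[i] @ theta - y[i]) * X[i]      # gradient at the current iterate
            g_anchor = (X[i] @ anchor - y[i]) * X[i]  # gradient at the snapshot
            theta -= lr * (g_cur - g_anchor + anchor_grad)  # variance-reduced update
    return theta

# Toy usage: noiseless least squares; the iterate reaches a neighborhood of the optimum.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 10))
y = X @ np.ones(10)
print(np.linalg.norm(cheap_svrg(X, y) - np.ones(10)))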
In the second chapter, we compare the generalization capabilities of adaptive and non-adaptive methods for over-parameterized linear regression. Among the many possible solutions, SGD tends to gravitate towards the one with minimum l2-norm, while adaptive methods do not. We provide specific conditions on the pre-conditioner matrices under which a subclass of adaptive methods has the same generalization guarantees as SGD for over-parameterized linear regression. With synthetic examples and real data, we show that minimum-norm solutions are not, by themselves, a reliable certificate of better generalization.
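A small numerical illustration of the minimum-norm behaviour mentioned above, assuming a noiseless over-parameterized least-squares problem: SGD started from zero stays in the row space of X, so it approaches the minimum l2-norm interpolator pinv(X) @ y. The dimensions, step size, and iteration count are arbitrary choices.

import numpy as np

# Noiseless over-parameterized regression: more parameters than samples.
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)           # noiseless labels

theta = np.zeros(d)                  # starting from zero keeps theta in the row space of X
for t in range(20000):
    i = rng.integers(n)
    theta -= 0.005 * (X[i] @ theta - y[i]) * X[i]

min_norm = np.linalg.pinv(X) @ y     # minimum l2-norm interpolating solution
print(np.linalg.norm(theta - min_norm))   # small: SGD ends up close to the min-norm solution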
In the third chapter, we propose a simple variant of SGD that guarantees robustness. Instead of taking an SGD step on a single random sample, we draw a mini-batch and update using only the sample with the lowest loss; we call this variant MKL-SGD. For the noiseless setting, with and without outliers, we provide conditions under which MKL-SGD converges to a provably better solution than SGD in the worst case. We also carry out the standard rate-of-convergence analysis in both the noiseless and noisy settings.
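A minimal sketch of the selection rule described above, assuming a squared loss: draw k samples, keep only the one whose current loss is smallest, and take an SGD step on it. The batch size, step size, and toy outlier setup are illustrative; the thesis's conditions and analysis for MKL-SGD are not reproduced here.

import numpy as np

def mkl_sgd_step(theta, X, y, k, lr, rng):
    idx = rng.integers(len(X), size=k)              # draw a mini-batch of k indices
    losses = 0.5 * (X[idx] @ theta - y[idx]) ** 2   # current per-sample losses
    j = idx[np.argmin(losses)]                      # keep only the lowest-loss sample
    return theta - lr * (X[j] @ theta - y[j]) * X[j]

# Toy usage: a noiseless linear model with a few corrupted labels (outliers).
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5)
y[:50] += 10.0                                      # corrupted labels
theta = np.zeros(5)
for _ in range(20000):
    theta = mkl_sgd_step(theta, X, y, k=4, lr=0.05, rng=rng)
print(np.linalg.norm(theta - np.ones(5)))           # typically close to the true parameter despite outliers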
In the final chapter, we tackle the challenges introduced by imbalanced class distributions in SGD. Instead of using every sample to update the parameters, our proposed Balancing SGD (B-SGD) algorithm rejects samples with low loss, since they are redundant and play little role in determining the separating hyperplane. Imposing this label-dependent, loss-based thresholding scheme on incoming samples allows us to improve the rate of convergence and achieve better generalization.
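A minimal sketch of label-dependent, loss-based thresholding for binary classification with labels in {-1, +1}, assuming a logistic loss and per-class thresholds tau[label]; the thresholds, their values, and the name bsgd_step are illustrative assumptions rather than the thesis's B-SGD rule. Under this sketch, the majority class would get the higher threshold so that more of its low-loss samples are rejected.

import numpy as np

def bsgd_step(theta, x, label, lr, tau):
    # label is +1 or -1; tau maps each label to its loss threshold.
    margin = label * (x @ theta)
    loss = np.logaddexp(0.0, -margin)                # logistic loss on this sample
    if loss < tau[label]:                            # low-loss sample: treated as redundant, skip
        return theta
    sig = 1.0 / (1.0 + np.exp(np.clip(margin, -500, 500)))  # sigmoid(-margin), clipped for stability
    return theta - lr * (-label * sig) * x           # SGD step on the logistic loss

# Illustrative thresholds: reject more low-loss samples from the majority class (+1 here).
tau = {+1: 0.3, -1: 0.05}

# Toy usage with a 90/10 class imbalance.
rng = np.random.default_rng(4)
theta = np.zeros(3)
for _ in range(5000):
    label = 1 if rng.random() < 0.9 else -1
    x = rng.normal(size=3) + label * np.array([1.0, 0.0, 0.0])
    theta = bsgd_step(theta, x, label, lr=0.1, tau=tau)
print(theta)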