SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization
We consider distributed learning scenarios where M machines interact with a
parameter server along several communication rounds in order to minimize a
joint objective function. Focusing on the heterogeneous case, where different
machines may draw samples from different data distributions, we design the
first local update method that provably benefits over the two most prominent
distributed baselines, namely Minibatch-SGD and Local-SGD. Key to our approach
is a slow querying technique that we customize to the distributed setting,
which in turn enables better mitigation of the bias caused by local updates.
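For context, the sketch below shows the vanilla Local-SGD baseline the abstract
compares against; the least-squares objective and all hyperparameters are
illustrative assumptions, and the paper's slow-query rule is only gestured at
in a comment since the abstract does not spell it out.

    import numpy as np

    def local_sgd(machine_data, w0, rounds=10, local_steps=20, lr=0.01):
        # Vanilla Local-SGD: every machine runs SGD on its own data, and
        # the parameter server averages the local iterates once per round.
        w = np.asarray(w0, dtype=float)
        rng = np.random.default_rng(0)
        for _ in range(rounds):
            local_models = []
            for X, y in machine_data:        # heterogeneous local datasets
                w_local = w.copy()
                for _ in range(local_steps):
                    i = rng.integers(len(y))
                    # stochastic gradient of a least-squares loss (illustrative)
                    g = (X[i] @ w_local - y[i]) * X[i]
                    w_local -= lr * g
                local_models.append(w_local)
            # Communication round: average the drifted local models. Per the
            # abstract, SLowcal-SGD instead queries gradients at slowly moving
            # points to shrink the bias these local drifts introduce.
            w = np.mean(local_models, axis=0)
        return w

The bias mentioned above arises because each machine's iterates drift toward
its own local minimizer between communication rounds, and heterogeneity makes
that drift worse.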
Online Variance Reduction for Stochastic Optimization
Modern stochastic optimization methods often rely on uniform sampling which
is agnostic to the underlying characteristics of the data. This might degrade
the convergence by yielding estimates that suffer from a high variance. A
possible remedy is to employ non-uniform importance sampling techniques, which
take the structure of the dataset into account. In this work, we investigate a
recently proposed setting which poses variance reduction as an online
optimization problem with bandit feedback. We devise a novel and efficient
algorithm for this setting that finds a sequence of importance sampling
distributions competitive with the best fixed distribution in hindsight, the
first result of this kind. While we present our method for sampling datapoints,
it naturally extends to selecting coordinates or even blocks thereof.
Empirical validations underline the benefits of our method in several settings.
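As a reference point for the setting, the snippet below sketches plain
non-uniform importance sampling of stochastic gradients (the names and the
per_example_grad callback are illustrative): sampling index i with probability
p[i] and reweighting by 1/(n * p[i]) keeps the estimate unbiased for any p,
while its variance depends on p.

    import numpy as np

    def importance_sampled_grad(per_example_grad, p, rng):
        # Unbiased for the average gradient: E[g_i / (n * p_i)] equals
        # (1/n) * sum_i g_i, regardless of the sampling distribution p.
        n = len(p)
        i = rng.choice(n, p=p)
        return per_example_grad(i) / (n * p[i])

The variance-minimizing p weights each datapoint in proportion to its gradient
norm, which cannot be computed up front; the online/bandit formulation studied
in the paper instead learns a competitive p from the single sampled gradient
observed at each step.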
μ²-SGD: Stable Stochastic Optimization via a Double Momentum Mechanism
We consider stochastic convex optimization problems where the objective is an
expectation over smooth functions. For this setting we suggest a novel gradient
estimate that combines two recent mechanisms related to the notion of
momentum. Then, we design an SGD-style algorithm as well as an accelerated
version that make use of this new estimator, and demonstrate the robustness of
these new approaches to the choice of the learning rate. Concretely, we show
that these approaches obtain the optimal convergence rates for both the
noiseless and the noisy case with the same fixed choice of learning rate.
Moreover, for the noisy case we show that these approaches achieve the same
optimal bound for a very wide range of learning rates.
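The abstract does not name the two mechanisms, so the sketch below shows one
well-known momentum-type candidate from this line of work, a STORM-style
recursive ("corrected momentum") estimator; treating it as a stand-in for the
paper's estimator is an assumption.

    def corrected_momentum(grad, x, x_prev, d_prev, beta, sample):
        # STORM-style recursion:
        #   d_t = g(x_t; s) + (1 - beta) * (d_{t-1} - g(x_{t-1}; s)),
        # with BOTH gradients taken on the same fresh sample s. The
        # correction term cancels the bias that plain momentum accumulates
        # as the iterate moves, which is what buys learning-rate robustness.
        return grad(x, sample) + (1.0 - beta) * (d_prev - grad(x_prev, sample))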
Logistic Regression: Tight Bounds for Stochastic and Online Optimization
The logistic loss function is often advocated in machine learning and
statistics as a smooth and strictly convex surrogate for the 0-1 loss. In this
paper we investigate the question of whether these smoothness and convexity
properties make the logistic loss preferable to other widely considered options
such as the hinge loss. We show that in contrast to known asymptotic bounds, as
long as the number of prediction/optimization iterations is sub-exponential,
the logistic loss provides no improvement over a generic non-smooth loss
function such as the hinge loss. In particular we show that the convergence
rate of stochastic logistic optimization is bounded from below by a polynomial
in the diameter of the decision set and the number of prediction iterations,
and provide a matching tight upper bound. This resolves the COLT open problem
of McMahan and Streeter (2012).
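For reference, the two surrogates under comparison, for a labeled example
(x, y) with y in {-1, +1}, are as follows; the function names are ours.

    import numpy as np

    def logistic_loss(w, x, y):
        # Smooth, strictly convex surrogate: log(1 + exp(-y * <w, x>)).
        return np.log1p(np.exp(-y * np.dot(w, x)))

    def hinge_loss(w, x, y):
        # Non-smooth surrogate: max(0, 1 - y * <w, x>).
        return max(0.0, 1.0 - y * np.dot(w, x))

The lower bound says that, despite the smoothness of the first loss, its
worst-case stochastic convergence rate is no better than that of the second
unless the number of iterations grows exponentially.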
Beyond Convexity: Stochastic Quasi-Convex Optimization
Stochastic convex optimization is a basic and well studied primitive in
machine learning. It is well known that convex and Lipschitz functions can be
minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized
Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent that
updates according to the direction of the gradients, rather than the gradients
themselves. In this paper we analyze a stochastic version of NGD and prove its
convergence to a global minimum for a wider class of functions: we require the
functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens
the concept of unimodality to multiple dimensions and allows for certain types of
saddle points, which are a known hurdle for first-order optimization methods
such as gradient descent. Locally-Lipschitz functions are only required to be
Lipschitz in a small region around the optimum. This assumption circumvents
gradient explosion, which is another known hurdle for gradient descent
variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic
normalized gradient descent algorithm provably requires a minimal minibatch
size.
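A minimal sketch of the stochastic NGD update described above (minibatch_grad
is a placeholder callback, and the step size, step count, and batch size are
illustrative):

    import numpy as np

    def stochastic_ngd(minibatch_grad, x0, lr=0.1, steps=1000, batch_size=128):
        # Step along the DIRECTION of a minibatch gradient, discarding its
        # magnitude; normalization copes with plateaus and exploding
        # gradients, but the analysis needs a large enough batch_size.
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            g = minibatch_grad(x, batch_size)  # average over batch_size samples
            norm = np.linalg.norm(g)
            if norm > 0.0:
                x = x - lr * g / norm          # normalized update
        return x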
On Graduated Optimization for Stochastic Non-Convex Problems
The graduated optimization approach, also known as the continuation method,
is a popular heuristic for solving non-convex problems that has received renewed
interest over the last decade. Despite its popularity, very little is known in
terms of theoretical convergence analysis. In this paper we describe a new
first-order algorithm based on graduated optimization and analyze its
performance. We characterize a parameterized family of non-convex functions
for which this algorithm provably converges to a global optimum. In particular,
we prove that the algorithm converges to an ε-approximate solution
within O(1/ε²) gradient-based steps. We extend our algorithm and
analysis to the setting of stochastic non-convex optimization with noisy
gradient feedback, attaining the same convergence rate. Additionally, we
discuss the setting of zero-order optimization, and devise a variant of our
algorithm which converges at a rate of O(d²/ε⁴).
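A minimal sketch of the graduated scheme (the Gaussian smoothing and the fixed
coarse-to-fine schedule are illustrative assumptions; the paper characterizes
the admissible function family and schedule precisely):

    import numpy as np

    def graduated_opt(grad, x0, deltas=(1.0, 0.5, 0.25, 0.1),
                      steps_per_stage=500, lr=0.05, num_samples=8, seed=0):
        # Optimize successively less-smoothed objectives, warm-starting each
        # stage from the previous one. Stage smoothing is
        #   f_delta(x) = E_u[f(x + delta * u)],  u ~ N(0, I),
        # and its gradient is estimated by averaging grad at perturbed points.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        for delta in deltas:                   # coarse-to-fine schedule
            for _ in range(steps_per_stage):
                u = rng.standard_normal((num_samples, x.size))
                g = np.mean([grad(x + delta * ui) for ui in u], axis=0)
                x = x - lr * g
        return x

Early stages see a heavily smoothed, nearly unimodal landscape; later stages
refine the solution on progressively sharper objectives as delta shrinks.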