Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
We consider the stochastic approximation problem where a convex function has
to be minimized, given only the knowledge of unbiased estimates of its
gradients at certain points, a framework which includes machine learning
methods based on the minimization of the empirical risk. We focus on problems
without strong convexity, for which all previously known algorithms achieve a
convergence rate for function values of O(1/n^{1/2}). We consider and analyze
two algorithms that achieve a rate of O(1/n) for classical supervised learning
problems. For least-squares regression, we show that averaged stochastic
gradient descent with constant step-size achieves the desired rate. For
logistic regression, this is achieved by a simple novel stochastic gradient
algorithm that (a) constructs successive local quadratic approximations of the
loss functions, while (b) preserving the same running time complexity as
stochastic gradient descent. For these algorithms, we provide a non-asymptotic
analysis of the generalization error (in expectation, and also in high
probability for least-squares), and run extensive experiments on standard
machine learning benchmarks showing that they often outperform existing
approaches.
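To make the least-squares scheme concrete, here is a minimal single-pass sketch of constant step-size averaged SGD, assuming NumPy arrays X (n x d) and y (length n); the step-size and the sampling order are illustrative choices, not the paper's prescriptions.

```python
import numpy as np

def averaged_sgd_least_squares(X, y, step_size, seed=0):
    """Single-pass averaged SGD with constant step-size for least-squares.

    A minimal sketch of the averaging scheme described above; step_size
    and the random sampling order are illustrative, not prescribed.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)      # current iterate
    theta_bar = np.zeros(d)  # Polyak-Ruppert average, returned as estimate
    for t, i in enumerate(rng.permutation(n), start=1):
        # unbiased stochastic gradient of the squared loss at sample i
        grad = (X[i] @ theta - y[i]) * X[i]
        theta = theta - step_size * grad
        theta_bar += (theta - theta_bar) / t  # online running average
    return theta_bar
```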
On the Stability Analysis of Open Federated Learning Systems
We consider open federated learning (FL) systems, where clients may join
and/or leave the system during the FL process. Given the variability of the
number of present clients, convergence to a fixed model cannot be guaranteed in
open systems. Instead, we resort to a new performance metric that we term the
stability of open FL systems, which quantifies the magnitude of the learned
model in open systems. Under the assumption that local clients' functions are
strongly convex and smooth, we theoretically quantify the radius of stability
for two FL algorithms, namely local SGD and local Adam. We observe that this
radius depends on several key parameters, including the function condition
number as well as the variance of the stochastic gradient. Our theoretical
results are further verified by numerical simulations on both synthetic and
real-world benchmark datasets.
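As a toy illustration of the open-system setting, the sketch below simulates local SGD where a random subset of a client pool participates in each round; each client holds a strongly convex quadratic and uses noisy gradients, and the largest norm of the server model over rounds serves as an empirical proxy for the stability radius. The participation model, objectives, and constants are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def local_sgd_open_fl(n_rounds=100, pool=20, local_steps=5, lr=0.1, d=10, seed=0):
    """Toy simulation of local SGD in an open FL system.

    Each round a random subset of the pool participates (clients "join
    and leave"). Client i holds the strongly convex quadratic
    f_i(w) = 0.5 * ||w - c_i||^2 and uses noisy gradients. All constants
    are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(pool, d))  # each client's optimum c_i
    w = np.zeros(d)                       # global (server) model
    norms = []
    for _ in range(n_rounds):
        k = rng.integers(1, pool + 1)     # variable number of present clients
        active = rng.choice(pool, size=k, replace=False)
        updates = []
        for i in active:
            w_i = w.copy()
            for _ in range(local_steps):
                noise = rng.normal(scale=0.1, size=d)  # gradient noise
                w_i -= lr * ((w_i - centers[i]) + noise)
            updates.append(w_i)
        w = np.mean(updates, axis=0)      # server averages the participants
        norms.append(np.linalg.norm(w))
    return max(norms)  # empirical proxy for the stability radius
```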
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
We consider the optimization of a quadratic objective function whose
gradients are only accessible through a stochastic oracle that returns the
gradient at any given point plus a zero-mean finite variance random error. We
present the first algorithm that achieves jointly the optimal prediction error
rates for least-squares regression, both in terms of forgetting of initial
conditions in O(1/n^2), and in terms of dependence on the noise and dimension d
of the problem, as O(d/n). Our new algorithm is based on averaged accelerated
regularized gradient descent, and may also be analyzed through finer
assumptions on initial conditions and the Hessian matrix, leading to
dimension-free quantities that may still be small while the "optimal" terms
above are large. In order to characterize the tightness of these new bounds, we
consider an application to non-parametric regression and use the known lower
bounds on the statistical performance (without computational limits), which
happen to match our bounds obtained from a single pass on the data and thus
show optimality of our algorithm in a wide variety of particular trade-offs
between bias and variance.
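A schematic sketch of an averaged accelerated regularized gradient scheme in this spirit is given below; the momentum schedule, the explicit ridge term, and the parameter choices are illustrative simplifications and do not reproduce the paper's exact algorithm. The oracle stochastic_grad(w) is assumed to return an unbiased gradient estimate at w.

```python
import numpy as np

def averaged_accelerated_sgd(stochastic_grad, d, n_iter, step, reg):
    """Schematic averaged accelerated regularized gradient descent.

    stochastic_grad(w) returns an unbiased estimate of the gradient of
    the quadratic objective at w. The Nesterov-style momentum weights and
    the ridge term reg * v are illustrative, not the paper's choices.
    """
    w = np.zeros(d)        # current iterate
    w_prev = np.zeros(d)   # previous iterate (for extrapolation)
    w_bar = np.zeros(d)    # running average, returned as the estimator
    for t in range(1, n_iter + 1):
        beta = (t - 1) / (t + 2)          # Nesterov momentum weight
        v = w + beta * (w - w_prev)       # extrapolated point
        g = stochastic_grad(v) + reg * v  # noisy regularized gradient
        w_prev, w = w, v - step * g       # gradient step from v
        w_bar += (w - w_bar) / t          # online (Polyak-Ruppert) average
    return w_bar
```

For least-squares, stochastic_grad(w) could, for instance, return (x @ w - y_i) * x for a freshly sampled pair (x, y_i).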
Linearly Convergent Frank-Wolfe with Backtracking Line-Search
Structured constraints in machine learning have recently brought the
Frank-Wolfe (FW) family of algorithms back into the spotlight. While the
classical FW algorithm has poor local convergence properties, the Away-steps
and Pairwise FW variants have emerged as improved variants with faster
convergence. However, these improved variants suffer from two practical
limitations: at each iteration they must solve a one-dimensional minimization
problem to set the step-size, and they require the Frank-Wolfe
linear subproblems to be solved exactly. In this paper, we propose variants of
Away-steps and Pairwise FW that lift both restrictions simultaneously. The
proposed methods set the step-size based on a sufficient decrease condition,
and do not require prior knowledge of the objective. Furthermore, they inherit
all the favorable convergence properties of the exact line-search version,
including linear convergence for strongly convex functions over polytopes.
Benchmarks on different machine learning problems illustrate large performance
gains of the proposed variants.
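The backtracking rule at the heart of these variants can be sketched for the classical FW step (the Away-steps and Pairwise variants apply the same rule to their respective directions). Below, lmo(g) is assumed to solve the linear subproblem over the feasible set, and the constants tau and eta are illustrative defaults rather than the paper's tuning.

```python
import numpy as np

def fw_backtracking(f, grad, lmo, x0, n_iter=200, L0=1.0, tau=2.0, eta=0.9):
    """Frank-Wolfe with a backtracking, sufficient-decrease step-size.

    Maintains a local Lipschitz estimate L and accepts the step
    gamma = min(1, gap / (L * ||d||^2)) once the quadratic
    sufficient-decrease condition holds. tau/eta are illustrative.
    """
    x, L = x0.copy(), L0
    for _ in range(n_iter):
        g = grad(x)
        s = lmo(g)             # Frank-Wolfe vertex from the linear subproblem
        d = s - x              # FW direction
        gap = -g @ d           # FW duality gap; nonnegative by optimality of s
        if gap <= 1e-10:
            break
        L *= eta               # optimistically shrink the Lipschitz estimate
        while True:
            gamma = min(1.0, gap / (L * (d @ d)))
            # sufficient decrease against the quadratic upper bound with L
            if f(x + gamma * d) <= f(x) - gamma * gap + 0.5 * L * gamma**2 * (d @ d):
                break
            L *= tau           # backtrack: increase L, which shrinks the step
        x = x + gamma * d
    return x
```

For example, over the probability simplex one can take lmo = lambda g: np.eye(len(g))[np.argmin(g)], so no exact subproblem solver beyond an argmin is needed.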
- …