Stochastic Learning under Random Reshuffling with Constant Step-sizes
In empirical risk optimization, it has been observed that stochastic gradient
implementations that rely on random reshuffling of the data achieve better
performance than implementations that rely on sampling the data uniformly.
Recent works have pursued justifications for this behavior by examining the
convergence rate of the learning process under diminishing step-sizes. This
work focuses on the constant step-size case with strongly convex loss functions.
In this case, convergence is guaranteed to a small neighborhood of the
optimizer albeit at a linear rate. The analysis establishes analytically that
random reshuffling outperforms uniform sampling by showing explicitly that
iterates approach a smaller neighborhood of size $O(\mu^2)$ around the
minimizer rather than $O(\mu)$, where $\mu$ is the step-size. Furthermore, we derive an analytical expression
for the steady-state mean-square-error performance of the algorithm, which
helps clarify in greater detail the differences between sampling with and
without replacement. We also explain the periodic behavior that is observed in
random reshuffling implementations.
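The contrast described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes a toy scalar least-squares risk, and the names `make_data`, `run_sgd`, and `steady_state_mse` are hypothetical. The error is measured at the end of the last epoch and averaged over independent runs.

```python
import random


def make_data(n, seed=0):
    """Toy data (assumption): b = 2*a + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = rng.uniform(0.5, 1.5)
        b = 2.0 * a + rng.gauss(0.0, 0.5)
        data.append((a, b))
    return data


def minimizer(data):
    """Closed-form minimizer of (1/N) * sum 0.5 * (a_i * w - b_i)**2."""
    return sum(a * b for a, b in data) / sum(a * a for a, _ in data)


def run_sgd(data, mu, epochs, reshuffle, rng):
    """Constant step-size SGD; returns the iterate after the last epoch."""
    n = len(data)
    w = 0.0
    for _ in range(epochs):
        if reshuffle:
            # Random reshuffling: each sample is used exactly once per epoch.
            order = list(range(n))
            rng.shuffle(order)
        else:
            # Uniform sampling: n independent draws with replacement.
            order = [rng.randrange(n) for _ in range(n)]
        for i in order:
            a, b = data[i]
            w -= mu * a * (a * w - b)
    return w


def steady_state_mse(data, mu, epochs, reshuffle, runs=50, seed=1):
    """Average squared distance to the minimizer over independent runs."""
    rng = random.Random(seed)
    w_star = minimizer(data)
    errors = [
        (run_sgd(data, mu, epochs, reshuffle, rng) - w_star) ** 2
        for _ in range(runs)
    ]
    return sum(errors) / runs


if __name__ == "__main__":
    data = make_data(20)
    print("random reshuffling:", steady_state_mse(data, 0.01, 100, True))
    print("uniform sampling:  ", steady_state_mse(data, 0.01, 100, False))
```

On this toy problem the reshuffled runs should settle noticeably closer to the minimizer than the uniformly sampled ones, which is consistent with the neighborhood sizes stated in the abstract. The sketch does not reproduce the paper's analysis.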
Variance-Reduced Stochastic Learning by Networked Agents under Random Reshuffling
A new amortized variance-reduced gradient (AVRG) algorithm was developed in
\cite{ying2017convergence}, which has constant storage requirement in
comparison to SAGA and balanced gradient computations in comparison to SVRG.
One key advantage of the AVRG strategy is its amenability to decentralized
implementations. In this work, we show how AVRG can be extended to the network
case where multiple learning agents are assumed to be connected by a graph
topology. In this scenario, each agent observes data that is spatially
distributed and all agents are only allowed to communicate with direct
neighbors. Moreover, the amount of data observed by the individual agents may
differ drastically. For such situations, the balanced gradient computation
property of AVRG becomes a real advantage in reducing idle time caused by
unbalanced local data storage requirements, which is characteristic of other
reduced-variance gradient algorithms. The resulting diffusion-AVRG algorithm is
shown to have linear convergence to the exact solution, and is much more memory
efficient than other alternative algorithms. In addition, we propose a
mini-batch strategy to balance the communication and computation efficiency for
diffusion-AVRG. When a proper batch size is employed, it is observed in
simulations that diffusion-AVRG is more computationally efficient than exact
diffusion or EXTRA while maintaining almost the same communication efficiency.
Comment: 23 pages, 12 figures, submitted for publication
Supervised Learning Under Distributed Features
This work studies the problem of learning under both large datasets and
large-dimensional feature space scenarios. The feature information is assumed
to be spread across agents in a network, where each agent observes some of the
features. Through local cooperation, the agents are supposed to interact with
each other to solve an inference problem and converge towards the global
minimizer of an empirical risk. We study this problem exclusively in the primal
domain, and propose new and effective distributed solutions with guaranteed
convergence to the minimizer with linear rate under strong convexity. This is
achieved by combining a dynamic diffusion construction, a pipeline strategy,
and variance-reduced techniques. Simulation results illustrate the conclusions.
Distributed stochastic proximal algorithm with random reshuffling for non-smooth finite-sum optimization
Non-smooth finite-sum minimization is a fundamental problem in machine
learning. This paper develops a distributed stochastic proximal-gradient
algorithm with random reshuffling to solve the finite-sum minimization over
time-varying multi-agent networks. The objective function is a sum of
differentiable convex functions and non-smooth regularization. Each agent in
the network updates local variables with a constant step-size by local
information and cooperates to seek an optimal solution. We prove that local
variable estimates generated by the proposed algorithm achieve consensus and
are attracted to a neighborhood of the optimal solution in expectation with an
$\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{T}})$ convergence rate, where $T$ is
the total number of iterations. Finally, some comparative simulations are
provided to verify the convergence performance of the proposed algorithm.
Comment: 15 pages, 7 figures
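As a rough illustration of the ingredients named in this abstract, the sketch below runs a proximal stochastic gradient loop with random reshuffling and a constant step-size on a scalar toy problem. It is a single-agent simplification, not the paper's distributed algorithm: the network, the consensus step, and the time-varying topology are all omitted. The quadratic losses, the $\ell_1$ regularizer, and the names `soft_threshold` and `prox_sgd_rr` are assumptions made for the example.

```python
import random


def soft_threshold(x, tau):
    """Proximal operator of tau * |x| (scalar soft-thresholding)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def prox_sgd_rr(b, lam, mu, epochs, seed=0):
    """Minimize (1/N) * sum 0.5 * (w - b_i)**2 + lam * |w| (toy problem).

    Each epoch visits every sample once in a freshly shuffled order and
    applies a gradient step followed by the proximal step.
    """
    rng = random.Random(seed)
    order = list(range(len(b)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(order)  # random reshuffling: sampling without replacement
        for i in order:
            grad = w - b[i]  # gradient of the smooth term for sample i
            w = soft_threshold(w - mu * grad, mu * lam)
    return w


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0]
    # The exact minimizer of this toy problem is soft_threshold(mean(b), lam).
    print(prox_sgd_rr(samples, lam=0.5, mu=0.01, epochs=500))
```

With a constant step-size the iterate is only expected to reach a neighborhood of the minimizer, so the printed value should be close to, but not exactly equal to, the closed-form solution.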
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression
with potentially several passes over the data. While several passes have been
widely reported to perform practically better in terms of predictive
performance on unseen data, the existing theoretical analysis of SGD suggests
that a single pass is statistically optimal. While this is true for
low-dimensional easy problems, we show that for hard problems, multiple passes
lead to statistically optimal predictions while a single pass does not; we also
show that in these hard models, the optimal number of passes over the data
increases with sample size. In order to define the notion of hardness and show
that our predictive performances are optimal, we consider potentially
infinite-dimensional models and notions typically associated to kernel methods,
namely, the decay of eigenvalues of the covariance matrix of the features and
the complexity of the optimal predictor as measured through the covariance
matrix. We illustrate our results on synthetic experiments with non-linear
kernel methods and on a classical benchmark with a linear model.