Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search
Stochastic gradient descent (SGD) is the simplest deep learning optimizer
with which to train deep neural networks. While SGD can use various learning
rates, such as constant or diminishing rates, previous numerical results
showed that SGD performs better than other deep learning optimizers when
it uses learning rates given by line search methods. In this paper, we perform
a convergence analysis on SGD with a learning rate given by an Armijo line
search for nonconvex optimization. The analysis indicates that the upper bound
of the expectation of the squared norm of the full gradient becomes small when
the number of steps and the batch size are large. Next, we show that, for SGD
with the Armijo-line-search learning rate, the number of steps needed for
nonconvex optimization is a monotone decreasing convex function of the batch
size; that is, the number of steps needed for nonconvex optimization decreases
as the batch size increases. Furthermore, we show that the stochastic
first-order oracle (SFO) complexity, which is the stochastic gradient
computation cost, is a convex function of the batch size; that is, there exists
a critical batch size that minimizes the SFO complexity. Finally, we provide
numerical results that support our theoretical results. The numerical results
indicate that the number of steps needed for training deep neural networks
decreases as the batch size increases and that there exist critical batch
sizes that can be estimated from the theoretical results.
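
As a concrete illustration of the learning-rate rule analyzed in the abstract, the sketch below runs SGD with a backtracking Armijo line search on each mini-batch loss. It is a minimal sketch, not the paper's experimental setup: the objective is a toy least-squares problem rather than a deep neural network, and the constants (the sufficient-decrease parameter c1, the backtracking factor, the initial step size) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (kept simple for brevity); the mini-batch
# loss plays the role of the stochastic objective.
X = rng.normal(size=(512, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=512)

def loss(w, idx):
    r = X[idx] @ w - y[idx]
    return 0.5 * np.mean(r ** 2)

def grad(w, idx):
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

def armijo_sgd(w, batch_size=32, steps=500, lr_init=1.0, c1=1e-4, shrink=0.5):
    """SGD whose per-step learning rate is found by backtracking until the
    Armijo (sufficient decrease) condition holds on the current mini-batch."""
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g = grad(w, idx)
        f0 = loss(w, idx)
        lr = lr_init
        # Armijo condition: f(w - lr*g) <= f(w) - c1 * lr * ||g||^2
        while loss(w - lr * g, idx) > f0 - c1 * lr * (g @ g) and lr > 1e-12:
            lr *= shrink
        w = w - lr * g
    return w

w_hat = armijo_sgd(np.zeros(5))
print("estimated weights:", np.round(w_hat, 2))
```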
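
The abstract's claim that the number of steps K(b) is a monotone decreasing convex function of the batch size b, while the SFO complexity K(b)*b is convex with a minimizer (the critical batch size), can be pictured with a hypothetical bound of the matching shape. The functional form and constants below are assumptions made for illustration only, not the paper's derived bound.

```python
import numpy as np

# Hypothetical step-count bound with the qualitative shape described above:
#   K(b) = C1 * b / (eps**2 * b - C2)   for b > C2 / eps**2
# is monotone decreasing and convex in the batch size b, while the SFO
# complexity K(b) * b is convex in b with an interior minimizer.
C1, C2, eps = 1.0e3, 4.0, 0.1   # illustrative constants, not from the paper

def steps_needed(b):
    return C1 * b / (eps ** 2 * b - C2)

def sfo_complexity(b):
    return steps_needed(b) * b   # steps times stochastic gradients per step

batch_sizes = np.arange(int(C2 / eps ** 2) + 1, 5000)
sfo = sfo_complexity(batch_sizes.astype(float))
b_star = batch_sizes[np.argmin(sfo)]

# For this particular form the minimizer is b* = 2 * C2 / eps**2,
# which the numerical search recovers.
print("critical batch size:", b_star, "closed form:", 2 * C2 / eps ** 2)
```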