Inefficiency of K-FAC for Large Batch Size Training
In stochastic optimization, using large batch sizes during training can
leverage parallel resources to produce faster wall-clock training times per
training epoch. However, for both training loss and testing error, recent
results analyzing large batch Stochastic Gradient Descent (SGD) have found
sharp diminishing returns, beyond a certain critical batch size. In the hopes
of addressing this, it has been suggested that the Kronecker-Factored
Approximate Curvature (\mbox{K-FAC}) method allows for greater scalability to
large batch sizes, for non-convex machine learning problems such as neural
network optimization, as well as greater robustness to variation in model
hyperparameters. Here, we perform a detailed empirical analysis of large batch
size training, for both \mbox{K-FAC} and SGD,
evaluating performance in terms of both wall-clock time and aggregate
computational cost. Our main results are twofold: first, we find that neither
\mbox{K-FAC} nor SGD exhibits ideal scaling behavior beyond a certain
batch size, and that \mbox{K-FAC} does not offer improved large-batch
scalability compared to SGD; and second, we find that
\mbox{K-FAC}, in addition to requiring more hyperparameters to tune, suffers
from hyperparameter sensitivity similar to that of SGD. We discuss
extensive results using ResNet and AlexNet on \mbox{CIFAR-10} and SVHN,
respectively, as well as more general implications of our findings.
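
For context on the method under study, the following is a minimal NumPy sketch of the K-FAC preconditioned update for a single fully connected layer. The function name, damping value, and learning rate are illustrative assumptions rather than the paper's implementation; the sketch only shows how the Kronecker factorization of the Fisher block replaces one large matrix inverse with two much smaller ones.

import numpy as np

def kfac_layer_update(grad_W, acts, grad_pre, damping=1e-3, lr=0.1):
    """One K-FAC preconditioned step for a single fully connected layer.

    grad_W   : (out, in)  average gradient of the loss w.r.t. the weights
    acts     : (batch, in)  layer inputs a
    grad_pre : (batch, out) gradients w.r.t. the layer pre-activations s

    K-FAC approximates the Fisher block for this layer by the Kronecker
    product S kron A, with A = E[a a^T] and S = E[s s^T]. Because the
    inverse of a Kronecker product factorizes, the preconditioned gradient
    is (S + damping*I)^-1 @ grad_W @ (A + damping*I)^-1, avoiding an
    (out*in) x (out*in) inverse.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n                    # input covariance,  (in,  in)
    S = grad_pre.T @ grad_pre / n            # output covariance, (out, out)
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    S_inv = np.linalg.inv(S + damping * np.eye(S.shape[0]))
    return -lr * (S_inv @ grad_W @ A_inv)    # preconditioned weight step

In typical implementations the factors A and S are maintained as running averages and their inverses are refreshed only every few iterations; this amortized but nontrivial extra work is the main source of K-FAC's per-iteration overhead relative to SGD.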