Inefficiency of K-FAC for Large Batch Size Training
In stochastic optimization, using large batch sizes during training can
leverage parallel resources to produce faster wall-clock training times per
training epoch. However, for both training loss and testing error, recent
results analyzing large batch Stochastic Gradient Descent (SGD) have found
sharp diminishing returns, beyond a certain critical batch size. In the hopes
of addressing this, it has been suggested that the Kronecker-Factored
Approximate Curvature (\mbox{K-FAC}) method allows for greater scalability to
large batch sizes, for non-convex machine learning problems such as neural
network optimization, as well as greater robustness to variation in model
hyperparameters. Here, we perform a detailed empirical analysis of large batch
size training for both \mbox{K-FAC} and SGD,
evaluating performance in terms of both wall-clock time and aggregate
computational cost. Our main results are twofold: first, we find that neither
\mbox{K-FAC} nor SGD has ideal scalability behavior beyond a certain
batch size, and that \mbox{K-FAC} does not exhibit improved large-batch
scalability behavior, as compared to SGD; and second, we find that
\mbox{K-FAC}, in addition to requiring more hyperparameters to tune, suffers
from hyperparameter sensitivity similar to that of SGD. We discuss
extensive results using ResNet and AlexNet on \mbox{CIFAR-10} and SVHN,
respectively, as well as more general implications of our findings.
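For background, a minimal sketch of the standard \mbox{K-FAC} preconditioned update for a single fully-connected layer (this follows Martens and Grosse's original Kronecker factorization rather than anything specific to the paper above; the damping term $\lambda$ is an illustrative simplification):
\[
F_\ell \approx A_{\ell-1} \otimes G_\ell, \qquad
A_{\ell-1} = \mathbb{E}\!\left[a_{\ell-1} a_{\ell-1}^{\top}\right], \qquad
G_\ell = \mathbb{E}\!\left[g_\ell g_\ell^{\top}\right],
\]
\[
\Delta W_\ell = -\eta\,\left(G_\ell + \sqrt{\lambda}\, I\right)^{-1} \nabla_{W_\ell} \mathcal{L}\,\left(A_{\ell-1} + \sqrt{\lambda}\, I\right)^{-1},
\]
where $a_{\ell-1}$ are the layer's input activations and $g_\ell$ the back-propagated pre-activation gradients. The Kronecker structure lets the layer-wise Fisher block be inverted via two small matrix inverses instead of one large one, which is why the method is attractive at scale; maintaining the factors $A_{\ell-1}$ and $G_\ell$ is also a source of the additional hyperparameters (e.g., damping) mentioned above.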
Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training
While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy
to spin-up large clusters for training ML models, it can also lead to
ballooning costs. The hundreds of virtual machine sizes provided by cloud
platforms also make it extremely challenging to select the ``right'' cloud
cluster configuration for training. Furthermore, the training time and cost of
distributed model training are highly sensitive to the cluster configuration,
and present a large and complex tradeoff space.
In this paper, we develop principled and practical techniques for optimizing
the training time and cost of distributed ML model training on the cloud. Our
key insight is that both parallel and statistical efficiency must be considered
when selecting the optimum job configuration parameters such as the number of
workers and the batch size. By combining conventional parallel scaling concepts
and new insights into SGD noise, our models accurately estimate the time and
cost on different cluster configurations with < 5% error. Using the repetitive
nature of training and our models, we can search for optimum cloud
configurations in a black-box, online manner. Our approach reduces training
times by a factor of 2 and costs by more than 50%. Compared to an oracle-based
approach, our performance models are accurate to within 2% such that the search
imposes an overhead of just 10%.
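As background on how parallel and statistical efficiency interact in such a time/cost model, a minimal sketch (the functional forms and symbols below are illustrative assumptions, not Scavenger's actual estimator):
\[
T(w,b) \approx N(b)\left(\frac{t_{\mathrm{comp}}(b)}{w} + t_{\mathrm{comm}}(w)\right), \qquad
N(b) \approx N_{\min}\left(1 + \frac{B_{\mathrm{noise}}}{b}\right), \qquad
C(w,b) = w\, p_{\mathrm{VM}}\, T(w,b),
\]
where $w$ is the number of workers, $b$ the global batch size, $N(b)$ the number of iterations needed to reach a target loss (the statistical-efficiency term, with gradient-noise scale $B_{\mathrm{noise}}$), $t_{\mathrm{comp}}$ and $t_{\mathrm{comm}}$ the per-iteration compute and communication times (the parallel-efficiency term), and $p_{\mathrm{VM}}$ the hourly price per VM. Searching over $(w,b)$ with such a model trades faster iterations against diminishing statistical returns from large batches, which is the tradeoff space navigated online here.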
- …