Obtaining versions of deep neural networks that are both highly accurate and
highly sparse is one of the main challenges in the area of model compression,
and several high-performance pruning techniques have been investigated by the
community. Yet, much less is known about the interaction between sparsity and
the standard stochastic optimization techniques used to train sparse networks,
and most existing work simply reuses standard dense schedules and
hyperparameters for sparse training. In this work, we examine the
impact of high sparsity on model training using the standard computer vision
and natural language processing sparsity benchmarks. We begin by showing that
using standard dense training recipes for sparse training is suboptimal, and
results in under-training. We provide new approaches for mitigating this issue
for both sparse pre-training of vision models (e.g. ResNet50/ImageNet) and
sparse fine-tuning of language models (e.g. BERT/GLUE), achieving
state-of-the-art results in both settings in the high-sparsity regime, and
providing detailed analyses of the difficulty of sparse training in each
scenario. Our work sets a new threshold for the accuracies that can be
achieved under high sparsity, and should inspire further research into
improving sparse model training, both to reach higher accuracies under high
sparsity and to do so efficiently.