Large, pre-trained models are difficult to deploy in resource-constrained
applications. Fortunately, task-aware structured pruning methods offer a
solution. These approaches reduce model size by dropping structural units,
such as layers and attention heads, in a manner that accounts for the end task.
However, these pruning algorithms require more task-specific data than is
typically available. We propose a framework that combines structured pruning
with transfer learning to reduce the need for task-specific data. Our empirical
results answer questions such as: How should the two tasks be coupled? What
parameters should be transferred? And when during training should transfer
learning be introduced? Leveraging these insights, we demonstrate that our
framework results in pruned models with improved generalization over strong
baselines.