Massive data is often considered essential for deep learning applications,
but it also incurs significant computational and infrastructural costs.
Therefore, dataset pruning (DP) has emerged as an effective way to improve data
efficiency by identifying and removing redundant training samples without
sacrificing performance. In this work, we aim to address the problem of DP for
transfer learning, i.e., how to prune a source dataset for improved pretraining
efficiency and lossless finetuning accuracy on downstream target tasks. To the
best of our knowledge, this problem remains open, as
previous studies have primarily treated DP and transfer learning as separate
problems. By contrast, we establish a unified viewpoint to integrate DP with
transfer learning and find that existing DP methods are not suitable for the
transfer learning paradigm. We then propose two new DP methods, label mapping
and feature mapping, for supervised and self-supervised pretraining settings,
respectively, by revisiting the DP problem through the lens of source-target
domain mapping. Furthermore, we demonstrate the effectiveness of our approach
on numerous transfer learning tasks. We show that 40%~80% of source data
classes can be pruned without sacrificing downstream performance, yielding
a significant 2~5x speed-up during the pretraining stage. Moreover,
our proposal exhibits broad applicability and can improve other computationally
intensive transfer learning techniques, such as adversarial pretraining. Code
is available at https://github.com/OPTML-Group/DP4TL.
Comment: Thirty-seventh Conference on Neural Information Processing Systems
(NeurIPS 2023).
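
As a loose illustration of the class-level pruning idea (not the authors'
actual method or API; all function and variable names below are hypothetical),
a label-mapping-style criterion could score each source class by how often a
frozen source-pretrained classifier predicts it on target-task data, and keep
only the top-scoring source classes for pretraining:

```python
# Minimal sketch, assuming a source-pretrained classifier and a labeled
# target dataloader. Names are illustrative, not from the DP4TL codebase.
import torch


@torch.no_grad()
def score_source_classes(model, target_loader, num_source_classes, device="cpu"):
    """Count how often each source class is the top prediction on target data."""
    model.eval().to(device)
    scores = torch.zeros(num_source_classes)
    for images, _ in target_loader:
        logits = model(images.to(device))  # shape: (batch, num_source_classes)
        preds = logits.argmax(dim=1).cpu()
        scores += torch.bincount(preds, minlength=num_source_classes).float()
    return scores


def prune_source_classes(scores, prune_ratio=0.6):
    """Keep the highest-scoring source classes; prune_ratio=0.6 drops 60%
    of source classes before (re)pretraining on the pruned source set."""
    num_keep = int(len(scores) * (1 - prune_ratio))
    keep = scores.argsort(descending=True)[:num_keep]
    return keep.tolist()
```

Under this sketch, pretraining then proceeds only on source samples whose
labels are in the kept set, which is what produces the pretraining speed-up
reported above.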