On the Connection between Pre-training Data Diversity and Fine-tuning Robustness

Farhadi, Ali; Nguyen, Thao; Oh, Sewoong; Ramanujan, Vivek; Schmidt, Ludwig

Authors: Ali Farhadi, Thao Nguyen, Sewoong Oh, Vivek Ramanujan, Ludwig Schmidt
Publication date: 24 July 2023

Abstract

Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do properties of the pre-training distribution affect the robustness of a fine-tuned model? The properties we explore include the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution. We find that the primary factor influencing downstream effective robustness (Taori et al., 2020) is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4x while increasing the number of images per class by 4x (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCam-WILDS distribution shift as a test for downstream robustness.
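
For readers unfamiliar with the metric, the sketch below illustrates one common way to compute effective robustness: fit a baseline trend relating in-distribution to out-of-distribution performance across reference models, then measure how far a given model sits above that trend. The logit-scale linear fit, the function names, and all numbers are assumptions made for illustration; they are not taken from this paper and may differ from the authors' exact procedure.

```python
import numpy as np

def logit(p):
    # Map scores in (0, 1) to the real line; clip to avoid infinities.
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_baseline(id_scores, ood_scores):
    # Assumed recipe: least-squares line through the baseline models
    # in logit-transformed (ID, OOD) coordinates.
    slope, intercept = np.polyfit(logit(id_scores), logit(ood_scores), deg=1)
    return slope, intercept

def effective_robustness(id_scores, ood_scores, slope, intercept):
    # OOD score minus the OOD score the baseline trend predicts
    # from the model's ID score. Positive means above the trend.
    predicted = sigmoid(slope * logit(id_scores) + intercept)
    return np.asarray(ood_scores, dtype=float) - predicted

# Hypothetical scores for models trained from scratch (the baseline trend).
baseline_id = [0.30, 0.40, 0.50, 0.60]
baseline_ood = [0.20, 0.28, 0.37, 0.47]
slope, intercept = fit_baseline(baseline_id, baseline_ood)

# Hypothetical fine-tuned model: one (ID, OOD) pair.
rho = effective_robustness([0.55], [0.48], slope, intercept)
print(f"effective robustness: {rho[0]:+.3f}")
```

Under this reading, two pre-training distributions that yield fine-tuned models with the same offset from the baseline trend would be considered equally robust, even if their raw out-of-distribution scores differ.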

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2307.12532

Last time updated on 28/07/2023