Data augmentation has become a standard component in pre-training vision
models, where it is used to capture invariance between augmented views. In practice,
augmentation techniques that mask regions of a sample with zero/mean values or
patches from other samples are commonly employed in pre-trained models with
self-/semi-/fully-supervised contrastive losses. However, the underlying
mechanism behind the effectiveness of these augmentation techniques remains
poorly explored. To investigate this problem, we conduct an empirical study to
quantify how data augmentation affects performance. Concretely, we apply four
types of data augmentation, namely Random Erasing, CutOut, CutMix, and
MixUp, to a series of self-/semi-/fully-supervised pre-trained models. We
report their performance on vision tasks such as image classification, object
detection, instance segmentation, and semantic segmentation. We then explicitly
evaluate the invariance and diversity of the feature embedding. We observe
that: 1) Masking regions of the images decreases the invariance of the learned
feature embedding while providing considerably greater diversity. 2) Manual
annotations do not change the invariance or diversity of the learned feature
embedding. 3) The MixUp approach improves the diversity significantly, with
only a marginal decrease in invariance.
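
For readers unfamiliar with the four augmentations named above, the following is a minimal NumPy sketch of how each is commonly formulated. It is not the implementation used in this study. The function names, the shared box sampler, and the hyperparameters (a 25% masked area, a Beta(1, 1) mixing weight) are illustrative assumptions. Images are assumed to be float arrays of shape (H, W, C) and labels one-hot vectors.

```python
import numpy as np


def random_box(height, width, area_frac, rng):
    """Sample a box covering about `area_frac` of the image (assumed sampler)."""
    box_h = max(1, int(round(height * np.sqrt(area_frac))))
    box_w = max(1, int(round(width * np.sqrt(area_frac))))
    top = int(rng.integers(0, height - box_h + 1))
    left = int(rng.integers(0, width - box_w + 1))
    return top, left, box_h, box_w


def cutout(image, box, fill=0.0):
    """CutOut-style masking: fill the box with a constant (zero or a mean value)."""
    top, left, box_h, box_w = box
    out = image.copy()
    out[top:top + box_h, left:left + box_w] = fill
    return out


def random_erasing(image, box, rng):
    """Random-Erasing-style masking: fill the box with uniform random values."""
    top, left, box_h, box_w = box
    out = image.copy()
    region = out[top:top + box_h, left:left + box_w]
    out[top:top + box_h, left:left + box_w] = rng.uniform(
        image.min(), image.max(), size=region.shape
    )
    return out


def cutmix(image_a, label_a, image_b, label_b, box):
    """CutMix: paste the box from image_b into image_a, mix labels by area."""
    top, left, box_h, box_w = box
    out = image_a.copy()
    out[top:top + box_h, left:left + box_w] = image_b[
        top:top + box_h, left:left + box_w
    ]
    lam = 1.0 - (box_h * box_w) / (image_a.shape[0] * image_a.shape[1])
    return out, lam * label_a + (1.0 - lam) * label_b


def mixup(image_a, label_a, image_b, label_b, lam):
    """MixUp: convex combination of two whole images and their labels."""
    image = lam * image_a + (1.0 - lam) * image_b
    label = lam * label_a + (1.0 - lam) * label_b
    return image, label


# Toy usage on synthetic 8x8 images (values are placeholders, not real data).
rng = np.random.default_rng(0)
img_a, img_b = np.ones((8, 8, 3)), np.zeros((8, 8, 3))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

box = random_box(8, 8, area_frac=0.25, rng=rng)
masked = cutout(img_a, box)
erased = random_erasing(img_a, box, rng)
mixed, y_mix = cutmix(img_a, y_a, img_b, y_b, box)
blend, y_blend = mixup(img_a, y_a, img_b, y_b, lam=float(rng.beta(1.0, 1.0)))
```

The sketch makes the distinction drawn in the abstract concrete: CutOut, Random Erasing, and CutMix alter only a local region of the sample, whereas MixUp blends every pixel of two samples.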