Vision Transformers (ViTs) have demonstrated powerful representation ability
in various visual tasks thanks to their intrinsic data-hungry nature. However,
we unexpectedly find that ViTs perform poorly when applied to face
recognition (FR) scenarios with extremely large datasets. We investigate the
reasons for this phenomenon and find that existing data augmentation
approaches and hard sample mining strategies are incompatible with ViT-based FR
backbones because they are not designed to preserve facial structural
information or to exploit the information in each local token. To remedy
these problems, this paper proposes an FR model called TransFace, which
employs a patch-level data augmentation strategy named DPAP and a hard sample
mining strategy named EHSM. Specifically, DPAP randomly perturbs the amplitude
information of dominant patches to expand sample diversity, which effectively
alleviates the overfitting problem in ViTs. EHSM utilizes the information
entropy in the local tokens to dynamically adjust the importance weight of easy
and hard samples during training, leading to more stable predictions.
Experiments on several benchmarks demonstrate the superiority of TransFace.
Code and models are available at https://github.com/DanJun6737/TransFace.
Comment: Accepted by ICCV 202
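
The two ideas described above can be illustrated with a minimal NumPy sketch. It is not the released TransFace implementation. The function names, the `strength` parameter, the way dominant patches are supplied (as a plain list of indices), and the weighting rule (mean token entropy, normalized to average 1, with higher entropy assumed to mark harder samples) are all assumptions made for illustration.

```python
import numpy as np


def perturb_patch_amplitude(image, patch_size, patch_ids, strength=0.1, rng=None):
    """Perturb the Fourier amplitude of selected patches, keeping the phase.

    Hypothetical helper: `patch_ids` are row-major patch indices that the
    caller has already judged to be "dominant". `image` is a 2-D array.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.astype(np.float64).copy()
    cols = image.shape[1] // patch_size
    for pid in patch_ids:
        row, col = divmod(pid, cols)
        ys, xs = row * patch_size, col * patch_size
        patch = out[ys:ys + patch_size, xs:xs + patch_size]
        spectrum = np.fft.fft2(patch)
        amplitude, phase = np.abs(spectrum), np.angle(spectrum)
        # Multiplicative Gaussian noise on the amplitude only (assumed form).
        noise = np.clip(1.0 + strength * rng.standard_normal(amplitude.shape), 0.0, None)
        perturbed = amplitude * noise * np.exp(1j * phase)
        # The perturbed spectrum is no longer conjugate-symmetric, so keep
        # only the real part of the inverse transform.
        out[ys:ys + patch_size, xs:xs + patch_size] = np.fft.ifft2(perturbed).real
    return out


def entropy_sample_weights(token_logits):
    """Per-sample weights from the mean entropy of local-token predictions.

    `token_logits` has shape (batch, tokens, classes). Assumed rule, not the
    paper's formula: weights are proportional to mean token entropy and are
    normalized to average 1 over the batch.
    """
    shifted = token_logits - token_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)  # (batch, tokens)
    per_sample = entropy.mean(axis=1)                        # (batch,)
    return per_sample / (per_sample.mean() + 1e-12)
```

In a training loop, the augmented image would be fed to the backbone and the returned weights would scale each sample's loss; both steps are omitted here because they depend on details the abstract does not give.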