State-of-the-art facial expression recognition (FER) methods currently favor
convolutional neural network (CNN) backbones that are pre-trained with
supervision on face recognition datasets for feature extraction.
However, because face recognition datasets are very large and facial labels
are costly to collect, this pre-training paradigm is expensive. To address
this, we propose to pre-train vision Transformers (ViTs) with a self-supervised
approach on a mid-scale general image dataset. In addition, the domain gap
between general image datasets and FER datasets is larger than the gap between
face datasets and FER datasets. Therefore, we propose a contrastive
fine-tuning approach to mitigate this domain gap.
Specifically, we introduce a novel FER training paradigm named Mask Image
pre-training with MIx Contrastive fine-tuning (MIMIC). In the pre-training
stage, we pre-train the ViT via masked image reconstruction on general images.
In the fine-tuning stage, we introduce a mix-supervised contrastive learning
process, which uses a mixing strategy to provide the model with a wider range
of positive samples. Through extensive experiments
conducted on three benchmark datasets, we demonstrate that MIMIC outperforms
the previous training paradigm, showing its capability to learn better
representations. Remarkably, the results indicate that the vanilla ViT can
achieve impressive performance without intricate, specially designed auxiliary
modules. Moreover, when the model size is scaled up, MIMIC shows no
performance saturation and is superior to the current state-of-the-art
methods.
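
The abstract does not spell out the mix-supervised contrastive objective, so
the following is only a minimal sketch of one plausible reading, not the
authors' formulation. It assumes a mixup-style convex combination with weight
`lam`, and it assumes the mixed sample treats samples from both source classes
as positives, weighted by `lam` and `1 - lam`. All function and parameter names
(`mix_sup_con_loss`, `bank`, `temperature`) are hypothetical. For brevity the
sketch mixes embeddings directly; in practice the mix would more likely be
applied to images before the encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b (mixup-style, an assumption)."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def sup_con_term(anchor, bank, bank_labels, target, temperature):
    """Mean negative log-probability of the bank entries labelled `target`."""
    logits = [sum(p * q for p, q in zip(anchor, z)) / temperature for z in bank]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = peak + math.log(sum(math.exp(s - peak) for s in logits))
    positives = [s - log_denom for s, y in zip(logits, bank_labels) if y == target]
    if not positives:
        return 0.0
    return -sum(positives) / len(positives)


def mix_sup_con_loss(feat_a, label_a, feat_b, label_b, lam,
                     bank, bank_labels, temperature=0.1):
    """Contrastive loss for a mixed anchor with positives from both classes."""
    anchor = normalize(mix(feat_a, feat_b, lam))
    loss_a = sup_con_term(anchor, bank, bank_labels, label_a, temperature)
    loss_b = sup_con_term(anchor, bank, bank_labels, label_b, temperature)
    # Assumed weighting: each source class contributes in proportion to its mix weight.
    return lam * loss_a + (1.0 - lam) * loss_b


# Toy usage: a bank of four unit-norm embeddings from two expression classes.
bank = [normalize(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]
bank_labels = [0, 0, 1, 1]
loss = mix_sup_con_loss([1.0, 0.0], 0, [0.0, 1.0], 1, 0.7, bank, bank_labels)
```

Under these assumptions, a mixed anchor draws positives from two classes at
once, which is one way a mixing strategy could widen the set of positive
samples seen during fine-tuning.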