With the advancement of face manipulation technologies, the importance of
face forgery detection in protecting authentication integrity becomes
increasingly evident. Previous Vision Transformer (ViT)-based detectors have
demonstrated subpar performance in cross-database evaluations, primarily
because full fine-tuning on limited Deepfake data often leads to forgetting
pre-trained knowledge and over-fitting to data-specific patterns. To circumvent
these issues, we propose a novel Forgery-aware Adaptive Vision Transformer
(FA-ViT). In FA-ViT, the vanilla ViT's parameters are frozen to preserve its
pre-trained knowledge, while two specially designed components, the Local-aware
Forgery Injector (LFI) and the Global-aware Forgery Adaptor (GFA), are employed
to adapt forgery-related knowledge. Our proposed FA-ViT effectively combines
these two types of knowledge to form general forgery features for
detecting Deepfakes. Specifically, LFI captures local discriminative
information and incorporates it into the ViT via
Neighborhood-Preserving Cross Attention (NPCA). Simultaneously, GFA learns
adaptive knowledge in the self-attention layer, bridging the gap between the
two different domains. Furthermore, we design a novel Single Domain Pairwise
Learning (SDPL) to facilitate fine-grained information learning in FA-ViT.
Extensive experiments demonstrate that our FA-ViT achieves state-of-the-art
performance in cross-dataset evaluation and cross-manipulation scenarios, and
improves robustness against unseen perturbations.
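
The abstract describes the method only at a high level, but its central parameter-efficient idea, freezing the pre-trained ViT and training only lightweight forgery-aware modules plus the classification head, can be illustrated with a minimal sketch. This is not the authors' implementation: the ForgeryAdapter class, its bottleneck width, the torchvision backbone choice, and the learning rate are hypothetical stand-ins for the LFI/GFA components described above.

```python
# Minimal sketch (assumptions noted above): freeze a pre-trained ViT and
# train only small adapter modules and a binary (real/fake) head.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


class ForgeryAdapter(nn.Module):
    """Hypothetical bottleneck adapter standing in for the LFI/GFA modules."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adaptation: inject forgery-related features without
        # overwriting the frozen pre-trained representation.
        return x + self.up(self.act(self.down(x)))


# Load a pre-trained ViT-B/16 and freeze all of its parameters,
# preserving the pre-trained knowledge the abstract refers to.
backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False

# One adapter per transformer block (ViT-B/16 has 12 blocks, width 768).
# In the full model these would be interleaved with the frozen encoder blocks.
adapters = nn.ModuleList([ForgeryAdapter(dim=768) for _ in range(12)])
head = nn.Linear(768, 2)

# Only the adapters and the head receive gradients.
optimizer = torch.optim.AdamW(
    list(adapters.parameters()) + list(head.parameters()), lr=1e-4
)
```

Because gradients flow only through the adapters and the head, the pre-trained representation is preserved, which is precisely the mechanism the abstract credits for avoiding catastrophic forgetting and over-fitting under limited Deepfake training data.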