DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention

Bera, Aniket; Kharel, Aaditya; Paranjape, Manas

Publication date: 12 September 2023

Abstract

With the rise in manipulated media, deepfake detection has become imperative for preserving the authenticity of digital content. In this paper, we present a novel multi-modal audio-video framework that processes audio and video inputs concurrently for deepfake detection. Our model exploits lip synchronization with the input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network. A transformer encoder network then performs facial self-attention. We conduct multiple ablation studies highlighting different strengths of our approach. Our multi-modal method outperforms state-of-the-art multi-modal deepfake detection techniques in terms of F1 and per-video AUC scores.
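
As a rough illustration of the lip-audio cross-attention idea mentioned in the abstract, the sketch below computes plain scaled dot-product attention in which lip-frame features act as queries and audio-frame features act as keys and values. It is not the authors' implementation: the abstract does not specify layer sizes, projections, or heads, so the learned projection matrices, multi-head structure, VGG-16 feature extractor, and transformer encoder are all omitted, and the names `cross_attention`, `lip_frames`, and `audio_frames` as well as the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    queries: list of vectors from one modality (here: lip frames)
    keys, values: lists of vectors from the other modality (here: audio)
    Returns the attended outputs and the attention weights per query.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 3 lip frames and 2 audio frames, 2 dims each.
lip_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_frames = [[1.0, 0.0], [0.0, 1.0]]

# Lip features query the audio features (keys and values share one source).
fused, attn = cross_attention(lip_frames, audio_frames, audio_frames)
```

Each row of `attn` is a probability distribution over the audio frames for one lip frame, and `fused` holds one audio-informed vector per lip frame. In a real model the inputs would first pass through learned query, key, and value projections.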

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2309.06511

Last time updated on 08/10/2023