With the rise of manipulated media, deepfake detection has become essential
for preserving the authenticity of digital content. In this
paper, we present a novel multi-modal audio-video framework that processes
audio and video inputs jointly for deepfake detection. Our
model exploits lip synchronization with the input audio through a
cross-attention mechanism and extracts visual cues with a fine-tuned VGG-16
network. A transformer encoder then performs facial self-attention. We
conduct multiple ablation studies highlighting
different strengths of our approach. Our multi-modal methodology outperforms
state-of-the-art multi-modal deepfake detection techniques in terms of F1 and
per-video AUC scores.
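
As a rough illustration of the cross-attention step mentioned above, the
following minimal sketch computes scaled dot-product attention in which lip
features act as queries and audio features act as keys and values. The
abstract does not state which modality supplies the queries, so that
direction is an assumption, as are the function names, the toy feature
values, and the omission of learned projection matrices. It is a sketch of
the general technique, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: one vector per lip frame (hypothetical visual features)
    keys, values: one vector per audio frame (hypothetical audio features)
    Returns one audio-informed vector per lip frame and the attention weights.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this lip frame to every audio frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        # Weighted average of the audio value vectors.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (assumed): 2 lip frames, 3 audio frames, feature dimension 2.
lip_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(lip_feats, audio_feats, audio_feats)
```

Each row of `attn` is a probability distribution over audio frames, so a lip
frame draws most heavily on the audio frames whose features align with it.
In a trained model the queries, keys, and values would normally pass through
learned linear projections first.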