DeepFake-based digital facial forgery threatens public media security, especially now that lip manipulation is used in talking face generation, which further increases the difficulty of fake video detection. Because only the lip shape is altered to match the given speech, identity-related facial features in such fake talking face videos are hard to discriminate. Combined with the common neglect of the audio stream as prior knowledge, failure to detect fake talking face videos becomes almost inevitable. It is found that the optical flow of a fake talking face video is disordered, especially in the lip region, whereas the optical flow of a real video changes regularly, which indicates that motion features derived from optical flow are useful for capturing manipulation cues.
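As an illustration only (the abstract does not specify the optical flow estimator used), a dense flow field between consecutive frames can be computed with OpenCV's Farneback method and restricted to the lip region; the lip bounding box and summary statistics below are assumptions, not the paper's method.

```python
import cv2
import numpy as np

def lip_motion_feature(prev_frame, next_frame, lip_box):
    """Dense optical flow over the lip region as a simple motion cue.

    lip_box: hypothetical (x, y, w, h) lip bounding box, e.g. from a
    face landmark detector (not specified in the paper).
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Farneback dense optical flow: one (dx, dy) vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    x, y, w, h = lip_box
    lip_flow = flow[y:y + h, x:x + w]          # restrict to lip region
    magnitude = np.linalg.norm(lip_flow, axis=-1)
    # Disordered (high-variance) lip motion is a potential manipulation cue.
    return magnitude.mean(), magnitude.std()
```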
In this study, a fake talking face detection network (FTFDNet) is proposed, which incorporates visual, audio, and motion features through an efficient cross-modal fusion (CMF) module. Furthermore, a novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features; owing to its modular design, it can be seamlessly integrated into any audio-visual CNN architecture.
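The abstract does not give the AVAM formulation; purely as a sketch of what a modular, drop-in audio-visual attention block could look like, the following PyTorch module gates visual feature channels with audio-derived weights. All layer shapes, names, and the gating design are assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    """Illustrative drop-in block: channel attention gated by audio features."""

    def __init__(self, visual_channels, audio_dim):
        super().__init__()
        # Project the audio embedding to one gate per visual channel.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),
        )

    def forward(self, visual, audio):
        # visual: (B, C, H, W) feature map; audio: (B, D) embedding.
        weights = self.gate(audio)                       # (B, C)
        return visual * weights.unsqueeze(-1).unsqueeze(-1)

# Usage: insert between any two convolutional stages of an
# audio-visual CNN backbone (shapes here are hypothetical).
fmap = torch.randn(2, 64, 28, 28)
aud = torch.randn(2, 128)
attn = AudioVisualAttention(visual_channels=64, audio_dim=128)
out = attn(fmap, aud)        # same shape as fmap: (2, 64, 28, 28)
```

Because the block preserves the visual feature map's shape, it can be spliced into an existing architecture without changing any surrounding layers, which is one plausible reading of the claimed "seamless integration by modularization."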
With the additional AVAM, the proposed FTFDNet achieves better detection performance than other state-of-the-art DeepFake video detection methods, not only on the established fake talking face detection dataset (FTFDD) but also on the DeepFake video detection datasets DFDC and DF-TIMIT.