AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech in one
language into speech in another, and has demonstrated significant progress to
date. Despite this recent success, current S2ST models still degrade markedly
in noisy environments and fail to translate visual speech (i.e., the movement
of lips and teeth). In this work, we present AV-TranSpeech, the first
audio-visual speech-to-speech translation (AV-S2ST) model that does not rely
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to improve system robustness and opens up a host of practical
applications, such as dictation or dubbing archival films. To mitigate the
scarcity of parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representations, and 2) introduce cross-modal distillation with S2ST models
trained on audio-only corpora to further reduce the requirement for visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.io
Comment: Accepted to ACL 2023
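
The cross-modal distillation described in the abstract above can be pictured as a teacher-student setup: a frozen S2ST model trained on audio-only corpora provides soft unit targets that supervise the audio-visual student on the small parallel AV-S2ST set. The following is a minimal PyTorch sketch of that loss only; the module names, dimensions, additive fusion, and temperature are illustrative assumptions, not the AV-TranSpeech implementation.

    import torch
    import torch.nn.functional as F

    # Minimal stand-in for a speech-unit decoder head; names and sizes are assumptions.
    class UnitDecoder(torch.nn.Module):
        def __init__(self, dim=256, n_units=1000):
            super().__init__()
            self.proj = torch.nn.Linear(dim, n_units)

        def forward(self, h):
            # h: (batch, time, dim) -> logits over discrete speech units
            return self.proj(h)

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soft-label KL between the audio-only teacher and the audio-visual
        # student, scaled by T^2 as in standard knowledge distillation.
        t = temperature
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

    # Usage sketch: the teacher sees audio features only; the student fuses
    # audio and lip-video features (a crude additive fusion for illustration).
    B, T, D = 4, 50, 256
    audio_feat = torch.randn(B, T, D)
    video_feat = torch.randn(B, T, D)

    teacher = UnitDecoder(D)   # assumed pretrained on audio-only S2ST data, kept frozen
    student = UnitDecoder(D)   # trained on the small AV-S2ST set

    with torch.no_grad():
        teacher_logits = teacher(audio_feat)

    student_logits = student(audio_feat + video_feat)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
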
A Unified Framework for Modality-Agnostic Deepfakes Detection
As AI-generated content (AIGC) thrives, deepfakes have expanded from
single-modality falsification to cross-modal fake content creation, where
either the audio or the visual component can be manipulated. While a pair of
unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues
may be overlooked. Existing multimodal deepfake detection methods typically establish
correspondence between the audio and visual modalities for binary real/fake
classification, and require the co-occurrence of both modalities. However, in
real-world multi-modal applications, missing modality scenarios may occur where
either modality is unavailable. In such cases, audio-visual detection methods
are less practical than two independent unimodal methods. Moreover, the
detector cannot always know the number or type of manipulated modalities
beforehand, which necessitates a fake-modality-agnostic audio-visual detector. In
this work, we introduce a comprehensive framework that is agnostic to fake
modalities, which facilitates the identification of multimodal deepfakes and
handles situations with missing modalities, regardless of the manipulations
embedded in audio, video, or even cross-modal forms. To enhance the modeling of
cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as
a preliminary task. This efficiently extracts speech correlations across
modalities, a feature challenging for deepfakes to replicate. Additionally, we
propose a dual-label detection approach that follows the structure of AVSR to
support the independent detection of each modality. Extensive experiments on
three audio-visual datasets show that our scheme outperforms state-of-the-art
detection methods with promising performance on modality-agnostic audio/video
deepfakes.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
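
One way to read the dual-label idea in the abstract above is that the detector emits a separate real/fake decision per modality instead of a single clip-level label, so each stream can still be judged when the other is missing. Below is a hypothetical PyTorch sketch of such a head and a masked per-modality loss; the backbone, dimensions, and masking scheme are assumptions rather than the paper's architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualLabelHead(nn.Module):
        # One real/fake classifier per modality on top of shared features.
        def __init__(self, dim=512):
            super().__init__()
            self.audio_cls = nn.Linear(dim, 2)
            self.visual_cls = nn.Linear(dim, 2)

        def forward(self, audio_emb, visual_emb):
            return self.audio_cls(audio_emb), self.visual_cls(visual_emb)

    def dual_label_loss(a_logits, v_logits, a_label, v_label, a_mask, v_mask):
        # Per-modality cross-entropy; masks zero out samples whose modality is missing.
        loss_a = F.cross_entropy(a_logits, a_label, reduction="none") * a_mask
        loss_v = F.cross_entropy(v_logits, v_label, reduction="none") * v_mask
        denom = (a_mask.sum() + v_mask.sum()).clamp(min=1.0)
        return (loss_a.sum() + loss_v.sum()) / denom

    # Usage sketch: sample 1 has no audio track, so its audio loss is masked out.
    head = DualLabelHead(dim=512)
    audio_emb = torch.randn(2, 512)
    visual_emb = torch.randn(2, 512)
    a_logits, v_logits = head(audio_emb, visual_emb)

    a_label = torch.tensor([0, 0])     # 0 = real, 1 = fake (audio stream)
    v_label = torch.tensor([1, 0])     # the visual stream of sample 0 is fake
    a_mask = torch.tensor([1.0, 0.0])  # audio missing for sample 1
    v_mask = torch.tensor([1.0, 1.0])

    loss = dual_label_loss(a_logits, v_logits, a_label, v_label, a_mask, v_mask)

A per-modality loss of this shape also covers the cross-modal case, since both labels can be fake at once, and a fully observed real clip simply contributes two "real" targets.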