48 research outputs found
SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines
Nowadays, as more and more systems achieve good performance in traditional
voice conversion (VC) tasks, people's attention gradually turns to VC tasks
under extreme conditions. In this paper, we propose a novel method for
zero-shot voice conversion. We aim to obtain intermediate representations for
speaker-content disentanglement of speech to better remove speaker information
and get pure content information. Accordingly, our proposed framework contains
a module that removes the speaker information from the acoustic feature of the
source speaker. Moreover, speaker information control is added to our system to
maintain the voice cloning performance. The proposed system is evaluated by
subjective and objective metrics. Results show that our proposed system
significantly reduces the trade-off problem in zero-shot voice conversion,
while it also manages to have high spoofing power to the speaker verification
system
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one
language into another, and has demonstrated significant progress to date.
Despite the recent success, current S2ST models still suffer from distinct
degradation in noisy environments and fail to translate visual speech (i.e.,
the movement of lips and teeth). In this work, we present AV-TranSpeech, the
first audio-visual speech-to-speech (AV-S2ST) translation model without relying
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to promote system robustness and opens up a host of practical
applications: dictation or dubbing archival films. To mitigate the data
scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representation, and 2) introduce cross-modal distillation with S2ST models
trained on the audio-only corpus to further reduce the requirements of visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.ioComment: Accepted to ACL 202