Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences
Speaking rate refers to the average number of phonemes per unit time, while
rhythmic patterns refer to the duration distributions of different phonemes
realized within different phonetic structures. Both are key components of
prosody in speech, and both differ from speaker to speaker.
Models such as the cycle-consistent adversarial network (Cycle-GAN) and the
variational auto-encoder (VAE) have been successfully applied to voice conversion tasks
without parallel data. However, due to the neural network architectures and
feature vectors chosen for these approaches, the length of the predicted
utterance has to be fixed to that of the input utterance, which limits the
flexibility in mimicking the speaking rates and rhythmic patterns for the
target speaker. On the other hand, sequence-to-sequence learning models have been
used to remove the above length constraint, but they require parallel training data.
In this paper, we propose an approach utilizing a sequence-to-sequence model
trained with an unsupervised Cycle-GAN to perform the transformation between
phoneme posteriorgram sequences of different speakers. In this way, the length
constraint mentioned above is removed, offering rhythm-flexible voice conversion
without requiring parallel data. Preliminary evaluation on two datasets showed
very encouraging results.
Comment: 8 pages, 6 figures, submitted to SLT 2018
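To make the core objective concrete, here is a minimal PyTorch sketch of a cycle-consistency loss over phoneme posteriorgram sequences with length-changing seq2seq generators. The generator architecture, tensor shapes, and decoding scheme are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqGenerator(nn.Module):
    """Hypothetical generator: maps a posteriorgram sequence (B, T_in,
    n_phones) to one of a *different* length, so rhythm can change."""
    def __init__(self, n_phones=72, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_phones, hidden, batch_first=True)
        self.decoder = nn.GRU(n_phones, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_phones)

    def forward(self, ppg, out_len):
        _, h = self.encoder(ppg)                  # summarize the source
        frame = ppg.new_zeros(ppg.size(0), 1, ppg.size(2))
        outputs = []
        for _ in range(out_len):                  # decode to any length
            o, h = self.decoder(frame, h)
            frame = F.softmax(self.proj(o), dim=-1)
            outputs.append(frame)
        return torch.cat(outputs, dim=1)

g_ab, g_ba = Seq2SeqGenerator(), Seq2SeqGenerator()
ppg_a = torch.rand(1, 120, 72).softmax(-1)        # speaker-A posteriorgram

fake_b = g_ab(ppg_a, out_len=100)                 # 120 frames -> 100 frames
cycled_a = g_ba(fake_b, out_len=120)              # map back at A's timing
cycle_loss = F.l1_loss(cycled_a, ppg_a)           # cycle-consistency term
# Full objective (omitted): cycle_loss plus adversarial losses from
# discriminators on each speaker's posteriorgram domain.
```

Because the decoding length is chosen independently of the input length, the converted posteriorgram can follow the target speaker's rhythm; the adversarial terms push converted sequences toward the target speaker's distribution.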
AdVerb: Visually Guided Audio Dereverberation
We present AdVerb, a novel audio-visual dereverberation framework that uses
visual cues in addition to the reverberant sound to estimate clean audio.
Although audio-only dereverberation is a well-studied problem, our approach
additionally incorporates the complementary visual modality. Given an image
of the environment in which the reverberant sound
signal has been recorded, AdVerb employs a novel geometry-aware cross-modal
transformer architecture that captures scene geometry and audio-visual
cross-modal relationships to generate a complex ideal ratio mask which, when
applied to the reverberant audio, predicts the clean sound. The effectiveness of
our method is demonstrated through extensive quantitative and qualitative
evaluations. Our approach significantly outperforms traditional audio-only and
audio-visual baselines on three downstream tasks: speech enhancement, speech
recognition, and speaker verification, with relative improvements in the range
of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly
satisfactory RT60 error scores on the AVSpeech dataset.
Comment: Accepted at ICCV 2023. For the project page, see
https://gamma.umd.edu/researchdirections/speech/adver
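As a rough illustration of the output stage described above (not AdVerb's actual architecture), the sketch below applies a given complex ratio mask to a reverberant signal's STFT and inverts the result; the identity mask and shapes are placeholders for the transformer's prediction:

```python
import torch

def apply_cirm(reverb_wav, mask_real, mask_imag, n_fft=512, hop=128):
    """Multiply the reverberant STFT by a complex ratio mask and invert.

    The complex product rescales and phase-rotates every time-frequency
    bin, which is what lets a cIRM correct both magnitude and phase.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(reverb_wav, n_fft, hop_length=hop,
                      window=window, return_complex=True)
    clean_spec = spec * torch.complex(mask_real, mask_imag)
    return torch.istft(clean_spec, n_fft, hop_length=hop, window=window)

wav = torch.randn(16000)                 # 1 s of 16 kHz audio (placeholder)
frames = 1 + 16000 // 128                # centered STFT frame count
m_r = torch.ones(512 // 2 + 1, frames)   # identity mask as a stand-in
m_i = torch.zeros_like(m_r)
enhanced = apply_cirm(wav, m_r, m_i)
```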
Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's
voice without adaptation parameters. By quantizing speech waveform into
discrete acoustic tokens and modeling these tokens with the language model,
recent language model-based TTS models show zero-shot speaker adaptation
capabilities with only a 3-second acoustic prompt of an unseen speaker.
However, they are limited by the length of the acoustic prompt, which makes it
difficult to clone the personal speaking style. In this paper, we propose a
novel zero-shot TTS model with multi-scale acoustic prompts, based on the
neural codec language model VALL-E. A speaker-aware text encoder is proposed
to learn the personal speaking style at the phoneme level from a style prompt
consisting of multiple sentences. Following that, a VALL-E-based acoustic
decoder is utilized to model the timbre at the frame level from the timbre
prompt and to generate speech. The experimental results show that our
proposed method outperforms baselines in terms of naturalness and speaker
similarity, and can achieve better performance by scaling out to a longer style
prompt.
Comment: Submitted to ICASSP 2024
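A minimal sketch of the two-scale conditioning idea, with hypothetical shapes and modules standing in for the paper's components: the multi-sentence style prompt is attended to at the phoneme level, while the short timbre prompt is consumed at the frame level.

```python
import torch
import torch.nn as nn

d = 256
phones = torch.randn(1, 40, d)          # embeddings of the target text
style_prompt = torch.randn(1, 600, d)   # tokens from several style sentences
timbre_prompt = torch.randn(1, 150, d)  # ~3 s of frame-level acoustic tokens

# Phoneme level: the "speaker-aware text encoder" is sketched here as
# cross-attention from phonemes to the long style prompt.
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
style_aware, _ = cross_attn(query=phones, key=style_prompt, value=style_prompt)
text_repr = phones + style_aware        # phonemes enriched with style

# Frame level: a VALL-E-style decoder consumes the timbre prompt first,
# then the style-aware text, before generating acoustic tokens (the
# autoregressive token loop is omitted).
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
hidden = decoder(torch.cat([timbre_prompt, text_repr], dim=1))
```

Separating the two prompts is the point: lengthening the style prompt only grows the cross-attention context, not the decoder's acoustic history.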
A³T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing
Recently, speech representation learning has improved many speech-related
tasks such as speech recognition, speech classification, and speech-to-text
translation. However, all of the above tasks concern speech understanding,
while in the inverse direction, speech synthesis, the potential of
representation learning has yet to be realized, due to the challenging nature
of generating high-quality speech. To address this problem, we propose our
framework, Alignment-Aware Acoustic-Text Pretraining (A³T), which
reconstructs masked acoustic signals with text input and acoustic-text
alignment during training. In this way, the pretrained model can generate
high-quality reconstructed spectrograms, which can be applied directly to
speech editing and unseen-speaker TTS. Experiments show that A³T outperforms
SOTA models on speech editing and improves multi-speaker speech synthesis
without an external speaker verification model.
Comment: under review, 12 pages, 10 figures
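For intuition, here is a toy version of the masked-reconstruction objective under assumed shapes; the real A³T model also attends to phoneme embeddings placed at their aligned frame positions, which is elided here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

mel = torch.randn(1, 200, 80)                 # (batch, frames, mel bins)
mask = torch.zeros(1, 200, dtype=torch.bool)
mask[:, 60:120] = True                        # hide a contiguous span

masked_mel = mel.masked_fill(mask.unsqueeze(-1), 0.0)

# Stand-in encoder; the alignment-aware model would also condition on
# text placed at the masked frames' aligned positions.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True),
    num_layers=2)
recon = encoder(masked_mel)

# The loss is computed only on the masked frames, forcing the model to
# infer the missing acoustics from surrounding (and textual) context.
loss = F.l1_loss(recon[mask], mel[mask])
```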
An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification
In recent years, the self-supervised learning paradigm has received extensive
attention due to its great success on various downstream tasks. However,
fine-tuning strategies for adapting such pre-trained models to the speaker
verification task have yet to be fully explored. In this paper, we analyze
several feature extraction approaches built on top of a pre-trained model, as
well as regularization and learning rate schedules that stabilize the
fine-tuning process and further boost performance. Multi-head factorized
attentive pooling is proposed to factorize the comparison of speaker
representations into multiple phonetic clusters. We regularize towards the
parameters of the pre-trained model and set a different learning rate for each
of its layers during fine-tuning. The experimental results show that our method
can significantly shorten the training time to 4 hours and achieve SOTA
performance: 0.59%, 0.79% and 1.77% EER on Vox1-O, Vox1-E and Vox1-H,
respectively.
Comment: Accepted by SLT 2022
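The layer-wise learning rates and the regularization toward the pre-trained parameters can both be expressed with standard PyTorch parameter groups; the 12-layer stack and the hyperparameter values below are placeholder assumptions:

```python
import copy
import torch
import torch.nn as nn

# Placeholder pre-trained transformer: a 12-layer encoder stack.
pretrained = nn.ModuleList(
    nn.TransformerEncoderLayer(256, nhead=4, batch_first=True)
    for _ in range(12))

# Layer-wise learning rates: the top layer trains at base_lr, and each
# layer below it is geometrically slower, keeping early features stable.
base_lr, decay = 1e-4, 0.8
groups = [{"params": layer.parameters(),
           "lr": base_lr * decay ** (len(pretrained) - 1 - i)}
          for i, layer in enumerate(pretrained)]
opt = torch.optim.AdamW(groups)

# Regularize toward the pre-trained weights (L2 on the drift) instead of
# plain weight decay toward zero.
init = copy.deepcopy(pretrained).requires_grad_(False)

def drift_penalty(model, ref, scale=1e-3):
    return scale * sum((p - p0).pow(2).sum()
                       for p, p0 in zip(model.parameters(), ref.parameters()))
# During fine-tuning: loss = task_loss + drift_penalty(pretrained, init)
```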