FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency
Text-based speech editing (TSE) techniques are designed to enable users to
edit the output audio by modifying the input text transcript instead of the
audio itself. Despite much progress in neural network-based TSE, current
techniques focus on reducing the difference between the generated speech
segment and the reference target in the editing region, ignoring the segment's
local and global fluency within the context and the original utterance. To
maintain speech fluency, we propose a fluency-aware speech editing model,
termed \textit{FluentEditor}, which introduces fluency-aware training criteria
into TSE training. Specifically, the \textit{acoustic consistency constraint}
aims to make the transition between the edited region and its neighboring
acoustic segments as smooth as in the ground truth, while the \textit{prosody
consistency constraint} seeks to ensure that the prosody attributes within the
edited regions remain consistent with the overall style of the original
utterance. The subjective and objective experimental results on VCTK
demonstrate that our \textit{FluentEditor} outperforms all advanced baselines
in terms of naturalness and fluency. The audio samples and code are available
at \url{https://github.com/Ai-S2-Lab/FluentEditor}.
Comment: Submitted to ICASSP'202
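To make the two constraints concrete, here is a minimal sketch of how such fluency-aware criteria could be computed on mel-spectrograms. The function names, the delta-based boundary smoothness, and the mean/std proxy for prosodic style are illustrative assumptions; the paper defines its criteria over its own acoustic and prosody features.

```python
import torch
import torch.nn.functional as F

def acoustic_consistency_loss(pred_mel, gt_mel, left, right, k=3):
    # Sketch: frame-to-frame deltas around each edit boundary should match
    # the ground truth, smoothing the transition into the context.
    # pred_mel, gt_mel: (T, n_mels); left/right: edit-region frame indices.
    loss = pred_mel.new_zeros(())
    for b in (left, right):
        lo, hi = max(b - k, 1), min(b + k, pred_mel.size(0))
        pred_delta = pred_mel[lo:hi] - pred_mel[lo - 1:hi - 1]
        gt_delta = gt_mel[lo:hi] - gt_mel[lo - 1:hi - 1]
        loss = loss + F.l1_loss(pred_delta, gt_delta)
    return loss

def prosody_consistency_loss(pred_mel, gt_mel, left, right):
    # Sketch: match mean/std of the edited region to the whole utterance,
    # a crude proxy for keeping the region's prosodic style consistent.
    region = pred_mel[left:right]
    return (F.l1_loss(region.mean(0), gt_mel.mean(0))
            + F.l1_loss(region.std(0), gt_mel.std(0)))
```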
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a
speech sample with the voice characteristic of an unseen speaker. The main
challenge of ZSM-TTS is to increase the overall speaker similarity for unseen
speakers. One of the most successful speaker conditioning methods for
flow-based multi-speaker text-to-speech (TTS) models is to utilize the
functions which predict the scale and bias parameters of the affine coupling
layers according to the given speaker embedding vector. In this letter, we
improve on the previous speaker conditioning method by introducing a
speaker-normalized affine coupling (SNAC) layer which allows for unseen speaker
speech synthesis in a zero-shot manner leveraging a normalization-based
conditioning technique. The newly designed coupling layer explicitly normalizes
the input by the parameters predicted from a speaker embedding vector while
training, enabling an inverse process of denormalizing for a new speaker
embedding at inference. The proposed conditioning scheme yields the
state-of-the-art performance in terms of the speech quality and speaker
similarity in a ZSM-TTS setting.Comment: Accepted to IEEE Signal Processing Letter
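A minimal sketch of such a speaker-normalized coupling layer, assuming per-channel mean and log-std predicted from the speaker embedding and a simple convolutional coupling net (both stand-ins, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class SNACCoupling(nn.Module):
    # Sketch of a speaker-normalized affine coupling layer. The forward
    # (training) pass normalizes the input with speaker-predicted stats
    # before the coupling; the inverse (synthesis) pass denormalizes with
    # a new speaker's stats. Layer sizes are illustrative assumptions.
    def __init__(self, channels, spk_dim, hidden=256):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2
        self.stat_proj = nn.Linear(spk_dim, 2 * channels)  # per-channel (m, log-s)
        self.net = nn.Sequential(                          # coupling net stand-in
            nn.Conv1d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2 * half, 3, padding=1),
        )

    def _stats(self, spk):                    # spk: (B, spk_dim)
        m, logs = self.stat_proj(spk).chunk(2, dim=-1)
        return m.unsqueeze(-1), logs.unsqueeze(-1)  # broadcast over time

    def forward(self, x, spk):                # x: (B, C, T), training direction
        m, logs = self._stats(spk)
        x = (x - m) * torch.exp(-logs)        # speaker normalization
        xa, xb = x.chunk(2, dim=1)
        t, s = self.net(xa).chunk(2, dim=1)
        return torch.cat([xa, (xb + t) * torch.exp(s)], dim=1)

    def inverse(self, z, spk):                # synthesis direction
        za, zb = z.chunk(2, dim=1)
        t, s = self.net(za).chunk(2, dim=1)
        x = torch.cat([za, zb * torch.exp(-s) - t], dim=1)
        m, logs = self._stats(spk)
        return x * torch.exp(logs) + m        # denormalize with new speaker
```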
Location, location: Enhancing the evaluation of text-to-speech synthesis using the rapid prosody transcription paradigm
Text-to-Speech synthesis systems are generally evaluated using Mean Opinion
Score (MOS) tests, where listeners score samples of synthetic speech on a
Likert scale. A major drawback of MOS tests is that they offer only a general
measure of overall quality (i.e., the naturalness of an utterance) and so
cannot tell us exactly where synthesis errors occur. This can make evaluation of the
appropriateness of prosodic variation within utterances inconclusive. To
address this, we propose a novel evaluation method based on the Rapid Prosody
Transcription paradigm. This allows listeners to mark the locations of errors
in an utterance in real time, providing a probabilistic representation of the
perceptual errors that occur in the synthetic signal. We conduct experiments
that confirm that the fine-grained evaluation can be mapped to system rankings
of standard MOS tests, but the error marking gives a much more comprehensive
assessment of synthesized prosody. In particular, for standard audiobook test
set samples, we see that error marks consistently cluster around words at major
prosodic boundaries indicated by punctuation. However, for question-answer
based stimuli, where we control information structure, we see differences
emerge in the ability of neural TTS systems to generate context-appropriate
prosodic prominence.
Comment: Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en
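The aggregation step is straightforward: pooling many listeners' real-time error marks yields a per-word error probability. A small sketch under an assumed data layout (each listener's marks as a set of word indices):

```python
from collections import Counter

def error_probabilities(marks_per_listener, n_words):
    # marks_per_listener: one set of flagged word indices per listener
    # (assumed layout). Returns the fraction of listeners flagging each word.
    counts = Counter(i for marks in marks_per_listener for i in marks)
    n = len(marks_per_listener)
    return [counts[i] / n for i in range(n_words)]

# Example: 4 listeners, 6-word utterance; word 2 sits at a prosodic boundary.
marks = [{2}, {2, 5}, {2}, set()]
print(error_probabilities(marks, 6))  # [0.0, 0.0, 0.75, 0.0, 0.0, 0.25]
```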
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
Stutter removal is an essential scenario in the field of speech editing.
However, when the speech recording contains stutters, existing text-based
speech editing approaches still suffer from: 1) over-smoothing in the edited
speech; 2) a lack of robustness to the noise introduced by stuttering; and 3)
the need for users to manually determine the edited region in order to remove
the stutters. To tackle the challenges in stutter removal, we propose
FluentSpeech, a stutter-oriented automatic speech editing model. Specifically,
1) we propose a context-aware diffusion model that iteratively refines the
modified mel-spectrogram with the guidance of context features; 2) we introduce
a stutter predictor module to inject the stutter information into the hidden
sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE)
dataset that contains spontaneous speech recordings with time-aligned stutter
labels to train the automatic stutter localization model. Experimental results
on VCTK and LibriTTS datasets demonstrate that our model achieves
state-of-the-art performance on speech editing. Further experiments on our SASE
dataset show that FluentSpeech can effectively improve the fluency of
stuttering speech in terms of objective and subjective metrics. Code and audio
samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.
Comment: Accepted by ACL 2023 (Findings)
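The inpainting-style use of the diffusion model can be sketched as follows; the `model(x_t, t, cond)` noise-prediction interface and the linear beta schedule are assumptions, not FluentSpeech's exact design:

```python
import torch

@torch.no_grad()
def edit_region_ddpm(model, mel, mask, cond, steps=100):
    # Sketch of context-aware diffusion inpainting for speech editing:
    # only frames where mask == 1 are resynthesized, and the unedited
    # frames are clamped to the original at every step so the model's
    # context features always see clean neighbors.
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(mel)
    for t in reversed(range(steps)):
        x = mel * (1 - mask) + x * mask                  # keep context fixed
        eps = model(x, t, cond)                          # predict noise (assumed API)
        coef = betas[t] / torch.sqrt(1.0 - abar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])     # DDPM posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return mel * (1 - mask) + x * mask
```

In FluentSpeech the edited region need not be user-specified: the stutter predictor supplies the mask automatically for stutter removal.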
Defend Data Poisoning Attacks on Voice Authentication
With the advances in deep learning, speaker verification has achieved very
high accuracy and is gaining popularity as a biometric authentication option
in many areas of daily life, especially the growing market of web services.
Compared to traditional passwords, "vocal passwords" are much more convenient,
as they relieve people from memorizing different passwords. However,
new machine learning attacks are putting these voice authentication systems at
risk. Without a strong security guarantee, attackers could access legitimate
users' web accounts by fooling the deep neural network (DNN) based voice
recognition models. In this paper, we demonstrate an easy-to-implement data
poisoning attack on voice authentication systems that existing defense
mechanisms can hardly detect. We therefore propose a more robust defense
method, called Guardian, which is a convolutional neural network-based
discriminator. The Guardian discriminator integrates a series of novel
techniques including bias reduction, input augmentation, and ensemble learning.
Our approach is able to distinguish about 95% of attacked accounts from normal
accounts, far more effective than existing approaches, which achieve only 60%
accuracy.
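As a rough illustration of the discriminator-plus-ensemble idea, here is a sketch where each ensemble member is a small CNN over an account's utterance embeddings and the input augmentation is simple jitter; all sizes and the feature layout are assumptions:

```python
import torch
import torch.nn as nn

class AccountDiscriminator(nn.Module):
    # One ensemble member: a small CNN that classifies an account's set of
    # utterance embeddings as clean or poisoned. The feature layout
    # (emb_dim x N utterances per account) is an illustrative assumption.
    def __init__(self, emb_dim=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(emb_dim, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, x):                 # x: (B, emb_dim, N)
        return self.net(x).squeeze(-1)    # logit: > 0 means "attacked"

def ensemble_predict(models, x, noise=0.01):
    # Input augmentation + ensemble vote: each member scores a lightly
    # jittered copy, and the averaged logit gives the final decision.
    logits = torch.stack([m(x + noise * torch.randn_like(x)) for m in models])
    return (logits.mean(0) > 0).long()    # 1 = flagged as attacked account
```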