Conversational End-to-End TTS for Voice Agent
End-to-end neural TTS has achieved superior performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains challenging due to limitations in both corpus and modeling capability. This study aims to build a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. We first construct a
spontaneous conversational speech corpus well designed for the voice agent with
a new recording scheme ensuring both recording quality and conversational
speaking style. Second, we propose a conversation-context-aware end-to-end TTS approach with an auxiliary encoder and a conversational context encoder that reinforce information about both the current utterance and its context in the conversation. Experimental results show that the proposed
methods produce more natural prosody in accordance with the conversational
context, with significant preference gains at both utterance-level and
conversation-level. Moreover, we find that the model has the ability to express
some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
Comment: Accepted by SLT 2021; 7 pages
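As an illustration of the kind of context conditioning described above, here is a minimal PyTorch sketch of an encoder that combines a current-utterance encoding with a summary of the preceding conversation turns. Module names, dimensions, and the use of pretrained sentence embeddings for the context turns are assumptions for illustration, not the paper's implementation.
```python
# Minimal sketch (not the authors' code) of conditioning a seq2seq TTS
# decoder on conversational context.
import torch
import torch.nn as nn

class ContextAwareEncoder(nn.Module):
    def __init__(self, phone_vocab=100, phone_dim=256, ctx_dim=256, ctx_hidden=128):
        super().__init__()
        # Encoder for the current utterance (phoneme sequence).
        self.phone_emb = nn.Embedding(phone_vocab, phone_dim)
        self.utt_encoder = nn.LSTM(phone_dim, phone_dim // 2,
                                   batch_first=True, bidirectional=True)
        # Conversational context encoder over sentence-level embeddings
        # of the previous turns (e.g., from a pretrained text encoder).
        self.ctx_encoder = nn.GRU(ctx_dim, ctx_hidden, batch_first=True)

    def forward(self, phones, context_embs):
        # phones:       (B, T) int64 phoneme ids of the current utterance
        # context_embs: (B, N_turns, ctx_dim) embeddings of previous turns
        utt_out, _ = self.utt_encoder(self.phone_emb(phones))  # (B, T, phone_dim)
        _, ctx_state = self.ctx_encoder(context_embs)          # (1, B, ctx_hidden)
        ctx = ctx_state[-1].unsqueeze(1).expand(-1, utt_out.size(1), -1)
        # Broadcast the context vector over time and concatenate, so the
        # attention-based decoder sees both utterance and context information.
        return torch.cat([utt_out, ctx], dim=-1)

enc = ContextAwareEncoder()
memory = enc(torch.randint(0, 100, (2, 30)), torch.randn(2, 3, 256))
print(memory.shape)  # torch.Size([2, 30, 384])
```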
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
Prosody Transfer (PT) is a technique that aims to use the prosody from a
source audio as a reference while synthesising speech. Fine-grained PT aims at
capturing prosodic aspects like rhythm, emphasis, melody, duration, and
loudness, from a source audio at a very granular level and transferring them
when synthesising speech in a different target speaker's voice. Current
approaches for fine-grained PT suffer from source speaker leakage, where the
synthesised speech has the voice identity of the source speaker as opposed to
the target speaker. In order to mitigate this issue, they compromise on the
quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT
system that is robust to source speaker leakage, without using parallel data.
We achieve this through a novel reference encoder architecture capable of
capturing temporal prosodic representations which are robust to source speaker
leakage. We compare CopyCat against a state-of-the-art fine-grained PT model
through various subjective evaluations, where we show a relative improvement of
in the quality of prosody transfer and in preserving the target
speaker identity, while still maintaining the same naturalness
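One common way to limit source-speaker leakage in a fine-grained reference encoder is a narrow per-frame bottleneck on the prosody representation. The sketch below shows that general idea only; the architecture, dimensions, and bottleneck size are assumptions, not CopyCat's actual reference encoder.
```python
# Minimal sketch of a fine-grained prosody reference encoder with a small
# per-frame bottleneck to discourage speaker-identity leakage (illustrative).
import torch
import torch.nn as nn

class ProsodyReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, bottleneck=8):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # A very small per-frame bottleneck forces the encoder to keep
        # temporal prosodic information while discarding most speaker detail.
        self.bottleneck = nn.Linear(2 * hidden, bottleneck)

    def forward(self, mels):
        # mels: (B, T_frames, n_mels) reference mel-spectrogram
        out, _ = self.rnn(mels)
        return torch.tanh(self.bottleneck(out))  # (B, T_frames, bottleneck)

enc = ProsodyReferenceEncoder()
prosody = enc(torch.randn(4, 400, 80))
print(prosody.shape)  # torch.Size([4, 400, 8])
```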
Teacher-Student Training for Robust Tacotron-based TTS
While neural end-to-end text-to-speech (TTS) is superior to conventional
statistical methods in many ways, the exposure bias problem in the
autoregressive models remains an issue to be resolved. The exposure bias
problem arises from the mismatch between the training and inference processes, which results in unpredictable performance on out-of-domain test data at
run-time. To overcome this, we propose a teacher-student training scheme for
Tacotron-based TTS by introducing a distillation loss function in addition to
the feature loss function. We first train a Tacotron2-based TTS model by always providing natural speech frames to the decoder; this model serves as the teacher. We then train another Tacotron2-based model as the student, whose decoder takes the predicted speech frames as input, similar to how the decoder
works during run-time inference. With the distillation loss, the student model learns the output probabilities of the teacher model, which is known as knowledge distillation. Experiments show that our proposed training scheme
consistently improves voice quality on out-of-domain test data in both Chinese and English systems.
Comment: To appear at ICASSP 2020, Barcelona, Spain
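A combined feature-plus-distillation objective of the kind described above can be sketched as follows. The weighting, loss types, and variable names are illustrative assumptions rather than the paper's exact formulation.
```python
# Minimal sketch of a feature + distillation loss for teacher-student
# training of an autoregressive TTS decoder (illustrative assumptions).
import torch
import torch.nn.functional as F

def teacher_student_loss(student_mel, teacher_mel, target_mel, alpha=0.5):
    """student_mel: frames predicted by the student (decoder fed its own predictions)
    teacher_mel: frames predicted by the teacher (decoder fed ground-truth frames)
    target_mel:  ground-truth mel-spectrogram frames
    """
    feature_loss = F.l1_loss(student_mel, target_mel)            # match natural speech
    distill_loss = F.l1_loss(student_mel, teacher_mel.detach())  # match the teacher
    return feature_loss + alpha * distill_loss

loss = teacher_student_loss(torch.randn(2, 100, 80, requires_grad=True),
                            torch.randn(2, 100, 80),
                            torch.randn(2, 100, 80))
loss.backward()
```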
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
This paper introduces Parallel Tacotron 2, a non-autoregressive neural
text-to-speech model with a fully differentiable duration model which does not
require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, which allow the model to learn token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2
outperforms baselines in subjective naturalness in several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
Comment: Submitted to INTERSPEECH 202
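Soft Dynamic Time Warping replaces the hard minimum in the DTW recurrence with a smooth soft-minimum so the alignment cost is differentiable. The naive recursion below is a small illustrative sketch of that idea, not the paper's implementation or an efficient one.
```python
# Minimal sketch of the soft-DTW recurrence used as a differentiable
# reconstruction loss (naive O(T1*T2) Python loop, for illustration only).
import torch

def _soft_min(values, gamma):
    # Differentiable (soft) minimum over the three DTW predecessors.
    v = torch.stack(values)
    return -gamma * torch.logsumexp(-v / gamma, dim=0)

def soft_dtw(pred, target, gamma=0.1):
    """pred: (T1, D) predicted frames; target: (T2, D) reference frames."""
    dist = torch.cdist(pred, target) ** 2          # pairwise squared distances
    T1, T2 = dist.shape
    inf = torch.tensor(float("inf"))
    # R[i][j]: soft-DTW cost of aligning pred[:i] with target[:j].
    R = [[inf] * (T2 + 1) for _ in range(T1 + 1)]
    R[0][0] = torch.tensor(0.0)
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            R[i][j] = dist[i - 1, j - 1] + _soft_min(
                [R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]], gamma)
    return R[T1][T2]

loss = soft_dtw(torch.randn(20, 80, requires_grad=True), torch.randn(25, 80))
loss.backward()
```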
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS
End-to-end neural TTS training has shown improved performance in speech style
transfer. However, the improvement is still limited by the training data in
both target styles and speakers. Inadequate style transfer performance occurs
when the trained TTS tries to transfer the speech to a target style from a new
speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, using disjoint, multi-style datasets, i.e., each style is recorded separately and spoken by a single speaker with multiple utterances. To encode the
style information, we adopt an inverse autoregressive flow (IAF) structure to
improve the variational inference. The whole system is optimized to minimize a weighted sum of four loss functions: 1) a reconstruction loss to
measure the distortions in both source and target reconstructions; 2) an
adversarial loss to "fool" a well-trained discriminator; 3) a style distortion
loss to measure the expected style loss after the transfer; 4) a cycle
consistency loss to preserve the speaker identity of the source after the
transfer. Experiments demonstrate, both objectively and subjectively, the
effectiveness of the proposed approach for seen and unseen style transfer
tasks. The new approach performs better and is more robust than four prior-art baseline systems.
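The four-term objective listed above can be written as a single weighted sum. The following sketch shows that structure only; the individual loss choices (L1, BCE, MSE), weights, and tensor names are illustrative assumptions, not the paper's values.
```python
# Minimal sketch of the weighted four-term training objective (illustrative).
import torch
import torch.nn.functional as F

def style_transfer_loss(src_recon, src, tgt_recon, tgt,
                        disc_fake_logits, style_pred, style_true,
                        cycle_spk_emb, src_spk_emb,
                        w_rec=1.0, w_adv=0.1, w_style=1.0, w_cyc=1.0):
    # 1) reconstruction loss on both source and target reconstructions
    rec = F.l1_loss(src_recon, src) + F.l1_loss(tgt_recon, tgt)
    # 2) adversarial loss: the generator tries to "fool" the discriminator
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    # 3) style distortion loss between predicted and target style embeddings
    style = F.mse_loss(style_pred, style_true)
    # 4) cycle consistency loss to preserve the source speaker identity
    cyc = F.l1_loss(cycle_spk_emb, src_spk_emb)
    return w_rec * rec + w_adv * adv + w_style * style + w_cyc * cyc
```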
Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN
Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion.
Previous studies on cross-lingual voice conversion mainly focus on spectral
conversion with a linear transformation for F0 transfer. However, as an
important prosodic factor, F0 is inherently hierarchical, so a simple linear transformation is insufficient for its conversion. We propose the use of
continuous wavelet transform (CWT) decomposition for F0 modeling. CWT provides
a way to decompose a signal into different temporal scales that explain prosody
in different time resolutions. We also propose to train two CycleGAN pipelines
for spectrum and prosody mapping respectively. In this way, we eliminate the
need for parallel data of any two languages and any alignment techniques.
Experimental results show that our proposed Spectrum-Prosody-CycleGAN framework
outperforms the Spectrum-CycleGAN baseline in subjective evaluation. To the best of our knowledge, this is the first study of prosody in cross-lingual voice conversion.
Comment: Accepted to APSIPA ASC 202
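A CWT decomposition of an F0 contour into several temporal scales can be computed with PyWavelets as sketched below. The wavelet choice, the dyadic scales, and the synthetic contour are assumptions for illustration, not the paper's exact configuration.
```python
# Minimal sketch of decomposing an F0 contour with the continuous wavelet
# transform (CWT) at several temporal scales, using PyWavelets.
import numpy as np
import pywt

# Interpolated, log-scaled F0 contour (unvoiced frames filled beforehand).
frame_shift = 0.005                      # 5 ms frame shift
f0 = 120 + 20 * np.sin(np.linspace(0, 8 * np.pi, 1000))   # synthetic example
log_f0 = np.log(f0)
log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()          # zero mean, unit variance

# Dyadic scales roughly covering phone, syllable, word, and phrase durations.
scales = 2.0 ** np.arange(1, 11)
coeffs, freqs = pywt.cwt(log_f0, scales, "mexh", sampling_period=frame_shift)
print(coeffs.shape)   # (10, 1000): one time series per temporal scale
```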
VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech
Emotional voice conversion (EVC) aims to convert the emotion of speech from
one state to another while preserving the linguistic content and speaker
identity. In this paper, we study the disentanglement and recomposition of
emotional elements in speech through variational autoencoding Wasserstein
generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN, which includes two VAW-GAN pipelines, one for
spectrum conversion, and another for prosody conversion. We train a spectral
encoder that disentangles emotion and prosody (F0) information from spectral
features; we also train a prosodic encoder that disentangles emotion modulation
of prosody (affective prosody) from linguistic prosody. At run-time, the
decoder of spectral VAW-GAN is conditioned on the output of prosodic VAW-GAN.
The vocoder takes the converted spectral and prosodic features to generate the
target emotional speech. Experiments validate the effectiveness of our proposed
method in both objective and subjective evaluations.
Comment: Accepted by IEEE SLT 2021. arXiv admin note: text overlap with arXiv:2005.0702
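The run-time pipeline described above, where the spectral decoder is conditioned on the output of the prosody pipeline, can be sketched as two simple decoders wired together. All module names, dimensions, and the MLP decoders themselves are assumptions used only to show the conditioning pattern, not the paper's architecture.
```python
# Minimal sketch of conditioning a spectral decoder on converted prosody features.
import torch
import torch.nn as nn

class ProsodyDecoder(nn.Module):
    def __init__(self, latent=64, cond=16, out=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent + cond, 128), nn.ReLU(),
                                 nn.Linear(128, out))
    def forward(self, z_prosody, emotion_emb):
        return self.net(torch.cat([z_prosody, emotion_emb], dim=-1))

class SpectralDecoder(nn.Module):
    def __init__(self, latent=128, cond=16, prosody=10, out=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent + cond + prosody, 256), nn.ReLU(),
                                 nn.Linear(256, out))
    def forward(self, z_spec, emotion_emb, converted_prosody):
        # The spectral decoder sees the converted prosody features as a condition.
        return self.net(torch.cat([z_spec, emotion_emb, converted_prosody], dim=-1))

prosody_dec, spec_dec = ProsodyDecoder(), SpectralDecoder()
emo = torch.randn(1, 16)                        # target-emotion embedding
f0_feats = prosody_dec(torch.randn(1, 64), emo)
spec = spec_dec(torch.randn(1, 128), emo, f0_feats)
print(f0_feats.shape, spec.shape)               # torch.Size([1, 10]) torch.Size([1, 80])
```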
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
This paper presents Non-Attentive Tacotron based on the Tacotron 2
text-to-speech model, replacing the attention mechanism with an explicit
duration predictor. This improves robustness significantly as measured by
unaligned duration ratio and word deletion rate, two metrics introduced in this
paper for large-scale robustness evaluation using a pre-trained speech
recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron
achieves a 5-scale mean opinion score for naturalness of 4.41, slightly
outperforming Tacotron 2. The duration predictor enables both utterance-wide
and per-phoneme control of duration at inference time. When accurate target
durations are scarce or unavailable in the training data, we propose a method
using a fine-grained variational auto-encoder to train the duration predictor
in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
Comment: Under review as a conference paper at ICLR 202
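Gaussian upsampling spreads each token encoding over the output frames with a Gaussian weight centred at the token's position, instead of simply repeating it for its duration. The sketch below follows that general idea; the exact normalisation and parameterisation are assumptions rather than the paper's formulation.
```python
# Minimal sketch of Gaussian upsampling of token encodings to frame level.
import torch

def gaussian_upsample(h, durations, sigmas):
    """h: (B, N, D) token encodings; durations, sigmas: (B, N) in frames."""
    total = int(durations.sum(dim=1).max().item())
    t = torch.arange(total, dtype=torch.float32).view(1, -1, 1)   # frame positions
    centers = torch.cumsum(durations, dim=1) - 0.5 * durations    # token centres
    # Unnormalised Gaussian weight of frame t for token i, then normalise over tokens.
    w = torch.exp(-0.5 * ((t - centers.unsqueeze(1)) / sigmas.unsqueeze(1)) ** 2)
    w = w / w.sum(dim=2, keepdim=True)                            # (B, T, N)
    return torch.bmm(w, h)                                        # (B, T, D)

h = torch.randn(2, 5, 16)
dur = torch.tensor([[3., 2., 4., 1., 5.], [2., 2., 2., 2., 2.]])
up = gaussian_upsample(h, dur, torch.full_like(dur, 1.0))
print(up.shape)  # torch.Size([2, 15, 16])
```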
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions the neural sequence-to-sequence TTS can work
well in Japanese and English along with comparisons with deep neural network
(DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
End-to-End Adversarial Text-to-Speech
Modern text-to-speech synthesis pipelines typically involve multiple
processing stages, each of which is designed or learnt independently from the
rest. In this work, we take on the challenging task of learning to synthesise
speech from normalised text or phonemes in an end-to-end manner, resulting in
models which operate directly on character or phoneme input sequences and
produce raw speech audio outputs. Our proposed generator is feed-forward and
thus efficient for both training and inference, using a differentiable
alignment scheme based on token length prediction. It learns to produce high
fidelity audio through a combination of adversarial feedback and prediction
losses constraining the generated audio to roughly match the ground truth in
terms of its total duration and mel-spectrogram. To allow the model to capture
temporal variation in the generated audio, we employ soft dynamic time warping
in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5-point scale, which is comparable to state-of-the-art models that rely on multi-stage training and additional supervision.
Comment: 23 pages. In proceedings of ICLR 202
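A differentiable aligner driven by token length prediction can be sketched as follows: predicted lengths define token centre positions, and each output frame attends to all tokens with weights given by a softmax over the (negative, scaled) squared distance to those centres. The temperature, shapes, and function names are illustrative assumptions, not the paper's exact formulation.
```python
# Minimal sketch of a differentiable, duration-based aligner (illustrative).
import torch

def align(token_feats, token_lengths, temperature=10.0):
    """token_feats: (B, N, D); token_lengths: (B, N) predicted lengths in frames."""
    ends = torch.cumsum(token_lengths, dim=1)                     # (B, N)
    centers = ends - 0.5 * token_lengths                          # token centre positions
    total = int(ends[:, -1].max().item())
    t = torch.arange(total, dtype=torch.float32).view(1, -1, 1)   # frame positions
    dist2 = (t - centers.unsqueeze(1)) ** 2                       # (B, T, N)
    weights = torch.softmax(-dist2 / temperature, dim=2)          # (B, T, N)
    return torch.bmm(weights, token_feats)                        # (B, T, D) frame features

frames = align(torch.randn(2, 6, 32),
               torch.tensor([[2., 4., 3., 5., 1., 3.], [3., 3., 3., 3., 3., 3.]]))
print(frames.shape)  # torch.Size([2, 18, 32])
```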