Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis
Speech style control and transfer techniques aim to enrich the diversity and
expressiveness of synthesized speech. Existing approaches model all speech
styles into one representation, lacking the ability to control a specific
speech feature independently. To address this issue, we introduce a novel
multi-reference structure into Tacotron and propose an intercross training
approach; together, they ensure that each sub-encoder of the multi-reference
encoder independently disentangles and controls a specific style. Experimental
results show that our model is able to control and transfer desired speech
styles individually.

Comment: Submitted for Interspeech 2019, 5 pages
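
The paper itself gives no code, but the multi-reference idea is concrete enough to sketch. Below is a minimal PyTorch illustration of a multi-reference encoder with independent sub-encoders whose style embeddings jointly condition the decoder; all names and dimensions (SubEncoder, num_styles, the GRU sizes) are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a multi-reference style encoder, assuming PyTorch.
# Names and dimensions are illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class SubEncoder(nn.Module):
    """Encodes one reference mel-spectrogram into one style embedding."""
    def __init__(self, n_mels=80, hidden=128, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, style_dim)

    def forward(self, mel):              # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)             # h: (1, batch, hidden)
        return torch.tanh(self.proj(h.squeeze(0)))  # (batch, style_dim)

class MultiReferenceEncoder(nn.Module):
    """One sub-encoder per style class (e.g. emotion, speaker, prosody);
    each consumes its own reference clip, and the style embeddings are
    concatenated before conditioning the Tacotron decoder."""
    def __init__(self, num_styles=3):
        super().__init__()
        self.subs = nn.ModuleList([SubEncoder() for _ in range(num_styles)])

    def forward(self, refs):             # refs: list of (batch, frames, n_mels)
        return torch.cat([enc(r) for enc, r in zip(self.subs, refs)], dim=-1)

refs = [torch.randn(2, 100, 80) for _ in range(3)]  # three reference clips
style = MultiReferenceEncoder()(refs)                # (2, 192) conditioning vector
print(style.shape)
```

Because each sub-encoder sees only its own reference audio, a training scheme such as the paper's intercross training can push each one toward a single style factor.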
Conversational End-to-End TTS for Voice Agent
End-to-end neural TTS has achieved superior performance on reading-style
speech synthesis. However, it is still challenging to build a high-quality
conversational TTS system due to limitations of the corpus and of modeling
capability. This study aims at building a conversational TTS for a voice agent
under the sequence-to-sequence modeling framework. We first construct a
spontaneous conversational speech corpus designed for the voice agent, with a
new recording scheme that ensures both recording quality and a conversational
speaking style. Second, we propose a conversation context-aware end-to-end TTS
approach with an auxiliary encoder and a conversational context encoder that
reinforce the information about both the current utterance and its context in
the conversation. Experimental results show that the proposed
methods produce more natural prosody in accordance with the conversational
context, with significant preference gains at both utterance-level and
conversation-level. Moreover, we find that the model has the ability to express
some spontaneous behaviors, like fillers and repeated words, which makes the
conversational speaking style more realistic.

Comment: Accepted by SLT 2021; 7 pages
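
As a rough illustration of the conditioning mechanism described above, the sketch below summarizes the preceding turns of a conversation into a context vector and broadcasts it onto the text-encoder outputs. The ContextEncoder module, the sentence-embedding dimension, and the concatenation point are assumptions, not the paper's exact architecture.

```python
# Sketch of conversation-context conditioning, assuming PyTorch; the
# history encoder and its dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Summarizes sentence embeddings of the preceding utterances in a
    conversation into a single context vector for the TTS decoder."""
    def __init__(self, sent_dim=256, ctx_dim=128):
        super().__init__()
        self.rnn = nn.GRU(sent_dim, ctx_dim, batch_first=True)

    def forward(self, history):          # (batch, n_turns, sent_dim)
        _, h = self.rnn(history)
        return h.squeeze(0)              # (batch, ctx_dim)

# The context vector is concatenated to each text-encoder output frame
# before attention, alongside any auxiliary current-utterance embedding.
history = torch.randn(4, 5, 256)          # 5 previous turns, 4 conversations
text_enc = torch.randn(4, 60, 512)        # Tacotron text-encoder outputs
ctx = ContextEncoder()(history)           # (4, 128)
conditioned = torch.cat(
    [text_enc, ctx.unsqueeze(1).expand(-1, 60, -1)], dim=-1)
print(conditioned.shape)                  # torch.Size([4, 60, 640])
```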
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
This paper introduces a new speech corpus called "LibriTTS" designed for
text-to-speech use. It is derived from the original audio and text materials of
the LibriSpeech corpus, which has been used for training and evaluating
automatic speech recognition systems. The new corpus inherits desired
properties of the LibriSpeech corpus while addressing a number of issues which
make LibriSpeech less than ideal for text-to-speech work. The released corpus
consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456
speakers, together with the corresponding texts. Experimental results show that
neural end-to-end TTS models trained on the LibriTTS corpus achieved mean
opinion scores above 4.0 for naturalness for five of the six evaluation
speakers. The corpus is freely available for download from
http://www.openslr.org/60/.

Comment: Submitted for Interspeech 2019, 7 pages
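
For orientation, a short sketch of loading the corpus through torchaudio's built-in LibriTTS dataset wrapper; this assumes a recent torchaudio install, and the subset name is one of the released splits.

```python
# Loading LibriTTS via torchaudio's dataset wrapper (assumes torchaudio
# is installed; download=True fetches the chosen subset from openslr.org).
import torchaudio

dataset = torchaudio.datasets.LIBRITTS(
    root="./data", url="train-clean-100", download=True)

# Each item: waveform, sample rate (24000), original text, normalized
# text, speaker ID, chapter ID, utterance ID.
wav, sr, text, norm_text, speaker, chapter, utt = dataset[0]
print(sr, wav.shape, norm_text)
```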
A comparison of Vietnamese Statistical Parametric Speech Synthesis Systems
In recent years, statistical parametric speech synthesis (SPSS) systems have
been widely utilized in many interactive speech-based systems (e.g., Amazon's
Alexa, Bose's headphones). To select a suitable SPSS system, both speech
quality and performance efficiency (e.g., decoding time) must be taken into
account. In this paper, we compared four popular Vietnamese SPSS techniques in
terms of speech quality and performance efficiency: 1) hidden Markov models
(HMMs), 2) deep neural networks (DNNs), 3) generative adversarial networks
(GANs), and 4) an end-to-end (E2E) architecture consisting of Tacotron 2 and
the WaveGlow vocoder. We showed that the E2E systems achieved the best quality
but required a GPU to reach real-time performance, while the HMM-based system
had inferior speech quality but was the most efficient. Surprisingly, the E2E
systems were more efficient than the DNN and GAN systems at inference on a GPU,
and the GAN-based system did not outperform the DNN in terms of quality.

Comment: 9 pages, submitted to KSE 202
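
The efficiency comparison hinges on decoding time relative to audio duration, usually reported as a real-time factor (RTF). Below is a minimal measurement sketch; the synthesize function is a placeholder standing in for any of the four compared systems, and the sample rate is an assumption.

```python
# Minimal real-time-factor (RTF) benchmark; `synthesize` is a placeholder
# for any of the compared systems (HMM, DNN, GAN, or E2E).
import time
import numpy as np

SAMPLE_RATE = 22050  # assumed output rate

def synthesize(text):
    """Stand-in: return a synthesized waveform as a float32 numpy array."""
    return np.zeros(int(2.5 * SAMPLE_RATE), dtype=np.float32)

def real_time_factor(texts):
    total_decode, total_audio = 0.0, 0.0
    for t in texts:
        start = time.perf_counter()
        wav = synthesize(t)
        total_decode += time.perf_counter() - start
        total_audio += len(wav) / SAMPLE_RATE
    return total_decode / total_audio  # RTF < 1.0: faster than real time

print(f"RTF = {real_time_factor(['xin chào'] * 10):.4f}")
```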
Learning latent representations for style control and transfer in end-to-end speech synthesis
In this paper, we introduce the variational autoencoder (VAE) into an
end-to-end speech synthesis model to learn latent representations of speaking
styles in an unsupervised manner. The style representation learned through the
VAE shows good properties, such as disentangling, scaling, and combination,
which make style control easy. Style transfer can be achieved in this framework
by first inferring a style representation through the recognition network of
the VAE and then feeding it into the TTS network to guide the style of the
synthesized speech. Several techniques are adopted to avoid Kullback-Leibler
(KL) divergence collapse during training. Finally, the proposed model shows
good style-control performance and outperforms the Global Style Token (GST)
model in ABX preference tests on style transfer.

Comment: Paper accepted by ICASSP 201
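
A minimal sketch of such a VAE recognition network follows, together with KL annealing, one common remedy for KL collapse. The abstract says only that "several techniques are adopted", so the warm-up schedule and all dimensions below are assumptions.

```python
# Sketch of a VAE style encoder with KL annealing (a common anti-collapse
# technique); the schedule and dimensions are assumptions, in PyTorch.
import torch
import torch.nn as nn

class StyleVAE(nn.Module):
    def __init__(self, n_mels=80, hidden=128, z_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, mel):                    # (batch, frames, n_mels)
        _, h = self.rnn(mel)
        mu, logvar = self.mu(h.squeeze(0)), self.logvar(h.squeeze(0))
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()

vae = StyleVAE()
for step in range(3):
    z, kl = vae(torch.randn(8, 120, 80))
    beta = min(1.0, step / 10_000)  # KL-weight warm-up (annealing)
    loss = beta * kl                # + reconstruction loss from the TTS decoder
    loss.backward()
```

The sampled z then conditions the TTS network; at inference time it can be inferred from a reference utterance (transfer) or manipulated directly (control).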
Probing the phonetic and phonological knowledge of tones in Mandarin TTS models
This study probes the phonetic and phonological knowledge of lexical tones in
TTS models through two experiments. Controlled stimuli for testing tonal
coarticulation and tone sandhi in Mandarin were fed into Tacotron 2 and
WaveGlow to generate speech samples, which were subjected to acoustic analysis
and human evaluation. Results show that both baseline Tacotron 2 and Tacotron 2
with BERT embeddings capture the surface tonal coarticulation patterns well but
fail to consistently apply the Tone-3 sandhi rule to novel sentences.
Incorporating pre-trained BERT embeddings into Tacotron 2 improves the
naturalness and prosody performance, and yields better generalization of Tone-3
sandhi rules to novel complex sentences, although the overall accuracy for
Tone-3 sandhi was still low. Given that TTS models do capture some linguistic
phenomena, it is argued that they can be used to generate and validate certain
linguistic hypotheses. On the other hand, it is also suggested that
linguistically informed stimuli should be included in the training and the
evaluation of TTS models.

Comment: Submitted to Speech Prosody 202
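
The acoustic side of such a probe largely reduces to extracting F0 trajectories from the synthesized stimuli and comparing them against canonical tone contours. The sketch below uses librosa's pYIN tracker; the toolkit choice is an assumption, since the abstract does not name one.

```python
# Sketch of F0-contour extraction for tonal analysis, using librosa's
# pYIN tracker; the toolkit choice is an assumption, not the paper's.
import librosa
import numpy as np

def f0_contour(wav_path):
    y, sr = librosa.load(wav_path, sr=22050)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"), sr=sr)
    return np.where(voiced, f0, np.nan)  # NaN in unvoiced regions

# Comparing the contour of e.g. a Tone-3 + Tone-3 sequence (surface 2-3
# after sandhi) against a canonical template quantifies rule application.
```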
Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved
extraordinary performance, but a studio-quality corpus with manual
transcription is necessary to train such seq2seq systems. In this paper, we
propose an approach to building a high-quality and stable seq2seq-based speech
synthesis system from challenging found data, where the training speech
contains noisy interference (acoustic noise) and the texts are imperfect speech
recognition transcripts (textual noise). To deal with the text-side noise, we
propose a VQ-VAE-based heuristic method that compensates for erroneous
linguistic features with phonetic information learned directly from speech. For
the speech-side noise, we propose to learn a noise-independent feature in the
autoregressive decoder through adversarial training and data augmentation,
without requiring an extra speech enhancement model. Experiments show the
effectiveness of the proposed approach in dealing with both text-side and
speech-side noise. Surpassing a denoising approach based on a state-of-the-art
speech enhancement model, our system built on noisy found data can synthesize
clean and high-quality speech with a MOS close to that of the system built on
the clean counterpart.

Comment: submitted to IEEE SP
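One standard way to make a feature noise-independent under adversarial training is a domain classifier trained through a gradient-reversal layer. The abstract does not spell out the exact adversarial setup, so the sketch below is illustrative, not the paper's method.

```python
# Sketch of adversarial noise-independent feature learning via a
# gradient-reversal layer (GRL); illustrative, not the paper's exact setup.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip gradients flowing to the encoder

feat = torch.randn(8, 256, requires_grad=True)  # decoder hidden feature
noise_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
is_noisy = torch.randint(0, 2, (8,))  # clean vs. augmented/noisy label

logits = noise_clf(GradReverse.apply(feat, 1.0))
loss = nn.functional.cross_entropy(logits, is_noisy)
loss.backward()  # classifier learns to spot noise; encoder learns to hide it
```

Data augmentation (mixing clean speech with noise) supplies the labeled clean/noisy pairs that make this classifier trainable.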
Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment
Sequence-to-sequence text-to-speech (TTS) is dominated by soft-attention-based
methods. Recently, hard-attention-based methods have been proposed to prevent
fatal alignment errors, but their methods for sampling discrete alignments are
poorly investigated. This research investigates various combinations of
sampling methods and probability distributions for alignment transition
modeling in a hard-alignment-based sequence-to-sequence TTS method called
SSNT-TTS. We formulate the common sampling methods for discrete variables,
including greedy search, beam search, and random sampling from a Bernoulli
distribution, in a more general way. Furthermore, we introduce the binary
Concrete distribution to model the discrete variables more properly. The
results of a listening test show that deterministic search is preferable to
stochastic search, and that the binary Concrete distribution is robust under
stochastic search with respect to natural alignment transitions.

Comment: Submitted to ICASSP 202
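
The binary Concrete (Gumbel-sigmoid) relaxation referenced above has a simple closed form: a logistic noise sample is added to the transition logits and squashed through a temperature-scaled sigmoid. A minimal sampling sketch follows; the temperature value is chosen purely for illustration.

```python
# Sampling the binary Concrete (relaxed Bernoulli) distribution used to
# model the alignment transition variable; temperature is illustrative.
import torch

def binary_concrete(logits, temperature=0.5):
    """Draw a relaxed Bernoulli sample in (0, 1) from transition logits."""
    u = torch.rand_like(logits)
    logistic_noise = torch.log(u) - torch.log1p(-u)  # Logistic(0, 1)
    return torch.sigmoid((logits + logistic_noise) / temperature)

logits = torch.zeros(5)         # p(transition) = 0.5 at every step
print(binary_concrete(logits))  # soft samples; hard {0,1} as temperature -> 0
```

Lowering the temperature sharpens samples toward hard 0/1 decisions, which is what makes the relaxation a drop-in for the discrete Bernoulli transition at training time.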
Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System
End-to-end text-to-speech (TTS) systems have proved highly successful when a
large amount of high-quality training data, recorded in an anechoic room with a
high-quality microphone, is available. An alternative is to use available
sources of found data, such as radio broadcast news. We aim to optimize the
naturalness of a TTS system built on found data using a novel data processing
method that includes 1) utterance selection and 2) prosodic punctuation
insertion to prepare training data. We showed that, using this data processing
method, an end-to-end TTS system achieved a mean opinion score (MOS) of 4.1,
compared to 4.3 for natural speech, and that the punctuation insertion
contributed the most to this result. To facilitate the research and development
of TTS systems, we distributed the processed data of one speaker at
https://forms.gle/6Hk5YkqgDxAaC2BU6.

Comment: 8 pages, 2 figures, submitted to Oriental Cocosd
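
The abstract describes punctuation insertion only at a high level. One plausible realization, sketched below under that assumption, inserts a prosodic comma wherever the force-aligned audio contains a silent pause; the pause threshold and the aligner input format are assumptions, not the paper's actual method.

```python
# Sketch of prosodic punctuation insertion: add a comma to the transcript
# wherever the aligned audio has a silent pause; thresholds and input
# format are assumptions, not the paper's actual method.
def insert_prosodic_punctuation(words, word_end_times, word_start_times,
                                pause_threshold=0.30):
    """words[i] spans word_start_times[i]..word_end_times[i] seconds,
    e.g. as produced by a forced aligner."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i + 1 < len(words):
            pause = word_start_times[i + 1] - word_end_times[i]
            if pause >= pause_threshold and not w.endswith((",", ".")):
                out[-1] = w + ","  # mark the prosodic boundary
    return " ".join(out)

print(insert_prosodic_punctuation(
    ["hôm", "nay", "trời", "đẹp"],
    [0.3, 0.6, 1.4, 1.8], [0.0, 0.35, 1.0, 1.5]))
# -> "hôm nay, trời đẹp"  (0.4 s pause after "nay")
```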
Fast and High-Quality Singing Voice Synthesis System based on Convolutional Neural Networks
The present paper describes singing voice synthesis based on convolutional
neural networks (CNNs). Singing voice synthesis systems based on deep neural
networks (DNNs) are currently being proposed and are improving the naturalness
of synthesized singing voices. As singing voices represent a rich form of
expression, a powerful technique to model them accurately is required. In the
proposed technique, long-term dependencies of singing voices are modeled by
CNNs. An acoustic feature sequence is generated for each segment that consists
of long-term frames, and a natural trajectory is obtained without the parameter
generation algorithm. Furthermore, a computational-complexity reduction
technique, which drives the DNNs at different time units depending on the type
of musical score feature, is proposed. Experimental results show that the
proposed method can synthesize natural-sounding singing voices much faster than
the conventional method.

Comment: Accepted to ICASSP 2020. Singing voice samples (Japanese, English,
Chinese): https://www.techno-speech.com/news-20181214a-en. arXiv admin note:
substantial text overlap with arXiv:1904.0686
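
To make the segment-level idea concrete: a non-autoregressive 1-D CNN can map the score features of an entire segment to the full acoustic-feature trajectory in one shot, with its receptive field supplying the long-term smoothness that the parameter generation algorithm otherwise provides. The sketch below assumes PyTorch; layer sizes and feature dimensions are illustrative, not the authors' configuration.

```python
# Sketch of segment-wise CNN acoustic modeling: stacked 1-D convolutions
# map a segment of score features to the acoustic trajectory in one shot;
# layer sizes and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SegmentCNN(nn.Module):
    def __init__(self, score_dim=120, acoustic_dim=187, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(score_dim, channels, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, acoustic_dim, kernel_size=7, padding=3),
        )

    def forward(self, score):   # (batch, score_dim, frames)
        return self.net(score)  # (batch, acoustic_dim, frames)

segment = torch.randn(1, 120, 800)   # ~4 s of frame-level score features
traj = SegmentCNN()(segment)         # full trajectory, no autoregressive loop
print(traj.shape)                    # torch.Size([1, 187, 800])
```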