Teacher-Student Training for Robust Tacotron-based TTS
While neural end-to-end text-to-speech (TTS) is superior to conventional
statistical methods in many ways, the exposure bias problem in the
autoregressive models remains an issue to be resolved. The exposure bias
problem arises from the mismatch between the training and inference processes,
which results in unpredictable performance on out-of-domain test data at
run-time. To overcome this, we propose a teacher-student training scheme for
Tacotron-based TTS that introduces a distillation loss function in addition to
the feature loss function. We first train a Tacotron2-based TTS model whose
decoder is always given natural speech frames; this serves as the teacher
model. We then train another Tacotron2-based model as the student model, whose
decoder takes the predicted speech frames as input, just as the decoder does
during run-time inference. Through the distillation loss, the student model
learns the output probabilities of the teacher model, a process known as
knowledge distillation. Experiments show that our proposed training scheme
consistently improves voice quality on out-of-domain test data in both Chinese
and English systems.
Comment: To appear at ICASSP2020, Barcelona, Spain
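A minimal sketch of the combined objective described in this abstract, assuming hypothetical `teacher`/`student` call signatures and an L1 form for both losses (the paper's exact loss definitions may differ):

```python
import torch
import torch.nn.functional as F

def teacher_student_step(teacher, student, text, mel_target):
    """One training step of the student under the combined objective."""
    # Teacher decodes with ground-truth frames (teacher forcing); its
    # outputs serve as soft targets and are not back-propagated through.
    with torch.no_grad():
        mel_teacher = teacher(text, decoder_inputs=mel_target)
    # Student decodes from its own predictions, as at run-time inference.
    mel_student = student(text, decoder_inputs=None)
    feature_loss = F.l1_loss(mel_student, mel_target)   # feature loss
    distill_loss = F.l1_loss(mel_student, mel_teacher)  # distillation loss
    return feature_loss + distill_loss
```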
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions neural sequence-to-sequence TTS can work
well in Japanese and English, comparing it with deep neural network
(DNN) based pipeline TTS systems. Unlike in past comparative studies, the
pipeline systems here also use autoregressive probabilistic modeling and a
neural vocoder. We investigate the systems from three aspects: a) model
architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high-quality speech, b) it should use a powerful encoder
when it takes characters as input, and c) the encoder still has room for
improvement and needs a better architecture to learn supra-segmental features
more appropriately.
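As an illustration of the encoder architectures compared, a minimal sketch of a Tacotron2-style character encoder (a convolution stack followed by a bidirectional LSTM); all layer sizes here are illustrative assumptions, not the paper's configuration:

```python
import torch.nn as nn

class Tacotron2StyleEncoder(nn.Module):
    """Sketch of a Tacotron2-style text encoder: embedding, three 1-D
    convolutions, and a bidirectional LSTM over the character sequence."""
    def __init__(self, num_symbols, dim=512, kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(num_symbols, dim)
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            ) for _ in range(3)
        ])
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, symbol_ids):                   # (batch, time)
        x = self.embed(symbol_ids).transpose(1, 2)   # (batch, dim, time)
        x = self.convs(x)
        outputs, _ = self.lstm(x.transpose(1, 2))    # (batch, time, dim)
        return outputs
```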
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
Recent advances in neural network-based text-to-speech have reached
human-level naturalness in synthetic speech. The present sequence-to-sequence models
can directly map text to mel-spectrogram acoustic features, which are
convenient for modeling, but present additional challenges for vocoding (i.e.,
waveform generation from the acoustic features). High-quality synthesis can be
achieved with neural vocoders, such as WaveNet, but such autoregressive models
suffer from slow sequential inference. Meanwhile, their existing parallel
inference counterparts are difficult to train and require increasingly large
model sizes. In this paper, we propose an alternative training strategy for a
parallel neural vocoder utilizing generative adversarial networks, and
integrate a linear predictive synthesis filter into the model. Results show
that the proposed model achieves significant improvement in inference speed,
while outperforming a WaveNet in copy-synthesis quality.
Comment: Interspeech 2019 accepted version
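A sketch of the core signal-flow idea, a GAN-generated excitation passed through an all-pole linear-predictive synthesis filter; `generator` and `estimate_lpc_from_mel` are hypothetical placeholders, and the paper applies the filtering frame by frame:

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesis(excitation, lpc_coeffs):
    # All-pole LP synthesis filter 1/A(z), with lpc_coeffs = [1, a_1, ..., a_p]
    # for A(z) = 1 + sum_k a_k z^{-k}.
    return lfilter([1.0], lpc_coeffs, excitation)

# Toy usage with white-noise excitation and a first-order all-pole filter;
# in GELP the excitation comes from the GAN generator and the LP coefficients
# are derived from the mel-spectrogram (both names below are hypothetical):
#   excitation = generator(noise, mel)
#   waveform = lp_synthesis(excitation, estimate_lpc_from_mel(mel))
waveform = lp_synthesis(np.random.randn(16000), np.array([1.0, -0.9]))
```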
A Survey on Neural Speech Synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities and has broad applications in the
industry. With the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS. We
further summarize resources related to TTS (e.g., datasets and open-source
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.
Comment: A comprehensive survey on TTS, 63 pages, 18 tables, 7 figures, 457 references
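The component decomposition the survey uses can be summarized schematically; each function below is a placeholder for a concrete model class discussed in the survey (e.g., a grapheme-to-phoneme frontend, FastSpeech 2 as the acoustic model, a neural vocoder):

```python
def synthesize(text, text_analysis, acoustic_model, vocoder):
    # The three key components of a neural TTS pipeline as surveyed:
    phonemes = text_analysis(text)   # text analysis (frontend)
    mel = acoustic_model(phonemes)   # acoustic model: linguistic -> mel-spectrogram
    waveform = vocoder(mel)          # vocoder: mel-spectrogram -> waveform
    return waveform
```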
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Non-autoregressive text to speech (TTS) models such as FastSpeech can
synthesize speech significantly faster than previous autoregressive models with
comparable quality. The training of the FastSpeech model relies on an
autoregressive teacher model for duration prediction (to provide more
information as input) and knowledge distillation (to simplify the data
distribution in output), which can ease the one-to-many mapping problem (i.e.,
multiple speech variations correspond to the same text) in TTS. However,
FastSpeech has several disadvantages: 1) the teacher-student distillation
pipeline is complicated and time-consuming, 2) the duration extracted from the
teacher model is not accurate enough, and the target mel-spectrograms distilled
from the teacher model suffer from information loss due to data simplification,
both of which limit the voice quality. In this paper, we propose FastSpeech 2,
which addresses the issues in FastSpeech and better solves the one-to-many
mapping problem in TTS by 1) directly training the model with the ground-truth
target instead of the simplified output from the teacher, and 2) introducing more
variation information of speech (e.g., pitch, energy, and more accurate
duration) as conditional inputs. Specifically, we extract duration, pitch and
energy from speech waveform and directly take them as conditional inputs in
training and use predicted values in inference. We further design FastSpeech
2s, which is the first attempt to directly generate speech waveform from text
in parallel, enjoying the benefit of fully end-to-end inference. Experimental
results show that 1) FastSpeech 2 achieves a 3x training speed-up over
FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech
2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even
surpass autoregressive models. Audio samples are available at
https://speechresearch.github.io/fastspeech2/.
Comment: Accepted by ICLR 2021
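A minimal sketch of the variance-conditioning idea (ground-truth pitch and energy as inputs during training, predicted values at inference). Note the paper quantizes pitch and energy into embedding tables and also applies a duration-based length regulator; simple linear layers stand in here, and all names are illustrative:

```python
import torch.nn as nn

class VarianceAdaptorSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Placeholder predictors; the paper uses small conv stacks and
        # quantized embedding tables instead of these linear layers.
        self.pitch_predictor = nn.Linear(dim, 1)
        self.energy_predictor = nn.Linear(dim, 1)
        self.pitch_embed = nn.Linear(1, dim)
        self.energy_embed = nn.Linear(1, dim)

    def forward(self, h, pitch_target=None, energy_target=None):
        # h: (batch, time, dim) hidden sequence after length regulation.
        pitch_pred = self.pitch_predictor(h)     # (batch, time, 1)
        energy_pred = self.energy_predictor(h)
        # Training: condition on ground-truth values; inference: use predictions.
        pitch = pitch_target if pitch_target is not None else pitch_pred
        energy = energy_target if energy_target is not None else energy_pred
        h = h + self.pitch_embed(pitch) + self.energy_embed(energy)
        return h, pitch_pred, energy_pred
```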
SpeedySpeech: Efficient Neural Speech Synthesis
While recent neural sequence-to-sequence models have greatly improved the
quality of speech synthesis, there has not been a system capable of fast
training, fast inference, and high-quality audio synthesis at the same time. We
propose a student-teacher network capable of high-quality faster-than-real-time
spectrogram synthesis, with low computational-resource requirements and fast
training time. We show that self-attention layers are not necessary for
generating high-quality audio. We utilize simple convolutional blocks with
residual connections in both student and teacher networks and use only a single
attention layer in the teacher model. Coupled with a MelGAN vocoder, our
model's voice quality was rated significantly higher than that of Tacotron 2.
Our model
can be efficiently trained on a single GPU and can run in real time even on a
CPU. We provide both our source code and audio samples in our GitHub
repository.
Comment: 5 pages, 3 figures, Interspeech 2020
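A sketch of the kind of convolutional residual block the model builds on in place of self-attention; channel width, kernel size, and the exact conv/norm/activation ordering are illustrative assumptions:

```python
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels=128, kernel_size=3, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # preserve time length
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        # Residual connection around a dilated 1-D convolution.
        return x + self.norm(self.act(self.conv(x)))
```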
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named
ESPnet-TTS, which is an extension of the open-source speech processing toolkit
ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including
Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired
by the Kaldi automatic speech recognition (ASR) toolkit. The recipes are based
on a design unified with the ESPnet ASR recipes, providing high
reproducibility. The toolkit also provides pre-trained models and samples for
all of the recipes so that users can use them as baselines. Furthermore, the
unified design enables the integration of ASR functions with TTS, e.g.,
ASR-based objective evaluation and semi-supervised learning with both ASR and
TTS models. This paper describes the design of the toolkit and experimental
evaluation in comparison with other toolkits. The experimental results show
that our models can achieve state-of-the-art performance comparable to that of
the other latest toolkits, with a mean opinion score (MOS) of 4.25 on the
LJSpeech dataset. The toolkit is publicly available at
https://github.com/espnet/espnet.
Comment: Accepted to ICASSP2020. Demo HP: https://espnet.github.io/icassp2020-tts
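For a sense of how the pre-trained models can be used, a hedged sketch with ESPnet's Python inference API (the class path and model tag follow the current espnet2 interface, which may differ from the toolkit version described in the paper):

```python
# pip install espnet espnet_model_zoo
from espnet2.bin.tts_inference import Text2Speech

# Load a pre-trained LJSpeech model by tag (downloaded on first use).
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")
output = tts("Hello world.")
waveform = output["wav"]  # synthesized waveform tensor
```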
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
This paper presents Non-Attentive Tacotron based on the Tacotron 2
text-to-speech model, replacing the attention mechanism with an explicit
duration predictor. This improves robustness significantly as measured by
unaligned duration ratio and word deletion rate, two metrics introduced in this
paper for large-scale robustness evaluation using a pre-trained speech
recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron
achieves a mean opinion score for naturalness of 4.41 on a 5-point scale, slightly
outperforming Tacotron 2. The duration predictor enables both utterance-wide
and per-phoneme control of duration at inference time. When accurate target
durations are scarce or unavailable in the training data, we propose a method
using a fine-grained variational auto-encoder to train the duration predictor
in a semi-supervised or unsupervised manner, with results almost as good as
supervised training.
Comment: Under review as a conference paper at ICLR 2021
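A minimal sketch of Gaussian upsampling as described: each encoder output is placed at the center implied by the cumulative durations and spread over frames with a per-token range parameter sigma (tensor shapes and the frame-grid convention here are assumptions):

```python
import torch

def gaussian_upsampling(h, durations, sigma):
    # h: (num_tokens, dim); durations, sigma: (num_tokens,), in frames.
    ends = torch.cumsum(durations, dim=0)
    centers = ends - 0.5 * durations             # token centers c_i
    num_frames = int(torch.round(ends[-1]).item())
    t = torch.arange(num_frames, dtype=h.dtype) + 0.5
    # Gaussian weight of token i on frame t, normalized across tokens.
    w = torch.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigma[None, :]) ** 2)
    w = w / w.sum(dim=1, keepdim=True)
    return w @ h                                 # (num_frames, dim)
```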
Expressive TTS Training with Frame and Style Reconstruction Loss
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS)
system to improve the expressiveness of speech. One of the key challenges in
prosody modeling is the lack of a reference, which makes explicit modeling
difficult. The proposed technique does not require prosody annotations in the
training data, nor does it attempt to model prosody explicitly; it instead
rather encodes the association between input text and its prosody styles using
a Tacotron-based TTS framework. Our proposed idea marks a departure from the
style token paradigm where prosody is explicitly modeled by a bank of prosody
embeddings. The proposed training strategy adopts a combination of two
objective functions: 1) a frame-level reconstruction loss, calculated between
the synthesized and target spectral features; and 2) an utterance-level style
reconstruction loss, calculated between the deep style features of the
synthesized and target speech. The proposed style reconstruction loss is
formulated as a perceptual loss to ensure that utterance-level speech style is
taken into consideration during training. Experiments show that the proposed
training strategy achieves remarkable performance and outperforms a
state-of-the-art baseline in both naturalness and expressiveness. To the best
of our knowledge, this is the first study to incorporate utterance-level
perceptual quality as a loss function in Tacotron training for improved
expressiveness.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
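A minimal sketch of the two-term objective, assuming a placeholder `style_encoder` that extracts the utterance-level deep style features, with L1 distances standing in for the paper's exact formulations:

```python
import torch
import torch.nn.functional as F

def frame_and_style_loss(mel_pred, mel_target, style_encoder):
    # 1) Frame-level reconstruction loss on spectral features.
    frame_loss = F.l1_loss(mel_pred, mel_target)
    # 2) Utterance-level style reconstruction (perceptual) loss between
    #    deep style features of synthesized and target speech.
    with torch.no_grad():
        style_target = style_encoder(mel_target)
    style_loss = F.l1_loss(style_encoder(mel_pred), style_target)
    return frame_loss + style_loss
```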
RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
This paper introduces RyanSpeech, a new speech corpus for research on
automated text-to-speech (TTS) systems. Publicly available TTS corpora are
often noisy, recorded with multiple speakers, or lack quality male speech data.
In order to meet the need for a high-quality, publicly available male speech
corpus within the field of speech recognition, we have designed and created
RyanSpeech which contains textual materials from real-world conversational
settings. These materials contain over 10 hours of a professional male voice
actor's speech recorded at 44.1 kHz. This corpus's design and pipeline make
RyanSpeech ideal for developing TTS systems in real-world applications. To
provide a baseline for future research, protocols, and benchmarks, we trained 4
state-of-the-art speech models and a vocoder on RyanSpeech. Our best model
achieves a mean opinion score (MOS) of 3.36. We have made both the corpus and
the trained models available for public use.