13 research outputs found
A Comparative Analysis of Pretrained Language Models for Text-to-Speech
State-of-the-art text-to-speech (TTS) systems have utilized pretrained
language models (PLMs) to enhance prosody and create more natural-sounding
speech. However, while PLMs have been extensively researched for natural
language understanding (NLU), their impact on TTS has been overlooked. In this
study, we aim to address this gap by conducting a comparative analysis of
different PLMs for two TTS tasks: prosody prediction and pause prediction.
Firstly, we trained a prosody prediction model using 15 different PLMs. Our
findings revealed a logarithmic relationship between model size and quality, as
well as significant performance differences between neutral and expressive
prosody. Secondly, we employed PLMs for pause prediction and found that the
task was less sensitive to small models. We also identified a strong
correlation between our empirical results and the GLUE scores obtained for
these language models. To the best of our knowledge, this is the first study of
its kind to investigate the impact of different PLMs on TTS.
Comment: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop
(SSW) in Grenoble, France, from 26th to 28th August 202
Controllable Emphasis with zero data for text-to-speech
We present a scalable method to produce high quality emphasis for
text-to-speech (TTS) that does not require recordings or annotations. Many TTS
models include a phoneme duration model. A simple but effective method to
achieve emphasized speech consists in increasing the predicted duration of the
emphasized word. We show that this is significantly better than spectrogram
modification techniques, improving both naturalness and testers' correct
identification of the emphasized word in a sentence on a reference
female en-US voice. We show that this technique significantly closes the gap to
methods that require explicit recordings. The method proved to be scalable and
preferred in all four languages tested (English, Spanish, Italian, German), for
different voices and multiple speaking styles.
Comment: In proceedings of the 12th Speech Synthesis Workshop (SSW) 202
Research data supporting "Combining I-vector Representation and Structured Neural Networks for Rapid Adaptation"
This work was supported by the EPSRC [grant number EP/I031022/1] and by IARPA
Discriminative training of a phoneme confusion model for a dynamic lexicon in ASR
To enhance the recognition lexicon, it is important to be able to add pronunciation variants while keeping the confusability introduced by the extra phonemic variation low. However, this confusability is not easily correlated with the ASR performance, as it is an inherent phenomenon of speech. This paper proposes a method to construct a multiple pronunciation lexicon with a high discriminability. To do so, a phoneme confusion model is used to expand the phonemic search space of pronunciation variants during ASR decoding, and a discriminative framework is adopted for training the weights of the phoneme confusions. For the parameter estimation, two training algorithms are implemented, the perceptron and the CRF model, using finite state transducers. Experiments on English data were conducted using a large state-of-the-art ASR system of continuous speech
Improving Interpretability and Regularisation in Deep Learning
The provided .ctm and scoring .sys files correspond to the MPE systems of Table VI (Javanese) and Table X (BN) of this paper
Supplementary data for "Speaker Diarisation and Linking in Multi-Genre Broadcast Data"
Details of audio data availability. Detailed diarisation output and scoring results for primary systems on the development and evaluation data for the MGB challenge. This work was supported by the EPSRC [grant number EP/I031022/1], Natural Speech Technology programme grant http://www.natural-speech-technology.org/, Cambridge Commonwealth, and the European & International Trust