
13 research outputs found

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

Author: Drugman Thomas
Granero-Moya Marcel
Karanasou Penny
Karlapati Sri
Moinet Alexis
Peinelt Nicole
Schnell Bastian
Publication date: 04/09/2023

State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.

Comment: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop (SSW) in Grenoble, France, from 26th to 28th August 202

arXiv.org e-Print Archive

Controllable Emphasis with zero data for text-to-speech

Author: Abbas Ammar
Bonafonte Antonio
Drugman Thomas
Hussain Aman
Joly Arnaud
Karanasou Penny
Lajszczak Mateusz
Lombardi Alessandro
Moinet Alexis
Nicolis Marco
Peterova Ekaterina
Sharma Parul
Sokolova Elena
van Korlaar Arent
Publication date: 13/07/2023

We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasized word in a sentence by 40% on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.

Comment: In Proceedings of the 12th Speech Synthesis Workshop (SSW) 202

arXiv.org e-Print Archive

Recommended from our members

Research data supporting "Combining I-vector Representation and Structured Neural Networks for Rapid Adaptation"

Author: Gales Mark
Karanasou Penny
Wu Chunyang
Publication date: 01/01/2016

This work was supported by the EPSRC [grant number EP/I031022/1] and by IARPA

Apollo (Cambridge)

Discriminative training of a phoneme confusion model for a dynamic lexicon in ASR

Author: Karanasou Penny
Lamel Lori
Lavergne Thomas
Yvon François
Publication venue: HAL CCSD
Publication date: 01/01/2013

To enhance the recognition lexicon, it is important to be able to add pronunciation variants while keeping the confusability introduced by the extra phonemic variation low. However, this confusability is not easily correlated with the ASR performance, as it is an inherent phenomenon of speech. This paper proposes a method to construct a multiple pronunciation lexicon with a high discriminability. To do so, a phoneme confusion model is used to expand the phonemic search space of pronunciation variants during ASR decoding and a discriminative framework is adopted for the training of the weights of the phoneme confusions. For the parameter estimation, two training algorithms are implemented, the perceptron and the CRF model, using finite state transducers. Experiments on English data were conducted using a large state-of-the-art ASR system of continuous speech

HAL Descartes


Improving Interpretability and Regularisation in Deep Learning

Author: Gales Mark
Karanasou Penny
Ragni Anton
Sim Khe Chai
Wu C
Publication date: 07/02/2018

The provided .ctm and scoring .sys files correspond to the MPE systems of Table VI (Javanese) and Table X (BN) of this paper

Apollo (Cambridge)


Supplementary data for "Speaker Diarisation and Linking in Multi-Genre Broadcast Data"

Author: Gales Mark J.F.
Karanasou Penny
Lanchantin Pierre
Liu Xunying
Qian Yanmin
Wang Linlin
Woodland Philip C.
Zhang Chao
Publication venue: University of Cambridge
Publication date: 02/10/2015

Details of audio data availability. Detailed diarisation output and scoring results for primary systems on the development and evaluation data for the MGB challenge. “This work was supported by the EPSRC [grant number EP/I031022/1], Natural Speech Technology programme grant http://www.natural-speech-technology.org/, Cambridge Commonwealth, and the European & International Trust

Apollo (Cambridge)