
Controllable Emphasis with zero data for text-to-speech

Authors: Abbas Ammar; Bonafonte Antonio; Drugman Thomas; Hussain Aman; Joly Arnaud; Karanasou Penny; Lajszczak Mateusz; Lombardi Alessandro; Moinet Alexis; Nicolis Marco; Peterova Ekaterina; Sharma Parul; Sokolova Elena; van Korlaar Arent
Publication date: 13/07/2023

We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective way to achieve emphasized speech is to increase the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and was preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.

Comment: In proceedings of the 12th Speech Synthesis Workshop (SSW) 202
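The duration-based approach described in the abstract can be illustrated with a minimal sketch. It assumes the TTS model exposes per-phoneme predicted durations and a mapping from words to phoneme ranges. The function name `emphasize_durations`, the data layout, and the scaling factor are hypothetical and are not taken from the paper, which does not state them here.

```python
from typing import List, Tuple


def emphasize_durations(
    durations: List[float],
    word_spans: List[Tuple[int, int]],
    word_index: int,
    factor: float = 1.25,  # assumed value; the abstract gives no factor
) -> List[float]:
    """Return a copy of `durations` with one word's phonemes lengthened.

    `word_spans[k]` is the half-open phoneme index range [start, end)
    covered by word k (a hypothetical layout chosen for this sketch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    start, end = word_spans[word_index]
    return [
        d * factor if start <= i < end else d
        for i, d in enumerate(durations)
    ]


# Made-up predicted durations (in frames) for a three-word utterance.
durations = [4.0, 6.0, 5.0, 8.0, 3.0, 7.0]
word_spans = [(0, 2), (2, 4), (4, 6)]

# Lengthen the phonemes of the second word only.
emphasized = emphasize_durations(durations, word_spans, word_index=1, factor=1.5)
print(emphasized)  # [4.0, 6.0, 7.5, 12.0, 3.0, 7.0]
```

In a real system the modified durations would then be passed on to the acoustic model in place of the original predictions. That step depends on the specific TTS architecture and is not shown.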

arXiv.org e-Print Archive