1 research outputs found
Controllable Emphasis with zero data for text-to-speech
We present a scalable method to produce high quality emphasis for
text-to-speech (TTS) that does not require recordings or annotations. Many TTS
models include a phoneme duration model. A simple but effective method to
achieve emphasized speech consists in increasing the predicted duration of the
emphasised word. We show that this is significantly better than spectrogram
modification techniques improving naturalness by and correct testers'
identification of the emphasized word in a sentence by on a reference
female en-US voice. We show that this technique significantly closes the gap to
methods that require explicit recordings. The method proved to be scalable and
preferred in all four languages tested (English, Spanish, Italian, German), for
different voices and multiple speaking styles.Comment: In proceeding of 12th Speech Synthesis Workshop (SSW) 202