Fast Speech in Unit Selection Speech Synthesis
Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020.
Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who rely on assistive speech technology, the ability to choose a fast speaking rate is reported to be essential. Expressive speech synthesis and other spoken language interfaces may also require the integration of fast speech. Architectures such as formant or diphone synthesis can produce synthetic speech at fast rates, but the generated speech does not sound very natural. Unit selection synthesis systems, by contrast, are capable of delivering more natural output. Nevertheless, fast speech has not yet been adequately implemented in such systems. The goal of the work presented here was therefore to determine an optimal strategy for modeling fast speech in unit selection speech synthesis, providing potential users with a more natural-sounding alternative for fast speech output.
Location, location: Enhancing the evaluation of text-to-speech synthesis using the rapid prosody transcription paradigm
Text-to-Speech synthesis systems are generally evaluated using Mean Opinion
Score (MOS) tests, where listeners score samples of synthetic speech on a
Likert scale. A major drawback of MOS tests is that they only offer a general
measure of overall quality (i.e., the naturalness of an utterance) and so cannot
tell us where exactly synthesis errors occur. This can make evaluation of the
appropriateness of prosodic variation within utterances inconclusive. To
address this, we propose a novel evaluation method based on the Rapid Prosody
Transcription paradigm. This allows listeners to mark the locations of errors
in an utterance in real time, providing a probabilistic representation of the
perceptual errors that occur in the synthetic signal. We conduct experiments
that confirm that the fine-grained evaluation can be mapped to system rankings
of standard MOS tests, but the error marking gives a much more comprehensive
assessment of synthesized prosody. In particular, for standard audiobook test
set samples, we see that error marks consistently cluster around words at major
prosodic boundaries indicated by punctuation. However, for question-answer
based stimuli, where we control information structure, we see differences
emerge in the ability of neural TTS systems to generate context-appropriate
prosodic prominence.
Comment: Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en
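The contrast the abstract draws between an utterance-level MOS and word-level error marking can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the data layout (one set of marked word indices per listener), and the toy values are all assumptions made for this example. It treats the "probabilistic representation" as the fraction of listeners who marked each word, which is one plausible reading of the abstract.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings (e.g. 1-5 Likert) into one utterance-level score."""
    return mean(ratings)


def error_mark_proportions(words, marks_per_listener):
    """Return, for each word, the fraction of listeners who marked it as an error.

    marks_per_listener holds one set of word indices per listener
    (a hypothetical layout chosen for this sketch).
    """
    n_listeners = len(marks_per_listener)
    return [
        (word, sum(i in marks for marks in marks_per_listener) / n_listeners)
        for i, word in enumerate(words)
    ]


# Toy, made-up data purely for illustration; not taken from the paper.
words = ["She", "left", "early", "today"]
marks = [{2}, {2, 3}, set(), {2}]   # word indices each listener marked
ratings = [3, 4, 3, 4]              # one overall rating per listener

mos = mean_opinion_score(ratings)
proportions = error_mark_proportions(words, marks)

print(f"MOS: {mos}")
for word, p in proportions:
    print(f"{word}: {p:.2f}")
```

The single MOS value says nothing about where listeners heard a problem, whereas the per-word proportions localize it, which is the gap the proposed method is meant to close.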