Speech synthesis, Speech simulation and speech science
Speech synthesis research has been transformed in recent years through the exploitation of speech corpora - both for statistical modelling and as a source of signals for concatenative synthesis. This revolution in methodology, and the new techniques it brings, call into question the received wisdom that better computer voice output will come from a better understanding of how humans produce speech. This paper discusses the relationship between this new technology of simulated speech and the traditional aims of speech science. The paper suggests that the goal of speech simulation frees engineers from inadequate linguistic and physiological descriptions of speech. At the same time, it leaves speech scientists free to return to their proper goal: building a computational model of human speech production.
Prosody Modelling in Concept-to-Speech Generation: Methodological Issues
We explore three issues in the development of concept-to-speech (CTS) systems. We identify information available in a language-generation system that has the potential to impact prosody; investigate the role played by different corpora in CTS prosody modelling; and explore different methodologies for learning how linguistic features impact prosody. Our major focus is the comparison of two machine learning methodologies: generalized rule induction and memory-based learning. We describe this work in the context of MAGIC (Multimedia Abstract Generation for Intensive Care), a system that produces multimedia briefings on the status of patients who have just undergone a bypass operation.
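The contrast between the two methodologies can be made concrete with a minimal, hypothetical sketch (not the MAGIC system's actual code): a decision tree stands in for generalized rule induction, and a 1-nearest-neighbour classifier for memory-based learning. The feature encoding and data below are invented for illustration.

```python
# Hypothetical sketch: rule induction vs. memory-based learning for a
# prosodic decision, using scikit-learn stand-ins for the two paradigms.
from sklearn.tree import DecisionTreeClassifier      # rule-induction stand-in
from sklearn.neighbors import KNeighborsClassifier   # memory-based learning

# Invented per-word linguistic features (e.g., part-of-speech id,
# given/new status, position in sentence) and a prosodic label
# (0 = no pitch accent, 1 = pitch accent).
X = [[1, 0, 0], [2, 1, 3], [1, 1, 5], [3, 0, 2]]
y = [0, 1, 1, 0]

# Rule induction compresses the training data into explicit decision rules.
rule_model = DecisionTreeClassifier().fit(X, y)

# Memory-based learning stores the examples and classifies by similarity.
memory_model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

new_word = [[2, 1, 4]]
print(rule_model.predict(new_word), memory_model.predict(new_word))
```

The design difference matters for CTS: induced rules are compact and inspectable, while memory-based learning keeps every training instance and can retain exceptional cases that rule pruning would discard.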
Using generative modelling to produce varied intonation for speech synthesis
Unlike human speakers, typical text-to-speech (TTS) systems are unable to produce multiple distinct renditions of a given sentence. This has previously been addressed by adding explicit external control. In contrast, generative models are able to capture a distribution over multiple renditions and thus produce varied renditions using sampling. Typical neural TTS models learn the average of the data because they minimise mean squared error. In the context of prosody, taking the average produces flatter, more boring speech: an "average prosody". A generative model that can synthesise multiple prosodies will, by design, not model average prosody. We use variational autoencoders (VAEs), which explicitly place the most "average" data close to the mean of the Gaussian prior. We propose that by moving towards the tails of the prior distribution, the model will transition towards generating more idiosyncratic, varied renditions. Focusing here on intonation, we investigate the trade-off between naturalness and intonation variation and find that typical acoustic models can either be natural or varied, but not both. However, sampling from the tails of the VAE prior produces much more varied intonation than the traditional approaches, whilst maintaining the same level of naturalness.

Comment: Accepted for the 10th ISCA Speech Synthesis Workshop (SSW10).
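The tail-sampling idea can be illustrated with a minimal, hypothetical sketch (not the paper's code): latent vectors are drawn from the standard-normal VAE prior and placed at a chosen distance from its mean, so small radii approximate "average prosody" while larger radii reach into the tails. The latent dimensionality and the decoder interface are assumptions.

```python
# Hypothetical sketch: drawing latent prosody vectors at a fixed radius
# from the mean of a VAE's standard-normal prior N(0, I).
import torch

LATENT_DIM = 16  # assumed latent size

def sample_prior(batch: int, radius: float) -> torch.Tensor:
    """Sample latents at distance `radius` from the prior mean.

    Small radii stay near the mean (flat, "average" prosody); large
    radii move into the tails, which the abstract associates with
    more idiosyncratic, varied renditions.
    """
    z = torch.randn(batch, LATENT_DIM)        # random directions from N(0, I)
    z = z / z.norm(dim=-1, keepdim=True)      # normalise to unit vectors
    return z * radius                          # place them on the chosen shell

average_z = sample_prior(batch=4, radius=0.1)  # near the mean
varied_z = sample_prior(batch=4, radius=6.0)   # tail samples
# f0_contours = decoder(varied_z, text_encoding)  # decoder is assumed
```

For an isotropic Gaussian in 16 dimensions, typical samples land near radius sqrt(16) = 4, so a radius of 6 in this sketch sits in the tail region the abstract describes.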