Glottal Source and Prosodic Prominence Modelling in HMM-based Speech Synthesis for the Blizzard Challenge 2009
This paper describes the CSTR entry for the Blizzard Challenge 2009. The work focused on modifying two parts of the Nitech 2005 HTS speech synthesis system to improve naturalness and contextual appropriateness. The first part incorporated an implementation of the Liljencrants-Fant (LF) glottal source model. The second part focused on improving the synthesis of prosodic prominence, including emphasis, through context-dependent phonemes. Emphasis was assigned to the synthesised test sentences based on a handful of theory-based rules. The two parts (LF-model and prosodic prominence) were not combined and were hence evaluated separately. The results on naturalness for the LF-model showed that it is not yet perceived as being as natural as the Benchmark HTS system for neutral speech. The results for the prosodic prominence modelling showed that it was perceived to be as contextually appropriate as the Benchmark HTS system, despite a low naturalness score. The Blizzard Challenge evaluation has provided valuable information on the status of our work, and continued work will begin with analysing why our modifications resulted in reduced naturalness compared to the Benchmark HTS system.
HMM-based speech synthesiser using the LF-model of the glottal source
A major factor causing deterioration of speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach: using an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found that the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation supported this finding, showing a 55.6% preference for the new system over the baseline. This improvement, while not as significant as we had initially expected, encourages us to develop the proposed speech synthesiser further.
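The abstract gives no implementation details of the glottal source model; as a rough illustration of the kind of excitation signal involved, here is a simplified numpy sketch of one LF-style glottal flow derivative pulse. The parameter values (tp, te, ta, the growth factor alpha) are illustrative assumptions, and a full LF implementation would solve implicit equations for alpha and the return-phase constant, which is skipped here.

```python
import numpy as np

def lf_like_pulse(fs=16000, T0=0.010, tp=0.0045, te=0.0060,
                  ta=0.0003, Ee=1.0, alpha=300.0):
    """Simplified LF-style glottal flow derivative for one pitch period.

    Open phase: exponentially growing sinusoid whose zero slope falls at tp
    (peak glottal flow) and whose main negative excitation -Ee falls at te.
    Return phase: exponential decay with effective duration ta. alpha is
    fixed rather than solved for, so this pulse does not satisfy the LF
    area balance -- illustration only.
    """
    t = np.arange(int(round(T0 * fs))) / fs
    wg = np.pi / tp                                    # flow peaks at tp
    # scale so the open phase reaches exactly -Ee at te
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
    e = np.zeros_like(t)
    open_ph = t <= te
    e[open_ph] = E0 * np.exp(alpha * t[open_ph]) * np.sin(wg * t[open_ph])
    ret = ~open_ph
    eps = 1.0 / ta                                     # crude return-phase constant
    e[ret] = -Ee * (np.exp(-eps * (t[ret] - te)) - np.exp(-eps * (T0 - te)))
    return t, e
```

Repeating such pulses at the target F0, instead of delta pulses, is the kind of excitation the paper's approach replaces.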
A Multi-Level Context-Dependent Prosodic Model applied to duration modeling
This paper presents a multi-level context-dependent prosodic model based on the estimation of prosodic parameters over a set of well-defined linguistic units. Different linguistic units are used to represent different scales of prosodic variation (local and global forms) and thus to estimate the linguistic factors that can explain the variation of prosodic parameters independently at each level. This model is applied to the modelling of syllable-based durational parameters on two read-speech corpora: laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organisation of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error. Index Terms: speech synthesis, prosody, multi-level model, context-dependent model
Towards Personalized Synthesized Voices for Individuals with Vocal Disabilities: Voice Banking and Reconstruction
When individuals lose the ability to produce their own speech, due to degenerative diseases such as motor neurone disease (MND) or Parkinson's, they lose not only a functional means of communication but also a display of their individual and group identity. In order to build personalized synthetic voices, attempts have been made to capture the voice before it is lost, using a process known as voice banking. But for some patients, speech deterioration frequently coincides with or quickly follows diagnosis. Using HMM-based speech synthesis, it is now possible to build personalized synthetic voices from minimal data recordings and even from disordered speech. The power of this approach is that the patient's recordings can be used to adapt existing voice models pre-trained on many speakers. When the speech has begun to deteriorate, the adapted voice model can be further modified to compensate for the disordered characteristics found in the patient's speech. The University of Edinburgh has initiated a project for voice banking and reconstruction based on this speech synthesis technology. At the current stage of the project, more than fifteen patients with MND have been recorded, and five of them have received a reconstructed voice. In this paper, we present an overview of the project as well as subjective assessments of the reconstructed voices and feedback from patients and their families.
Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or on whether they can generate meaningfully distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive-aggressive, and upset.
Comment: Published at the 10th ISCA International Conference on Speech Prosody (SP2020).
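The abstract does not describe how its k-means baseline derives discrete codes from continuous latents; the general idea can be sketched with a toy numpy k-means over phrase-level latent vectors, where each cluster centre plays the role of a discrete "intonation code" (the variable names and the 2-D latents are illustrative assumptions, not the paper's setup).

```python
import numpy as np

def kmeans_codes(z, k, iters=50, seed=0):
    """Toy k-means over phrase-level latents z of shape (n, d).

    Returns the k cluster centres (usable as discrete "intonation codes")
    and the code index assigned to each phrase.
    """
    rng = np.random.default_rng(seed)
    centres = z[rng.choice(len(z), size=k, replace=False)]  # random init
    for _ in range(iters):
        # assign each latent to its nearest centre (squared Euclidean)
        codes = np.argmin(((z[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        # recompute each centre as the mean of its assigned latents
        for j in range(k):
            if (codes == j).any():
                centres[j] = z[codes == j].mean(0)
    return centres, codes
```

A multi-modal latent model replaces these hard nearest-centre assignments with mode centres learned jointly with the latent space, which is what the paper finds gives perceptually more distinct codes.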
Using generative modelling to produce varied intonation for speech synthesis
Unlike human speakers, typical text-to-speech (TTS) systems are unable to produce multiple distinct renditions of a given sentence. This has previously been addressed by adding explicit external control. In contrast, generative models are able to capture a distribution over multiple renditions and can thus produce varied renditions using sampling. Typical neural TTS models learn the average of the data because they minimise mean squared error. In the context of prosody, taking the average produces flatter, more boring speech: an "average prosody". A generative model that can synthesise multiple prosodies will, by design, not model average prosody. We use variational autoencoders (VAEs), which explicitly place the most "average" data close to the mean of the Gaussian prior. We propose that, by moving towards the tails of the prior distribution, the model will transition towards generating more idiosyncratic, varied renditions. Focusing here on intonation, we investigate the trade-off between naturalness and intonation variation, and find that typical acoustic models can either be natural or varied, but not both. However, sampling from the tails of the VAE prior produces much more varied intonation than the traditional approaches, whilst maintaining the same level of naturalness.
Comment: Accepted for the 10th ISCA Speech Synthesis Workshop (SSW10).