Fast Speech in Unit Selection Speech Synthesis
Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020.
Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who rely on assistive speech technology, the option to choose a fast speaking rate is reported to be essential. Expressive speech synthesis and other spoken language interfaces may also require an integration of fast speech. Architectures such as formant or diphone synthesis can produce synthetic speech at fast rates, but the generated speech does not sound very natural. Unit selection synthesis systems, by contrast, are capable of delivering more natural output; nevertheless, fast speech has not been adequately implemented in such systems to date. The goal of the work presented here was therefore to determine an optimal strategy for modeling fast speech in unit selection speech synthesis, providing potential users with a more natural sounding alternative for fast speech output.
Pronunciation modelling in end-to-end text-to-speech synthesis
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high naturalness scores without extensive processing of the text input. Since S2S models have been proposed for multiple stages of the TTS pipeline, the field has focused on collapsing the pipeline toward end-to-end (E2E) TTS, where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon lookup and/or G2P modelling) could be learnt implicitly by a text encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes the contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar naturalness scores with text or phone input (e.g. as in [4]). Successful modelling of phonetic context has led some to question the benefit of using phone rather than text input altogether (see [5]).
The use of text input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone input, an S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. Using common datasets for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names), since the knowledge needed to disambiguate their pronunciations may not be provided by the local grapheme context and may lie beyond that contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations in E2E-TTS with text input were observed. Following the proposed benefits of subword decomposition in S2S modelling for other language tasks (e.g. neural machine translation), the effects of morphological decomposition on pronunciation modelling were investigated. Learning of the French post-lexical phenomenon liaison was also evaluated.
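As a rough illustration of how implicit G2P output can be scored against lexicon pronunciations, the sketch below computes a phone error rate (PER) from phone-level edit distance. The function names and the two example entries are hypothetical, not taken from the thesis; ARPAbet-style phone symbols are assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)]


def phone_error_rate(lexicon, predicted):
    """PER: total edit operations divided by total reference phones."""
    errors = sum(edit_distance(lexicon[w], predicted[w]) for w in lexicon)
    total = sum(len(phones) for phones in lexicon.values())
    return errors / total


# Toy reference pronunciations and model predictions for two words
# whose spellings give little local grapheme evidence.
lexicon = {"sioux": ["S", "UW"], "choir": ["K", "W", "AY", "ER"]}
predicted = {"sioux": ["S", "IY", "AH", "K", "S"], "choir": ["CH", "OY", "R"]}
print(f"PER: {phone_error_rate(lexicon, predicted):.2f}")
```

A lower PER indicates closer agreement with the lexicon; comparing a simulated implicit G2P model against a lexicon-based one on held-out words would use this kind of aggregate measure.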
With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility was investigated, through a re-evaluation of six years of results from the Blizzard Challenge. In controlled conditions for English, ASR reliably found significant differences between systems similar to those found by paid listeners. An analysis of transcriptions of words exhibiting difficult-to-predict G2P relations was also conducted; the E2E-ASR Transformer model used was found to transcribe such words unreliably, producing homophone substitutions and outright mistranscriptions. A further evaluation of representation mixing in Tacotron found that pronunciation correction is possible when mixing text and phone inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.
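The ASR-as-evaluator setup can be illustrated minimally: transcribe each synthetic utterance with an ASR system and score the transcript against the input text with word error rate (WER). The sketch below assumes transcripts are already available; the system names and sentences are invented placeholders, not Blizzard Challenge data.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / #reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)


# (input text, ASR transcript of the synthesised audio) per utterance.
systems = {
    "system_A": [("the birch canoe slid on the smooth planks",
                  "the birch canoe slid on the smooth planks")],
    "system_B": [("the birch canoe slid on the smooth planks",
                  "the birch canoes lid on the smoothed planks")],
}
for name, pairs in systems.items():
    mean_wer = sum(wer(r, h) for r, h in pairs) / len(pairs)
    print(f"{name}: mean WER = {mean_wer:.2f}")
```

Per-system mean WER then plays the role of listener transcription accuracy, and significance tests between systems can be run over the per-utterance scores.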
Overcoming the limitations of statistical parametric speech synthesis
At the time of beginning this thesis, statistical parametric speech synthesis (SPSS)
using hidden Markov models (HMMs) was the dominant synthesis paradigm within the
research community. SPSS systems are effective at generalising across the linguistic
contexts present in training data to account for inevitable unseen linguistic contexts at
synthesis-time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. The literature contains many hypotheses about the causes of this reduced synthesis quality, and about the improvements consequently required. Until this thesis, however, these hypothesised causes were rarely tested.
This thesis makes two types of contributions to the field of speech synthesis; each
of these appears in a separate part of the thesis. Part I introduces a methodology for
testing hypothesised causes of limited quality within HMM speech synthesis systems.
This investigation aims to identify what causes these systems to fall short of natural
speech. Part II uses the findings from Part I of the thesis to make informed improvements
to speech synthesis.
The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause and then construct a new system with the aim of removing that cause. However, this is typically done without first testing whether the hypothesised cause is genuine. As such, even if improvements in synthesis quality are observed, it remains unknown whether a real underlying issue has been fixed or merely a minor one. In contrast, in Part I of the thesis I perform a wide range of perceptual tests to discover the real underlying causes of reduced quality in HMM synthesis and the degree to which each contributes.
Using the knowledge gained in Part I, Part II then looks to improve synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first follows from the identification of averaging across differing linguistic contexts, a practice typically performed during decision-tree regression in HMM synthesis, as a major contributing factor to reduced synthesis quality. A system which removes averaging across differing linguistic contexts and instead averages only across matching linguistic contexts (called rich-context synthesis) is therefore investigated. The second improvement follows from the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. The hybrid synthesis paradigm is therefore investigated: such systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of the style of perceptual testing conducted in the thesis.
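As a toy illustration of the averaging contrast behind rich-context synthesis, the sketch below averages an acoustic feature two ways: within decision-tree clusters that pool differing linguistic contexts, and only within matching full-context labels. The context labels, cluster ids, and feature values are invented for illustration and do not come from the thesis.

```python
from collections import defaultdict

# (full_context_label, tree_cluster_id, acoustic_feature) per frame.
frames = [
    ("a-b+c", "cluster1", 1.0),
    ("a-b+c", "cluster1", 1.2),
    ("x-b+y", "cluster1", 3.0),  # different context, same tree cluster
]


def average_by(frames, key_index):
    """Group frames by the chosen key and average the feature per group."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame[key_index]].append(frame[2])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


print("tree-cluster averages:", average_by(frames, 1))  # pools contexts
print("rich-context averages:", average_by(frames, 0))  # matching only
```

In the toy data, the tree-cluster average mixes the outlying context into the estimate, while the rich-context averages keep it separate, mirroring the over-smoothing that the thesis identifies as a major cause of reduced quality.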