9 research outputs found

    Statistical parametric speech synthesis using conversational data and phenomena

    Get PDF
    Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation something which is particularly bad for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind creating more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices the naturalness of spontaneous speech based voices still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for the mixing of standard prompts with spontaneous utterances in order to retain the higher quality of standard speech based voices while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage. The various approaches are evaluated and their improvements documented throughout the thesis, however, at the end the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the compilation of all developed methods

    Pronunciation modelling in end-to-end text-to-speech synthesis

    Get PDF
    Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high-quality naturalness scores without extensive processing of text-input. Since S2S models have been proposed in multiple aspects of the TTS pipeline, the field has focused on embedding the pipeline toward End-to-End (E2E-) TTS where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2ETTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon-lookup and/or G2P modelling) could be implicitly learnt in a text-encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which make contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar naturalness scores with text- or phone-input (e.g. as in [4]). Successful modelling of phonetic context has led some to question the benefit of using phone- instead of text-input altogether (see [5]). The use of text-input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone-input, a S2S encoder learns an implicit grapheme-tophoneme (G2P) model from text-audio pairs during training. With common datasets for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names) since the knowledge to disambiguate their pronunciations may not be provided by the local grapheme context and may require knowledge beyond that contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations in E2E-TTS with text-input were observed. Following the proposed benefits of subword decomposition in S2S modelling in other language tasks (e.g. neural machine translation), the effects of morphological decomposition were investigated on pronunciation modelling. Learning of the French post-lexical phenomenon liaison was also evaluated. With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) to measure TTS intelligibility was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge was conducted. ASR reliably found similar significant differences between systems as paid listeners in controlled conditions in English. An analysis of transcriptions for words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR Transformer model used was found to be unreliable in its transcription of difficult G2P relations due to homophonic transcription and incorrect transcription of words with difficult G2P relations. A further evaluation of representation mixing in Tacotron finds pronunciation correction is possible when mixing text- and phone-inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide since it can provide assurances that G2P generalisation cannot

    Conversational Arabic Automatic Speech Recognition

    Get PDF
    Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are considered generally to be mother-tongues in those regions. CA has limited textual resource because it exists only as a spoken language and without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed that has limitations in phonetically representing CA. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the MSA complex word structure where words can be created from attaching affixes to a word. In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main issues for the under-performance of CA ASR systems. The work focuses on two directions: first, the impact of limited lexical coverage, and insufficient training data for written CA on language modelling is investigated; second, obtaining better models for the acoustics and pronunciations by learning to transfer between written and spoken forms. Several original contributions result from each direction. Using data-driven classes from decomposed text are shown to reduce out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task

    Proceedings of the 19th Sound and Music Computing Conference

    Get PDF
    Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France). https://smc22.grame.f
    corecore