231 research outputs found
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays
an important role in producing natural and intelligible speech. Although
inter-utterance linguistic information can influence the speech interpretation
of the target utterance, previous works on PSP mainly focus on utilizing
intrautterance linguistic information of the current utterance only. This work
proposes to use inter-utterance linguistic information to improve the
performance of PSP. Multi-level contextual information, which includes both
inter-utterance and intrautterance linguistic information, is extracted by a
hierarchical encoder from character level, utterance level and discourse level
of the input text. Then a multi-task learning (MTL) decoder predicts prosodic
boundaries from multi-level contextual information. Objective evaluation
results on two datasets show that our method achieves better F1 scores in
predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase
(IPH). It demonstrates the effectiveness of using multi-level contextual
information for PSP. Subjective preference tests also indicate the naturalness
of synthesized speeches are improved.Comment: Accepted by Interspeech202
Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model
This paper explores the potential of constructing an AI spoken dialogue
system that "thinks how to respond" and "thinks how to speak" simultaneously,
which more closely aligns with the human speech production process compared to
the current cascade pipeline of independent chatbot and Text-to-Speech (TTS)
modules. We hypothesize that Large Language Models (LLMs) with billions of
parameters possess significant speech understanding capabilities and can
jointly model dialogue responses and linguistic features. We conduct two sets
of experiments: 1) Prosodic structure prediction, a typical front-end task in
TTS, demonstrating the speech understanding ability of LLMs, and 2) Further
integrating dialogue response and a wide array of linguistic features using a
unified encoding format. Our results indicate that the LLM-based approach is a
promising direction for building unified spoken dialogue systems
Tone labelling algorithm for Sesotho
M.Sc., Faculty of Science, University of the Witwatersrand, 2011Studies have shown that text-to-speech systems need detailed prosodic models of a language in order to
ideally sound natural to native speakers of the language. A text-to-speech system developed for Sesotho
needs to have tone implemented in it since Sesotho is a tonal language which uses pitch variations to
distinguish lexical and/or grammatical meaning.
In order to implement tone for a language such as Sesotho, it is necessary for a tone modeling
algorithm to receive as input the tone labels of the syllables of a word. This allows the algorithm to
predict the appropriate intonation of the word. The aim of our study is to improve a basic tone labeling
algorithm that predicts tone labels using three Sesotho tonal rules. The application of this algorithm
is restricted to polysyllabic verb stems. The research study involves implementing an extended tone
labeling algorithm that implements four additional Sesotho tonal rules and extends its application to all
the other parts of speech.
The results of our study show that the extended tone labeling algorithm significantly improves the
basic algorithm by increasing the number of matched tone labels. Furthermore, our study provides the
basic step to tone modeling for languages such as Sesotho which do not mark tone labels in orthography
Prosody analysis and modeling for Cantonese text-to-speech.
Li Yu Jia.Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.Includes bibliographical references.Abstracts in English and Chinese.Chapter Chapter 1 --- Introduction --- p.1Chapter 1.1. --- TTS Technology --- p.1Chapter 1.2. --- Prosody --- p.2Chapter 1.2.1. --- What is Prosody --- p.2Chapter 1.2.2. --- Prosody from Different Perspectives --- p.3Chapter 1.2.3. --- Acoustical Parameters of Prosody --- p.3Chapter 1.2.4. --- Prosody in TTS --- p.5Chapter 1.2.4.1 --- Analysis --- p.5Chapter 1.2.4.2 --- Modeling --- p.6Chapter 1.2.4.3 --- Evaluation --- p.6Chapter 1.3. --- Thesis Objectives --- p.7Chapter 1.4. --- Thesis Outline --- p.7Reference --- p.8Chapter Chapter 2 --- Cantonese --- p.9Chapter 2.1. --- The Cantonese Dialect --- p.9Chapter 2.1.1. --- Phonology --- p.10Chapter 2.1.1.1 --- Initial --- p.11Chapter 2.1.1.2 --- Final --- p.12Chapter 2.1.1.3 --- Tone --- p.13Chapter 2.1.2. --- Phonological Constraints --- p.14Chapter 2.2. --- Tones in Cantonese --- p.15Chapter 2.2.1. --- Tone System --- p.15Chapter 2.2.2. --- Linguistic Significance --- p.18Chapter 2.2.3. --- Acoustical Realization --- p.18Chapter 2.3. --- Prosodic Variation in Continuous Cantonese Speech --- p.20Chapter 2.4. --- Cantonese Speech Corpus - CUProsody --- p.21Reference --- p.23Chapter Chapter 3 --- F0 Normalization --- p.25Chapter 3.1. --- F0 in Speech Production --- p.25Chapter 3.2. --- F0 Extraction --- p.27Chapter 3.3. --- Duration-normalized Tone Contour --- p.29Chapter 3.4. --- F0 Normalization --- p.30Chapter 3.4.1. --- Necessity and Motivation --- p.30Chapter 3.4.2. --- F0 Normalization --- p.33Chapter 3.4.2.1 --- Methodology --- p.33Chapter 3.4.2.2 --- Assumptions --- p.34Chapter 3.4.2.3 --- Estimation of Relative Tone Ratios --- p.35Chapter 3.4.2.4 --- Derivation of Phrase Curve --- p.37Chapter 3.4.2.5 --- Normalization of Absolute FO Values --- p.39Chapter 3.4.3. --- Experiments and Discussion --- p.39Chapter 3.5. --- Conclusions --- p.44Reference --- p.45Chapter Chapter 4 --- Acoustical FO Analysis --- p.48Chapter 4.1. --- Methodology of FO Analysis --- p.48Chapter 4.1.1. --- Analysis-by-Synthesis --- p.48Chapter 4.1.2. --- Acoustical Analysis --- p.51Chapter 4.2. --- Acoustical FO Analysis for Cantonese --- p.52Chapter 4.2.1. --- Analysis of Phrase Curves --- p.52Chapter 4.2.2. --- Analysis of Tone Contours --- p.55Chapter 4.2.2.1 --- Context-independent Single-tone Contours --- p.56Chapter 4.2.2.2 --- Contextual Variation --- p.58Chapter 4.2.2.3 --- Co-articulated Tone Contours of Disyllabic Word --- p.59Chapter 4.2.2.4 --- Cross-word Contours --- p.62Chapter 4.2.2.5 --- Phrase-initial Tone Contours --- p.65Chapter 4.3. --- Summary --- p.66Reference --- p.67Chapter Chapter5 --- Prosody Modeling for Cantonese Text-to-Speech --- p.70Chapter 5.1. --- Parametric Model and Non-parametric Model --- p.70Chapter 5.2. --- Cantonese Text-to-Speech: Baseline System --- p.72Chapter 5.2.1. --- Sub-syllable Unit --- p.72Chapter 5.2.2. --- Text Analysis Module --- p.73Chapter 5.2.3. --- Acoustical Synthesis --- p.74Chapter 5.2.4. --- Prosody Module --- p.74Chapter 5.3. --- Enhanced Prosody Model --- p.74Chapter 5.3.1. --- Modeling Tone Contours --- p.75Chapter 5.3.1.1 --- Word-level FO Contours --- p.76Chapter 5.3.1.2 --- Phrase-initial Tone Contours --- p.77Chapter 5.3.1.3 --- Tone Contours at Word Boundary --- p.78Chapter 5.3.2. --- Modeling Phrase Curves --- p.79Chapter 5.3.3. --- Generation of Continuous FO Contours --- p.81Chapter 5.4. --- Summary --- p.81Reference --- p.82Chapter Chapter 6 --- Performance Evaluation --- p.83Chapter 6.1. --- Introduction to Perceptual Test --- p.83Chapter 6.1.1. --- Aspects of Evaluation --- p.84Chapter 6.1.2. --- Methods of Judgment Test --- p.84Chapter 6.1.3. --- Problems in Perceptual Test --- p.85Chapter 6.2. --- Perceptual Tests for Cantonese TTS --- p.86Chapter 6.2.1. --- Intelligibility Tests --- p.86Chapter 6.2.1.1 --- Method --- p.86Chapter 6.2.1.2 --- Results --- p.88Chapter 6.2.1.3 --- Analysis --- p.89Chapter 6.2.2. --- Naturalness Tests --- p.90Chapter 6.2.2.1 --- Word-level --- p.90Chapter 6.2.2.1.1 --- Method --- p.90Chapter 6.2.2.1.2 --- Results --- p.91Chapter 6.2.3.1.3 --- Analysis --- p.91Chapter 6.2.2.2 --- Sentence-level --- p.92Chapter 6.2.2.2.1 --- Method --- p.92Chapter 6.2.2.2.2 --- Results --- p.93Chapter 6.2.2.2.3 --- Analysis --- p.94Chapter 6.3. --- Conclusions --- p.95Chapter 6.4. --- Summary --- p.95Reference --- p.96Chapter Chapter 7 --- Conclusions and Future Work --- p.97Chapter 7.1. --- Conclusions --- p.97Chapter 7.2. --- Suggested Future Work --- p.99Appendix --- p.100Appendix 1 Linear Regression --- p.100Appendix 2 36 Templates of Cross-word Contours --- p.101Appendix 3 Word List for Word-level Tests --- p.102Appendix 4 Syllable Occurrence in Word List of Intelligibility Test --- p.108Appendix 5 Wrongly Identified Word List --- p.112Appendix 6 Confusion Matrix --- p.115Appendix 7 Unintelligible Word List --- p.117Appendix 8 Noisy Word List --- p.119Appendix 9 Sentence List for Naturalness Test --- p.12
PoeticTTS -- Controllable Poetry Reading for Literary Studies
Speech synthesis for poetry is challenging due to specific intonation
patterns inherent to poetic speech. In this work, we propose an approach to
synthesise poems with almost human like naturalness in order to enable literary
scholars to systematically examine hypotheses on the interplay between text,
spoken realisation, and the listener's perception of poems. To meet these
special requirements for literary studies, we resynthesise poems by cloning
prosodic values from a human reference recitation, and afterwards make use of
fine-grained prosody control to manipulate the synthetic speech in a
human-in-the-loop setting to alter the recitation w.r.t. specific phenomena. We
find that finetuning our TTS model on poetry captures poetic intonation
patterns to a large extent which is beneficial for prosody cloning and
manipulation and verify the success of our approach both in an objective
evaluation as well as in human studies.Comment: Accepted to Interspeech 202
- …