104 research outputs found

SMaTTS: Standard Malay Text to Speech System

Author: Ahmad Zakiah Hanim
Gunawan Teddy Surya
Khalifa Othman Omran
Publication venue: 'International Research Publication House'
Publication date: 01/01/2007

This paper presents SMaTTS, a rule-based text-to-speech (TTS) synthesis system for Standard Malay (SM). The proposed system uses a sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of computer memory used, making the system very light and embeddable. The overall system comprises two phases. The first is Natural Language Processing (NLP), which consists of the high-level processing of text analysis, phonetic analysis, text normalization and a morphophonemic module. This module was designed specifically for SM to overcome a few problems in defining the rules for the SM orthography system before the text is passed to the DSP module. The second phase is Digital Signal Processing (DSP), which handles the low-level process of speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesis system with a light and user-friendly Graphical User Interface (GUI) is introduced. An SM phoneme set and an inclusive phone database have been carefully constructed for this phone-based speech synthesizer. By applying generative phonology, comprehensive letter-to-sound (LTS) rules and a pronunciation lexicon have been devised for SMaTTS. For evaluation, a Diagnostic Rhyme Test (DRT) word list was compiled, and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system and the room for improvement are thoroughly discussed.

CiteSeerX

The International Islamic University Malaysia Repository

Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish

Author: Külekci Oğuzhan M.
Publication date: 01/01/2006

The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on these representations performs statistical morphological disambiguation of Turkish with a recall as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, the problem arises of disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this pronunciation disambiguation problem and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase-level accentuation based on the content word/function word distinction. This approach seems easy and adequate for some right-headed languages such as English but is not suitable for languages such as Turkish. We then use a heuristic approach based on dependency parsing to mark up phrase boundaries as a basis for phrase-level accentuation in Turkish TTS synthesizers.

Sabanci University Research Database

Generation of prosody and speech for Mandarin Chinese

Author: DONG MINGHUI
Publication date: 19/02/2004

Ph.D. (Doctor of Philosophy)

ScholarBank@NUS

Prosody annotation systems for Chinese

Author: Гобова Є.
Publication venue: A. Yu. Krymskyi Institute of Oriental Studies of the National Academy of Sciences of Ukraine
Publication date: 01/01/2011

Scientific Electronic Library of Periodicals of the National Academy of Sciences of Ukraine (Vernadsky National Library of Ukraine)

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Author: Jiang Ziyue
Liu Jinglin
Ren Yi
Yang Qian
Ye Zhenhui
Zhao Zhou
Zhe Su
Publication date: 09/10/2022

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and to the many languages spoken worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations. The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective. The code is available at https://github.com/Zain-Jiang/Dict-TTS. Comment: Accepted by NeurIPS 202

arXiv.org e-Print Archive

Evaluation of automatic break insertion for an agglutinative and inflected language

Author: Eva Navas
Inmaculada Hernáez
Iñaki Sainz
Publication venue: 'Elsevier BV'

Crossref

Using Linguistic Features to Improve Prosody for Text-to-Speech

Author: Sloan Rose
Publication date: 01/01/2023

This thesis focuses on the problem of using text-to-speech (TTS) to synthesize speech with natural-sounding prosody. I propose a two-step process for approaching this problem. In the first step, I train text-based models to predict the locations of phrase boundaries and pitch accents in an utterance. Because these models use only text features, they can be used to predict the locations of prosodic events in novel utterances. In the second step, I incorporate these prosodic events into a text-to-speech pipeline in order to produce prosodically appropriate speech. I trained models for predicting phrase boundaries and pitch accents on utterances from a corpus of radio news data. I found that the strongest models used a large variety of features, including syntactic features, lexical features, word embeddings, and co-reference features. In particular, using a large variety of syntactic features improved performance on both tasks. These models also performed well when tested on a different corpus of news data. I then trained similar models on two conversational corpora: one a corpus of task-oriented dialogs and one a corpus of open-ended conversations. I again found that I could train strong models by using a wide variety of linguistic features, although performance dropped slightly in cross-corpus applications, and performance was very poor in cross-genre applications. For conversational speech, syntactic features continued to be helpful for both tasks. Additionally, word embedding features were particularly helpful in the conversational domain. Interestingly, while it is generally believed that given information (i.e., terms that have recently been referenced) is often de-accented, for all three corpora, I found that including co-reference features only slightly improved the pitch accent detection model. I then trained a TTS system on the same radio news corpus using Merlin, an open source DNN-based toolkit for TTS. 
As Merlin includes a linguistic feature extraction step before training, I added two additional features: one for phrase boundaries (distinguishing between sentence boundaries and mid-sentence phrase boundaries) and one for pitch accents. The locations of all breaks and accents for all test and training data were determined using the text-based prosody prediction models. I found that the pipeline using these new features produced speech that slightly outperformed the baseline on objective metrics such as mel-cepstral distortion (MCD) and was greatly preferred by listeners in a subjective listening test. Finally, I trained an end-to-end TTS system on data that included phrase boundaries. The model was trained on a corpus of read speech, with the locations of phrase boundaries predicted based on acoustic features, and tested on radio news stories, with phrase boundaries predicted using the text-based model. I found that including phrase boundaries lowered MCD between the synthesized speech and the original radio broadcast, as compared to the baseline, but the results of a listening test were inconclusive.

Columbia University Academic Commons