Development of Text-To-Speech System for Latvian
Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007.
Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit.
University of Tartu, Tartu, 2007.
ISBN 978-9985-4-0513-0 (online)
ISBN 978-9985-4-0514-7 (CD-ROM)
pp. 67-72
Time-domain concatenative text-to-speech synthesis.
A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms, and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures.
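The single-duration scheme described above can be illustrated with a minimal sketch. It assumes waveforms are plain lists of samples and durations are given in samples; the function name `concatenate` and the sample values are hypothetical, not taken from the thesis.

```python
def concatenate(segments, durations):
    """Abut the initial durations[i] samples of each stored waveform."""
    if len(segments) != len(durations):
        raise ValueError("one duration per segment is required")
    output = []
    for waveform, n_samples in zip(segments, durations):
        # Keep only the initial portion; clamp to the stored length.
        keep = max(0, min(n_samples, len(waveform)))
        output.extend(waveform[:keep])
    return output


# Hypothetical stored waveforms (e.g. one CV unit and one VC unit).
cv_unit = [1, 2, 3, 4]
vc_unit = [5, 6, 7]
result = concatenate([cv_unit, vc_unit], [2, 3])
```

No smoothing is applied at the join here; the abstract's point is that a suitable choice of durations reduces the need for it.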
For text-to-speech considerations, a segmental timing system is described, which predicts segmental durations at the word level, using a timing database and a pattern matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using 'default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a 'judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data is linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general. Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-to-speech synthesisers, and were ranked fifth most natural out of six.
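The 'default duration' fallback can be sketched roughly as follows. This is a simplified illustration, not the thesis's pattern matching algorithm: the manner classes, segment symbols, duration values (in ms) and function names are all hypothetical, and when several entries share a class pattern the first one seen is kept.

```python
# Hypothetical mapping from segments to articulation manner classes.
MANNER = {
    "p": "plosive", "t": "plosive", "k": "plosive",
    "s": "fricative", "f": "fricative",
    "a": "vowel", "i": "vowel",
}


def to_classes(segments):
    """Re-categorise a segment sequence by articulation manner."""
    return tuple(MANNER[s] for s in segments)


def build_defaults(timing_db):
    """Create default-duration entries from existing timing data."""
    defaults = {}
    for segments, durations in timing_db.items():
        # Keep the first durations seen for each class pattern.
        defaults.setdefault(to_classes(segments), durations)
    return defaults


def lookup(segments, timing_db, defaults):
    """Return exact durations if present, else the class-level default."""
    key = tuple(segments)
    if key in timing_db:
        return timing_db[key]
    return defaults.get(to_classes(key))  # None if no default exists


# Hypothetical timing data: segment sequence -> durations in ms.
timing_db = {("p", "a"): (60, 120), ("s", "i"): (90, 100)}
defaults = build_defaults(timing_db)
```

With this data, `lookup(["t", "a"], timing_db, defaults)` has no exact entry and falls back to the plosive-vowel default taken from `("p", "a")`.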
SMaTTS: standard malay text to speech system
This paper presents SMaTTS, a rule-based text-to-speech (TTS) synthesis system for Standard Malay (SM). The proposed system uses a sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of computer memory used, making the system very light and embeddable. The overall system comprises two phases. The first is Natural Language Processing (NLP), which consists of the high-level processing of text analysis, phonetic analysis, text normalization and a morphophonemic module. This module was designed specially for SM to overcome a few problems in defining the rules of the SM orthography system before the text is passed to the DSP module. The second phase is Digital Signal Processing (DSP), which operates on the low-level process of speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesis system with a light and user-friendly Graphical User Interface (GUI) is introduced. An SM phoneme set and an inclusive phone database have been constructed carefully for this phone-based speech synthesizer. By applying generative phonology, comprehensive letter-to-sound (LTS) rules and a pronunciation lexicon have been developed for SMaTTS. For evaluation, a Diagnostic Rhyme Test (DRT) word list was compiled, and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system and the room for improvement are discussed.
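As a brief illustration of the MOS analysis mentioned above, a Mean Opinion Score is the arithmetic mean of listener ratings. The sketch below assumes the conventional 1 to 5 rating scale, which the abstract does not state; the function name and the ratings are hypothetical.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Hypothetical ratings from four listeners for one synthesized utterance.
score = mean_opinion_score([4, 3, 5, 4])
```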
Speech synthesis based on a harmonic model
The wide range of potential commercial applications for a computer system capable of automatically converting text to speech (TTS) has stimulated decades of research.
One of the currently most successful approaches to synthesising speech, concatenative TTS synthesis, combines prerecorded speech units to build full utterances. However, the prosody of the stored units is often not consistent with that of the target utterance and must be altered. Furthermore, several types of mismatch can occur at unit boundaries and must be smoothed. Thus, pitch and time-scale modification techniques as well as smoothing algorithms play a critical role in all concatenative-based systems.
This thesis presents the development of a concatenative TTS system based on a harmonic model and incorporating new pitch and time-scaling as well as smoothing algorithms.
Experiment has shown our system capable of both very high quality prosodic modification and synthesis. Results compare very favourably with those of existing state-of-the-art systems.
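For orientation, a generic harmonic model represents a voiced frame as a sum of sinusoids at integer multiples of the fundamental frequency. The sketch below shows only that textbook idea; it is not the thesis's pitch, time-scaling or smoothing algorithm, and the function name and parameter values are hypothetical.

```python
import math


def synthesize_harmonic(f0, amplitudes, phases, sample_rate, n_samples):
    """Sum harmonics k*f0 (k = 1, 2, ...) with the given amplitudes and phases."""
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = sum(
            a * math.cos(2 * math.pi * (k + 1) * f0 * t + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases))
        )
        frame.append(sample)
    return frame


# Hypothetical frame: 100 Hz fundamental, two harmonics, 10 ms at 8 kHz.
frame = synthesize_harmonic(100.0, [1.0, 0.5], [0.0, 0.0], 8000, 80)
```

Calling the function again with a different `f0` changes the pitch, but a practical system would also resample the harmonic amplitudes so that the spectral envelope is preserved; that step is omitted here.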
A Prosodic Turkish text-to-speech synthesizer
Naturalness is very important in Text-to-Speech systems for achieving a high-quality waveform. The naturalness of the waveform is highly correlated with phonetic coverage and with prosodic features such as duration and F0 contour. Duration determines the timing of each synthesized phoneme, whereas the F0 contour determines the fundamental frequency component of the waveform. This thesis presents the development of a prosodic Text-to-Speech system for the Turkish language using the Festival Tool [31]. We describe a complete realization of a new male voice, covering the allophones of Turkish and using duration and F0 parameters. The duration of the allophones and word stress have been studied extensively. Sentence stress and phrasal stress are also discussed, but in less detail. Carrier words are designed for approximately all allophone-allophone combinations. 1680 carrier words are recorded in a sound-proof recording studio. LPC (linear predictive coding) and RES (residual) parameters are computed. The text normalisation module is implemented for abbreviations and numbers. Durations for the allophones are entered. Sentence-level and word-level F0 generation modules are implemented. By increasing the number of phonemes and adding prosody, we obtained a more natural-sounding Text-to-Speech system for the Turkish language.
TEXT-TO-SPEECH SYNTHESIS: A PROTOTYPE SYSTEM FOR CROATIAN LANGUAGE
This paper presents the development of a Croatian text-to-speech system capable of synthesizing speech from arbitrary text. The input text, which must be in normalized form, is first transcribed into a phonetic string (grapheme-to-phoneme conversion), and an audio signal is then generated from that string. The synthesis procedure is based on concatenating smaller acoustic units of speech, diphones, using the TD-PSOLA method. A Croatian language diphone database was built for the system, and a procedure for automatic selection of diphones from a spoken corpus is proposed.
The quality of the implemented procedure was examined through a survey in which subjects gave subjective ratings of the quality of the resulting speech, which also served to verify its intelligibility.
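The step from a phonetic string to a sequence of diphones can be sketched as follows. This is a minimal illustration assuming the phonemes are already available as a list and that utterance boundaries are marked with a silence symbol; the function name, the `_` symbol and the `a-b` naming of units are hypothetical conventions, not those of the paper.

```python
def to_diphones(phonemes, silence="_"):
    """Pair each phoneme with its successor, padding with silence at both ends."""
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Illustrative input: the phonemes of the Croatian word "dan".
units = to_diphones(["d", "a", "n"])
```

Each resulting unit name would then be looked up in the diphone database, and the retrieved waveforms concatenated and prosodically adjusted with TD-PSOLA.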
- …