Intonation modeling for Indian languages
In this paper we propose models for predicting the intonation for the sequence of syllables present in the utterance. The term intonation refers to the temporal changes of the fundamental frequency (F0). Neural networks are used to capture the implicit intonation knowledge in the sequence of syllables of an utterance. We focus on the development of intonation models for predicting the sequence of fundamental frequency values for a given sequence of syllables. Labeled broadcast news data in the languages Hindi, Telugu and Tamil is used to develop neural network models in order to predict the F0 of syllables in these languages. The input to the neural network consists of a feature vector representing the positional, contextual and phonological constraints. The interaction between duration and intonation constraints can be exploited for improving the accuracy further. From the studies we find that 88% of the F0 values (pitch) of the syllables could be predicted from the models within 15% of the actual F0. The performance of the intonation models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and correlation coefficient (γ). The prediction accuracy of the intonation models is further evaluated using listening tests. The prediction performance of the proposed intonation models using neural networks is compared with Classification and Regression Tree (CART) models.
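The objective measures named in this abstract can be illustrated with a minimal sketch. The definitions below are one plausible reading and not necessarily the paper's exact formulas: the prediction error is taken as the absolute difference between actual and predicted F0, σ as the population standard deviation of those errors, and γ as the Pearson correlation. The function name `intonation_metrics` and the `tolerance` parameter are hypothetical.

```python
import math


def intonation_metrics(actual, predicted, tolerance=0.15):
    """Summarise how well predicted F0 values (Hz) match actual ones.

    Assumed definitions (not confirmed by the abstract):
      mu     = mean absolute prediction error
      sigma  = population standard deviation of the absolute errors
      gamma  = Pearson correlation between actual and predicted values
      within = percentage of syllables whose error is at most
               `tolerance` times the actual F0
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(actual)
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(errors) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / n)

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    gamma = cov / math.sqrt(var_a * var_p)

    hits = sum(1 for a, e in zip(actual, errors) if e <= tolerance * a)
    return {"mu": mu, "sigma": sigma, "gamma": gamma, "within": 100.0 * hits / n}
```

Under these assumptions, the reported figure of 88% of F0 values within 15% of the actual F0 would correspond to `within` being 88 at `tolerance=0.15`.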
SMaTTS: standard malay text to speech system
This paper presents a rule-based text-to-speech (TTS) synthesis system for Standard Malay (SM), namely SMaTTS. The proposed system uses a sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of computer memory used, making the system very light and embeddable. The overall system comprises two phases. The first is Natural Language Processing (NLP), which consists of the high-level processing of text analysis, phonetic analysis, text normalization and a morphophonemic module. The module was designed specifically for SM to overcome a few problems in defining the rules for the SM orthography system before the text is passed to the DSP module. The second phase is Digital Signal Processing (DSP), which operates on the low-level process of speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesis system with a light and user-friendly Graphical User Interface (GUI) is introduced. An SM phoneme set and an inclusive phone database have been constructed carefully for this phone-based speech synthesizer. By applying generative phonology, comprehensive letter-to-sound (LTS) rules and a pronunciation lexicon have been developed for SMaTTS. For the evaluation tests, a Diagnostic Rhyme Test (DRT) word list was compiled, and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system, as well as the room for improvement, is thoroughly discussed.
A preliminary bibliography on focus
[I]n its present form, the bibliography contains approximately 1100 entries. Bibliographical work is never complete, and the present one is still modest in a number of respects. It is not annotated, and it still contains a lot of mistakes and inconsistencies. It has nevertheless reached a stage which justifies considering the possibility of making it available to the public. The first step towards this is its pre-publication in the form of this working paper. […]
The bibliography is less complete for earlier years. For works before 1970, the bibliographies of Firbas and Golkova 1975 and Tyl 1970 may be consulted, which have not been included here
Contribution of voice fundamental frequency and formants to the identification of speaker's gender
Identification of gender from speech sounds has been found to rely on speakers’ voice fundamental frequency (F0) and formant frequencies. The present study aims at examining the contribution of F0 and formants to the correct detection of speaker’s gender. Based on the vowel sustained by a male and female speaker, 200 vowels were synthesized with a range of F0-formant combinations. The synthesized vowels were presented to 28 native Cantonese-speaking listeners to judge the perceived speakers’ gender for each of the synthesized stimuli. Results revealed that F0 was the primary cue for speakers’ gender perception while formants contributed little. The cutoff F0 values for male and female identification were found to be 162.01 Hz and 204.97 Hz, respectively. When F0 was below 162.01 Hz or above 204.97 Hz, listeners reliably and correctly identified the speakers as male or female, respectively.
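The reported cutoffs amount to a simple threshold rule, sketched below. The two cutoff values come from the abstract. The function name `perceived_gender` and the "ambiguous" label for the range between the cutoffs are assumptions made for illustration; the abstract does not say how listeners responded in that range.

```python
MALE_CUTOFF_HZ = 162.01    # from the abstract: below this, heard as male
FEMALE_CUTOFF_HZ = 204.97  # from the abstract: above this, heard as female


def perceived_gender(f0_hz):
    """Map an F0 value (Hz) to the gender listeners reliably reported.

    The "ambiguous" label for values between the two cutoffs is an
    assumption of this sketch, not a finding stated in the abstract.
    """
    if f0_hz < MALE_CUTOFF_HZ:
        return "male"
    if f0_hz > FEMALE_CUTOFF_HZ:
        return "female"
    return "ambiguous"
```

Formants are deliberately left out of the rule, since the study found that they contributed little to listeners' judgements.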
Speech Communication
Contains research objectives and summary of research on three research projects and reports on three research projects. National Institutes of Health (Grant 5 RO1 NS04332-12); U.S. Navy Office of Naval Research (Contract ONR N00014-67-A-0204-0069); Joint Services Electronics Program (Contract DAAB07-74-C-0630); National Institutes of Health (Grant 2 RO1 NS04332-11)
Text to speech for Bangla language using festival
Includes bibliographical references (pages 6-7). In this paper, we present a Text to Speech (TTS) synthesis system for the Bangla language using the open-source Festival TTS engine. Festival is a complete TTS synthesis system, with components supporting front-end processing of the input text, language modeling, and speech synthesis using its signal processing module. The Bangla TTS system proposed here creates the voice data for Festival and additionally extends Festival, using its embedded Scheme scripting interface, to incorporate Bangla language support. Festival is a concatenative TTS system using diphone or other unit selection speech units. Our TTS implementation uses two of the concatenative methods supported in Festival: unit selection and multisyn unit selection. The function of a Text-to-Speech system is to convert text in some language into its spoken equivalent through a series of modules. These modules, which constitute the TTS system, are described in detail to support future development. Finally, the quality of the synthesized speech is assessed in terms of acceptability and intelligibility.
Consonantal F0 perturbation in American English involves multiple mechanisms.
In this study, we revisit consonantal perturbation of F0 in English, taking into particular consideration the effect of alignment of F0 contours to segments and the F0 extraction method in the acoustic analysis. We recorded words differing in consonant voicing, manner of articulation, and position in syllable, spoken by native speakers of American English in both statements and questions. In the analysis, we compared methods of F0 alignment and found that the highest F0 consistency occurred when F0 contours were time-normalized to the entire syllable. Applying this method, along with using syllables with nasal consonants as the baseline and a fine-detailed F0 extraction procedure, we identified three distinct consonantal effects: a large but brief (10-40 ms) F0 raising at voice onset regardless of consonant voicing, a smaller but longer-lasting F0 raising effect by voiceless consonants throughout a large proportion of the following vowels, and a small lowering effect of around 6 Hz by voiced consonants, which was not found in previous studies. Additionally, a brief anticipatory effect was observed before a coda consonant. These effects are imposed on a continuously changing F0 curve that is either rising-falling or falling-rising, depending on whether the carrier sentence is a statement or a question
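Time-normalizing an F0 contour to the syllable can be sketched as resampling the contour at a fixed number of equally spaced points between the syllable's first and last F0 samples. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `time_normalize`, the default of 10 points, and the use of linear interpolation are all choices made for this sketch.

```python
def time_normalize(times, f0, n_points=10):
    """Resample an F0 contour at n_points equally spaced times.

    times: strictly increasing sample times (s) spanning one syllable
    f0:    F0 values (Hz) at those times
    Returns n_points values obtained by linear interpolation, so that
    syllables of different durations become directly comparable.
    """
    if len(times) != len(f0) or len(times) < 2:
        raise ValueError("need at least two (time, F0) samples")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    start, end = times[0], times[-1]
    out = []
    j = 0  # index of the segment [times[j], times[j + 1]] containing t
    for k in range(n_points):
        t = start + (end - start) * k / (n_points - 1)
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(f0[j] + w * (f0[j + 1] - f0[j]))
    return out
```

Once every syllable has the same number of points, contours can be averaged or subtracted point by point, for example against a nasal-consonant baseline as the abstract describes.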
Text to speech synthesis for Ghanaian local languages (Twi)
Thesis submitted to the Department of Computer Science, Ashesi University College, in partial fulfillment of Bachelor of Science degree in Computer Science, April 2016. Businesses in Ghana communicate with each other through letters, telephone, fax and email. With the growth of technology, most of these businesses provide their services online for the convenience of customers. However, these media of communication have limitations when it comes to language. When the parties involved speak and understand different languages, communication becomes difficult. Businesses would like to reach these people in a language they can understand. Statistics have shown that almost half of the Ghanaian adult population (aged 15 and above) is illiterate, yet they too need to do business to take care of their families. The challenge is that, because they cannot understand the message, whether in text or audio, they tend to be left out in this age of technological growth. As a solution, this paper intends to develop a text-to-speech application system in Twi which can be used to develop prompts for Interactive Voice Response. Ashesi University College
Focus perception in Japanese: Effects of lexical accent and focus location.
This study explored the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Using a 4AFC identification task, we compared native Japanese listeners' focus identification accuracy in different lexical accent × focus location conditions using resynthesised speech stimuli, which varied only in fundamental frequency. Experiment 1 compared the identification accuracy in lexical accent × focus location conditions using both natural and resynthesised stimuli. The results showed that focus identification rates were similar with the two stimulus types, thus establishing the reliability of the resynthesised stimuli. Experiment 2 explored these conditions further using only resynthesised stimuli. Narrow foci bearing the lexical pitch accent were always more correctly identified than unaccented ones, whereas the identification rate for final focus was the lowest among all focus locations. From these results, we argue that the difficulty of focus perception in Japanese is attributed to (i) the blocking of PFC by unaccented words, and (ii) similarity in F0 contours between lexical pitch accent and narrow focus, including in particular the similarity between downstep and PFC. Focus perception is therefore contingent on other concurrent communicative functions which may sometimes take precedence in a +PFC language
Voice Conversion by Prosody and Vocal Tract Modification
In this paper we propose some flexible methods that are useful in the process of voice conversion. The proposed methods modify the shape of the vocal tract system and the characteristics of the prosody according to the desired requirement. The shape of the vocal tract system is modified by shifting the major resonant frequencies (formants) of the short-term spectrum and altering their bandwidths accordingly. In the case of prosody modification, the required durational and intonational characteristics are imposed on the given speech signal. In the proposed method, the prosodic characteristics are manipulated using instants of significant excitation. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, such as the onset of a burst, in the case of nonvoiced speech. Instants of significant excitation are computed from the Linear Prediction (LP) residual of the speech signals by using the property of average group delay of minimum phase signals. The manipulations of durational characteristics and pitch contour (intonation pattern) are achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified LP residual is used to excite the time-varying filter. The filter parameters are updated according to the desired vocal tract characteristics. The proposed methods are evaluated using listening tests.
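Shifting a formant and altering its bandwidth can be illustrated on a single second-order resonator, using the standard relation between a pole pair and a formant: pole radius r = exp(-πB/fs) and pole angle θ = 2πF/fs, giving the denominator 1 + a1·z⁻¹ + a2·z⁻² with a1 = -2r·cos θ and a2 = r². The sketch below is not the paper's method, which operates on the full LP spectrum; the function names and the scale parameters are hypothetical.

```python
import math


def formant_to_coeffs(freq_hz, bandwidth_hz, fs_hz):
    """Denominator coefficients (a1, a2) of a two-pole resonator."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)  # pole radius
    theta = 2.0 * math.pi * freq_hz / fs_hz        # pole angle
    return -2.0 * r * math.cos(theta), r * r


def coeffs_to_formant(a1, a2, fs_hz):
    """Recover (frequency, bandwidth) in Hz from a complex pole pair."""
    r = math.sqrt(a2)
    cos_theta = max(-1.0, min(1.0, -a1 / (2.0 * r)))  # guard rounding
    theta = math.acos(cos_theta)
    freq_hz = theta * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r) * fs_hz / math.pi
    return freq_hz, bandwidth_hz


def shift_formant(a1, a2, fs_hz, freq_scale=1.0, bandwidth_scale=1.0):
    """Scale one resonance's frequency and bandwidth.

    The scaled frequency must stay below fs_hz / 2 for the result to
    describe a valid resonance.
    """
    freq_hz, bandwidth_hz = coeffs_to_formant(a1, a2, fs_hz)
    return formant_to_coeffs(freq_hz * freq_scale,
                             bandwidth_hz * bandwidth_scale, fs_hz)
```

For instance, `shift_formant(a1, a2, 8000, freq_scale=1.2)` moves a 500 Hz resonance to 600 Hz while leaving its bandwidth unchanged.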