
    Intonation modeling for Indian languages

    In this paper we propose models for predicting the intonation of the sequence of syllables present in an utterance. The term intonation refers to the temporal changes of the fundamental frequency (F0). Neural networks are used to capture the implicit intonation knowledge in the sequence of syllables of an utterance. We focus on the development of intonation models for predicting the sequence of fundamental frequency values for a given sequence of syllables. Labeled broadcast news data in Hindi, Telugu and Tamil is used to develop neural network models that predict the F0 of syllables in these languages. The input to the neural network is a feature vector representing positional, contextual and phonological constraints. The interaction between duration and intonation constraints can be exploited to improve accuracy further. From the studies we find that 88% of the F0 (pitch) values of the syllables could be predicted by the models within 15% of the actual F0. The performance of the intonation models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and correlation coefficient (γ). The prediction accuracy of the intonation models is further evaluated using listening tests. The prediction performance of the proposed neural network intonation models is compared with Classification and Regression Tree (CART) models.
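
    As a rough illustration of the objective measures listed above, the Python sketch below computes the average prediction error (μ), the standard deviation of the error (σ), the correlation coefficient (γ), and the percentage of syllables predicted within 15% of the actual F0. The function name and the toy F0 values are invented for the example and are not from the paper.

    import numpy as np

    def evaluate_f0_prediction(f0_true, f0_pred):
        """Objective measures named in the abstract: average prediction
        error (mu), standard deviation of the error (sigma), correlation
        coefficient (gamma), and the percentage of syllables whose
        predicted F0 falls within 15% of the actual F0."""
        f0_true = np.asarray(f0_true, dtype=float)
        f0_pred = np.asarray(f0_pred, dtype=float)
        err = np.abs(f0_true - f0_pred)

        mu = err.mean()                                      # average prediction error (Hz)
        sigma = err.std()                                    # standard deviation of the error
        gamma = np.corrcoef(f0_true, f0_pred)[0, 1]          # correlation coefficient
        within_15 = 100.0 * np.mean(err <= 0.15 * f0_true)   # % within 15% of actual F0

        return {"mu": mu, "sigma": sigma, "gamma": gamma, "pct_within_15": within_15}

    # Toy example with made-up per-syllable F0 values (Hz).
    actual    = [210.0, 180.0, 150.0, 230.0, 190.0]
    predicted = [200.0, 185.0, 160.0, 220.0, 170.0]
    print(evaluate_f0_prediction(actual, predicted))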

    SMaTTS: standard malay text to speech system

    This paper presents a rule-based text-to-speech (TTS) synthesis system for Standard Malay, namely SMaTTS. The proposed system uses the sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of computer memory used, making the system very light and embeddable. The overall system comprises two phases. The first is the Natural Language Processing (NLP) phase, consisting of the high-level processing of text analysis, phonetic analysis, text normalization and a morphophonemic module. The module was designed specially for SM to overcome a few problems in defining rules for the SM orthography system before the output is passed to the DSP module. The second phase is the Digital Signal Processing (DSP) phase, which operates on the low-level process of speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesis system with a light and user-friendly Graphical User Interface (GUI) is introduced. A Standard Malay (SM) phoneme set and an inclusive phone database have been constructed carefully for this phone-based speech synthesizer. By applying generative phonology, comprehensive letter-to-sound (LTS) rules and a pronunciation lexicon have been developed for SMaTTS. For the evaluation tests, a Diagnostic Rhyme Test (DRT) word list was compiled and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system, as well as the room for improvement, is thoroughly discussed.
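
    The letter-to-sound stage described above can be illustrated with a toy longest-match rule pass in Python. The grapheme-to-phoneme rules below are placeholders chosen for illustration only, not the actual Standard Malay rules developed for SMaTTS.

    # Minimal letter-to-sound (LTS) sketch in the spirit of a rule-based
    # front end. The rules are illustrative placeholders.
    RULES = {
        "ng": "N",   # digraph handled before single letters (longest match first)
        "sy": "S",
        "a": "a", "e": "@", "i": "i", "u": "u", "o": "o",
        "k": "k", "s": "s", "n": "n", "g": "g", "y": "j",
        "t": "t", "r": "r", "m": "m", "b": "b", "l": "l",
    }

    def letter_to_sound(word: str) -> list[str]:
        """Greedy longest-match application of the LTS rules."""
        phones, i = [], 0
        keys = sorted(RULES, key=len, reverse=True)
        while i < len(word):
            for k in keys:
                if word.startswith(k, i):
                    phones.append(RULES[k])
                    i += len(k)
                    break
            else:
                i += 1  # skip characters with no rule
        return phones

    print(letter_to_sound("bunga"))   # ['b', 'u', 'N', 'a']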

    A preliminary bibliography on focus

    [I]n its present form, the bibliography contains approximately 1100 entries. Bibliographical work is never complete, and the present one is still modest in a number of respects. It is not annotated, and it still contains a lot of mistakes and inconsistencies. It has nevertheless reached a stage which justifies considering the possibility of making it available to the public. The first step towards this is its pre-publication in the form of this working paper. […] The bibliography is less complete for earlier years. For works before 1970, the bibliographies of Firbas and Golkova (1975) and Tyl (1970), which have not been included here, may be consulted.

    Contribution of voice fundamental frequency and formants to the identification of speaker's gender

    Identification of gender from speech sounds has been found to rely on a speaker's voice fundamental frequency (F0) and formant frequencies. The present study examines the contribution of F0 and formants to the correct detection of a speaker's gender. Based on a vowel sustained by a male and a female speaker, 200 vowels were synthesized with a range of F0-formant combinations. The synthesized vowels were presented to 28 native Cantonese-speaking listeners, who judged the perceived speaker gender for each of the synthesized stimuli. Results revealed that F0 was the primary cue for gender perception, while formants contributed little. The cutoff F0 values for male and female identification were found to be 162.01 Hz and 204.97 Hz, respectively. When F0 was below 162.01 Hz or above 204.97 Hz, listeners reliably and correctly identified the speaker as male or female, respectively.
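
    The reported cutoffs suggest a simple decision rule, sketched below in Python. Treating the region between the two cutoffs as "ambiguous" is an assumption for illustration, since the study only reports reliable identification outside that range.

    def perceived_gender(f0_hz: float,
                         male_cutoff: float = 162.01,
                         female_cutoff: float = 204.97) -> str:
        """Decision rule implied by the reported cutoffs: listeners
        reliably heard 'male' below 162.01 Hz and 'female' above
        204.97 Hz; in between, perception is treated as ambiguous."""
        if f0_hz < male_cutoff:
            return "male"
        if f0_hz > female_cutoff:
            return "female"
        return "ambiguous"

    for f0 in (120.0, 180.0, 230.0):
        print(f0, "->", perceived_gender(f0))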

    Speech Communication

    Contains research objectives and summary of research on three research projects and reports on three research projects. National Institutes of Health (Grant 5 RO1 NS04332-12); U.S. Navy Office of Naval Research (Contract ONR N00014-67-A-0204-0069); Joint Services Electronics Program (Contract DAAB07-74-C-0630); National Institutes of Health (Grant 2 RO1 NS04332-11).

    Text to speech for Bangla language using festival

    In this paper, we present a Text to Speech (TTS) synthesis system for the Bangla language using the open-source Festival TTS engine. Festival is a complete TTS synthesis system, with components supporting front-end processing of the input text, language modeling, and speech synthesis using its signal processing module. The Bangla TTS system proposed here creates the voice data for Festival and additionally extends Festival through its embedded Scheme scripting interface to incorporate Bangla language support. Festival is a concatenative TTS system using diphone or other unit-selection speech units. Our TTS implementation uses two of the concatenative methods supported in Festival: unit selection and multisyn unit selection. The function of a text-to-speech system is to convert language text into its spoken equivalent through a series of modules. The modules constituting the TTS system are described in detail, which should be helpful for future development. Finally, the quality of the synthesized speech is assessed in terms of acceptability and intelligibility.
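
    For readers who want to drive a Festival voice from a script, a hedged Python sketch using Festival's text2wave helper is shown below. It assumes text2wave is on the PATH, and "voice_bangla_unit" is a placeholder name rather than the voice actually built in the paper.

    import subprocess
    import tempfile
    from pathlib import Path

    def synthesize(text: str, wav_path: str, voice: str = "voice_bangla_unit") -> None:
        """Write the text to a temp file and call Festival's text2wave,
        selecting the voice via an -eval Scheme expression."""
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                         encoding="utf-8") as f:
            f.write(text)
            txt_path = f.name
        try:
            subprocess.run(
                ["text2wave", txt_path, "-o", wav_path, "-eval", f"({voice})"],
                check=True,
            )
        finally:
            Path(txt_path).unlink(missing_ok=True)

    # synthesize("আমার সোনার বাংলা", "out.wav")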

    Consonantal F0 perturbation in American English involves multiple mechanisms.

    In this study, we revisit consonantal perturbation of F0 in English, taking into particular consideration the effect of alignment of F0 contours to segments and the F0 extraction method in the acoustic analysis. We recorded words differing in consonant voicing, manner of articulation, and position in syllable, spoken by native speakers of American English in both statements and questions. In the analysis, we compared methods of F0 alignment and found that the highest F0 consistency occurred when F0 contours were time-normalized to the entire syllable. Applying this method, along with using syllables with nasal consonants as the baseline and a fine-detailed F0 extraction procedure, we identified three distinct consonantal effects: a large but brief (10-40 ms) F0 raising at voice onset regardless of consonant voicing, a smaller but longer-lasting F0 raising effect by voiceless consonants throughout a large proportion of the following vowels, and a small lowering effect of around 6 Hz by voiced consonants, which was not found in previous studies. Additionally, a brief anticipatory effect was observed before a coda consonant. These effects are imposed on a continuously changing F0 curve that is either rising-falling or falling-rising, depending on whether the carrier sentence is a statement or a question.
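
    The time-normalization step, in which F0 contours are resampled over the whole syllable, can be sketched as follows in Python. The frame step, the number of normalized points and the toy rising contour are assumptions for illustration, not the paper's exact procedure.

    import numpy as np

    def time_normalize_f0(f0: np.ndarray, times: np.ndarray,
                          syl_start: float, syl_end: float,
                          n_points: int = 20) -> np.ndarray:
        """Keep only the F0 frames inside the syllable and resample them
        onto a fixed number of proportional time points, so contours
        from syllables of different durations can be compared."""
        inside = (times >= syl_start) & (times <= syl_end)
        t, f = times[inside], f0[inside]
        rel = (t - syl_start) / (syl_end - syl_start)   # proportional position, 0..1
        grid = np.linspace(0.0, 1.0, n_points)
        return np.interp(grid, rel, f)

    # Toy example: a 0.25 s syllable sampled every 10 ms with a rising F0.
    times = np.arange(0.0, 0.25, 0.01)
    f0 = np.linspace(110.0, 140.0, times.size)
    print(time_normalize_f0(f0, times, 0.0, 0.25, n_points=10))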

    Text to speech synthesis for Ghanaian local languages (Twi)

    Thesis submitted to the Department of Computer Science, Ashesi University College, in partial fulfillment of the Bachelor of Science degree in Computer Science, April 2016. Businesses in Ghana communicate with each other through letters, telephone, fax, and email. Due to the growth of technology, most of these businesses provide their services online to make them convenient for customers. However, these media of communication have limitations when it comes to language: when the parties involved speak and understand different languages, the message becomes difficult to understand. Businesses would like to reach out to these people in a language they can understand. Statistics show that almost half of the Ghanaian adult population (aged 15 and above) is illiterate, yet these adults also need to do business to take care of their families. The challenge is that, due to their inability to understand the message, whether in text or audio, they tend to be left out in this age of technological growth. As a solution, this paper develops a text-to-speech application system in Twi which can be used to create prompts for Interactive Voice Response.

    Focus perception in Japanese: Effects of lexical accent and focus location.

    This study explored the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Using a 4AFC identification task, we compared native Japanese listeners' focus identification accuracy across lexical accent × focus location conditions using resynthesised speech stimuli that varied only in fundamental frequency. Experiment 1 compared identification accuracy across these conditions using both natural and resynthesised stimuli. The results showed that focus identification rates were similar for the two stimulus types, establishing the reliability of the resynthesised stimuli. Experiment 2 explored these conditions further using only resynthesised stimuli. Narrow foci bearing the lexical pitch accent were always identified more accurately than unaccented ones, whereas the identification rate for final focus was the lowest among all focus locations. From these results, we argue that the difficulty of focus perception in Japanese is attributable to (i) the blocking of post-focus compression (PFC) by unaccented words, and (ii) the similarity in F0 contours between lexical pitch accent and narrow focus, including in particular the similarity between downstep and PFC. Focus perception is therefore contingent on other concurrent communicative functions, which may sometimes take precedence in a +PFC language.
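
    Identification accuracy per lexical accent × focus location cell in a 4AFC task can be tabulated as in the Python sketch below. The trial records are fabricated placeholders that only show the shape of the computation, not data from the study.

    from collections import defaultdict

    # Placeholder trials: each record holds the stimulus condition and the
    # listener's response; a trial counts as correct when the response
    # matches the actual focus location.
    trials = [
        {"accent": "accented",   "focus": "initial", "response": "initial"},
        {"accent": "accented",   "focus": "final",   "response": "medial"},
        {"accent": "unaccented", "focus": "final",   "response": "final"},
        {"accent": "unaccented", "focus": "medial",  "response": "medial"},
    ]

    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        cell = (t["accent"], t["focus"])
        total[cell] += 1
        correct[cell] += int(t["response"] == t["focus"])

    for cell in sorted(total):
        print(cell, f"{100.0 * correct[cell] / total[cell]:.1f}% correct")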

    Voice Conversion by Prosody and Vocal Tract Modification

    In this paper we propose some flexible methods, which are useful in the process of voice conversion. The proposed methods modify the shape of the vocal tract system and the characteristics of the prosody according to the desired requirement. The shape of the vocal tract system is modified by shifting the major resonant frequencies (formants) of the short-term spectrum, and altering their bandwidths accordingly. In the case of prosody modification, the required durational and intonational characteristics are imposed on the given speech signal. In the proposed method, the prosodic characteristics are manipulated using instants of significant excitation. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations like the onset of a burst in the case of nonvoiced speech. Instants of significant excitation are computed from the Linear Prediction (LP) residual of the speech signals by using the property of the average group delay of minimum-phase signals. The manipulations of durational characteristics and pitch contour (intonation pattern) are achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified LP residual is used to excite the time-varying filter. The filter parameters are updated according to the desired vocal tract characteristics. The proposed methods are evaluated using listening tests.
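
    The LP-residual decomposition underlying the prosody modification described above can be sketched in Python with librosa and SciPy. The LPC order and the synthetic two-sinusoid frame are assumptions for illustration; epoch detection via average group delay and the actual formant and duration manipulations are not shown.

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    # Stand-in "voiced speech" frame: a 25 ms sum of two sinusoids at 16 kHz.
    sr = 16000
    t = np.arange(0, 0.025, 1.0 / sr)
    frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)

    order = 16                                  # LPC order (an assumption)
    a = librosa.lpc(frame, order=order)         # all-pole coefficients, a[0] == 1
    residual = lfilter(a, [1.0], frame)         # inverse filtering -> LP residual

    # Re-synthesis: excite the all-pole (vocal tract) filter with the residual.
    # Epoch-based duration/pitch manipulation of `residual` would go here.
    resynth = lfilter([1.0], a, residual)
    print("max reconstruction error:", float(np.max(np.abs(resynth - frame))))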