
    A Finite State and Data-Oriented Method for Grapheme to Phoneme Conversion

    A finite-state method, based on leftmost longest-match replacement, is presented for segmenting words into graphemes and for converting graphemes into phonemes. A small set of hand-crafted conversion rules for Dutch achieves a phoneme accuracy of over 93%. The accuracy of the system is further improved by using transformation-based learning. The best system (using a large set of rule templates and a 'lazy' variant of Brill's algorithm), trained on only 40K words, reaches 99% phoneme accuracy.
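
    As an illustration of the core mechanism, here is a minimal Python sketch of leftmost longest-match grapheme segmentation followed by rule lookup. The grapheme inventory and the Dutch-like rules are illustrative assumptions, not the paper's actual rule set, and the real system additionally applies transformation-based learning on top.

        GRAPHEMES = {"sch", "ng", "ch", "ij", "oe", "aa", "ee",
                     "a", "e", "i", "o", "u", "s", "c", "h", "n",
                     "g", "j", "t", "r", "k", "l", "m", "p"}
        RULES = {"sch": "sx", "ng": "N", "ch": "x", "ij": "Ei",
                 "oe": "u", "aa": "a:", "ee": "e:"}  # fallback: identity

        def segment(word):
            """Split a word into graphemes, always taking the longest
            match at the leftmost position."""
            out, i = [], 0
            while i < len(word):
                for length in range(min(4, len(word) - i), 0, -1):
                    if word[i:i + length] in GRAPHEMES:
                        out.append(word[i:i + length])
                        i += length
                        break
                else:  # unknown character: pass it through as-is
                    out.append(word[i])
                    i += 1
            return out

        def to_phonemes(word):
            return [RULES.get(g, g) for g in segment(word)]

        print(segment("schaap"))      # ['sch', 'aa', 'p']
        print(to_phonemes("schaap"))  # ['sx', 'a:', 'p']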

    Text Preprocessing for Speech Synthesis

    In this paper we describe our text preprocessing modules for English text-to-speech synthesis. These modules comprise rule-based text normalization (subsuming sentence segmentation and normalization of non-standard words), statistical part-of-speech tagging, and statistical syllabification, grapheme-to-phoneme conversion, and word stress assignment, the latter relying in part on rule-based morphological analysis.
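
    As a sketch of the first of these modules, the snippet below expands two classes of non-standard words (abbreviations and digit strings) by rule. The patterns and the tiny expansion tables are illustrative assumptions only; the paper's normalizer covers far more classes.

        import re

        ONES = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
        ABBREV = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

        def expand_number(tok):
            # Digit-by-digit expansion; a full module would handle cardinals,
            # ordinals, years, currency amounts, and so on.
            return " ".join(ONES[int(d)] for d in tok)

        def normalize(text):
            out = []
            for tok in text.split():
                if tok in ABBREV:
                    out.append(ABBREV[tok])
                elif re.fullmatch(r"\d+", tok):
                    out.append(expand_number(tok))
                else:
                    out.append(tok)
            return " ".join(out)

        print(normalize("Dr. Smith lives at 42 Main St."))
        # Doctor Smith lives at four two Main Street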

    Synchronizing Keyframe Facial Animation to Multiple Text-to-Speech Engines and Natural Voice with Fast Response Time

    This thesis aims to create an automated lip-synchronization system for real-time applications. Specifically, the system is required to be fast, consist of a limited number of keyframes with small memory requirements, and create fluid and believable animations that synchronize with text-to-speech engines as well as raw voice data. The algorithms utilize traditional keyframe animation and a novel method of keyframe selection. Additionally, phoneme-to-keyframe mapping, synchronization, and simple blending rules are employed. The algorithms provide blending between keyframe images, borrow information from neighboring phonemes, accentuate the phonemes b, p, and m, differentiate between keyframes for phonemes with allophonic variations, and provide prosodic variation by including emotion while speaking. The lip-sync animation synchronizes with multiple synthesized voices and human speech. A fast and versatile online real-time Java chat interface is created to exhibit vivid facial animation. Results show that the animation algorithms are fast and produce accurate lip synchronization. Additionally, surveys showed that the animations are visually pleasing and improve speech understandability 96% of the time. Applications for this project include internet chat capabilities, interactive teaching of foreign languages, animated news broadcasting, enhanced game technology, and cell phone messaging.
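
    Below is a minimal sketch of the phoneme-to-keyframe mapping and blending idea, with keyframes reduced to a single "mouth openness" parameter. The viseme table, timings, and linear cross-fade are illustrative assumptions rather than the thesis's actual rules.

        # Mouth-openness value per phoneme; closed-lip b/p/m are pinned to 0
        # so they read clearly, echoing the accentuation rule above.
        VISEME = {"b": 0.0, "p": 0.0, "m": 0.0,
                  "a": 1.0, "e": 0.7, "o": 0.8, "s": 0.3, "t": 0.2}

        def blend_frames(phones, dur=0.1, fps=30):
            """Linearly cross-fade mouth openness between the keyframes
            of consecutive phonemes."""
            keys = [VISEME.get(p, 0.5) for p in phones]
            steps = int(dur * fps)
            frames = []
            for a, b in zip(keys, keys[1:]):
                for s in range(steps):
                    t = s / steps
                    frames.append((1 - t) * a + t * b)
            frames.append(keys[-1])
            return frames

        print([round(f, 2) for f in blend_frames(["m", "a", "t"])])
        # [0.0, 0.33, 0.67, 1.0, 0.73, 0.47, 0.2]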

    Nativization of English words in Spanish using analogy

    Modern speech technologies need to be flexible and adaptable to any framework. Mass-media globalization introduces the challenge of multilingualism into the most popular speech applications, such as text-to-speech synthesis and automatic speech recognition. Mixed-language texts vary in their nature, and some essential characteristics ought to be considered when they are processed. In Spain, as in other countries, the usage of English and other words of foreign origin is growing. The particularity of peninsular Spanish is a tendency to nativize the pronunciation of foreign words so that they fit properly into Spanish phonetics. In this work our goal was to approach the nativization challenge with data-driven methods, since they are transferable to other languages without sacrificing performance. Training and test corpora for nativization were manually crafted, and the experiments were carried out using pronunciation by analogy. The results obtained were encouraging and showed that even a small training corpus of 1,000 words yields a higher level of intelligibility for English inclusions in Spanish utterances.
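
    The following is a much-reduced sketch of the pronunciation-by-analogy idea: an unseen word is covered by substrings of lexicon entries, and the aligned phonemes of each match are stitched together. Real PbA builds a lattice of overlapping matches and scores complete paths; the greedy cover and the pre-aligned toy lexicon here are illustrative assumptions.

        LEXICON = {  # spelling -> one phoneme per letter (pre-aligned)
            "mat": ["m", "a", "t"],
            "tin": ["t", "I", "n"],
        }

        def by_analogy(word):
            """Greedily cover `word` with the longest chunks found inside
            lexicon entries, copying each match's aligned phonemes."""
            phones, i = [], 0
            while i < len(word):
                best = None
                for entry, pron in LEXICON.items():
                    for length in range(len(word) - i, 0, -1):
                        j = entry.find(word[i:i + length])
                        if j >= 0 and (best is None or length > len(best)):
                            best = pron[j:j + length]
                            break
                if best is None:   # no analogy found: skip this letter
                    i += 1
                    continue
                phones += best
                i += len(best)
            return phones

        print(by_analogy("matin"))  # ['m', 'a', 't', 'I', 'n']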

    Understanding DIBELS: Purposes, Limitations, and Alignment of Literacy Constructs to Subtest Measures

    DIBELS Next is frequently used as a universal screening and progress monitoring tool within a Response to Intervention (RTI) framework. Unfortunately, some misguided educational professionals are not utilizing the assessments as intended, resulting in defective instructional practices and faulty decision-making. For DIBELS to be used effectively, teachers must have advanced knowledge of assessment practices, understand data analysis and interpretation, and deliver instruction that can positively influence the reading development of at-risk learners. The intent of this project is to provide educators with an understanding of the appropriate uses and limitations of DIBELS. Additionally, this project sets out to align each DIBELS subtest with its corresponding literacy construct. The concepts of phonemic awareness, phonics, and reading fluency are fully defined, and general instructional recommendations are provided for each. Finally, a sample of teaching strategies that can be utilized to support the needs of students experiencing difficulties in each of these areas is highlighted.

    End-to-End Attention-based Large Vocabulary Speech Recognition

    Many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with acoustic modelling, language modelling, and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the desired character sequence is learned automatically by an attention mechanism built into the RNN. For each predicted character, the attention mechanism scans the input sequence and chooses relevant frames. We propose two methods to speed up this operation: limiting the scan to a subset of the most promising frames, and pooling the information contained in neighboring frames over time, thereby reducing the source sequence length. Integrating an n-gram language model into the decoding process yields recognition accuracies similar to other HMM-free RNN-based approaches.
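
    Here is a minimal NumPy sketch of the per-character attention read described above: score each encoded frame against the decoder state, softmax, and take the weighted sum. The windowing around the previous alignment position and the 2x pooling of neighboring frames mirror the two proposed speed-ups; the dimensions and the dot-product scorer are illustrative assumptions.

        import numpy as np

        def pool_time(frames, k=2):
            """Average every k neighboring frames, shortening the sequence."""
            T = (len(frames) // k) * k
            return frames[:T].reshape(-1, k, frames.shape[1]).mean(axis=1)

        def attend(state, frames, center=None, width=10):
            """One attention read, optionally restricted to a window of
            promising frames around the previous alignment position."""
            lo, hi = 0, len(frames)
            if center is not None:
                lo, hi = max(0, center - width), min(len(frames), center + width)
            window = frames[lo:hi]
            scores = window @ state                    # dot-product scoring
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()                   # softmax over frames
            context = weights @ window                 # weighted sum of frames
            return context, lo + int(weights.argmax())

        frames = pool_time(np.random.randn(100, 16))   # 100 frames -> 50
        context, pos = attend(np.random.randn(16), frames, center=5)
        print(context.shape, pos)                      # (16,) <position>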

    MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning

    In this paper, we present a methodology for linguistic feature extraction, focusing particularly on automatically syllabifying words in multiple languages, designed to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification (in the text and phonetic domains). The system was built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French, and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating valuable annotations, available online at https://github.com/noetits/MUST_P-SRL, that are ideal for speech representation learning, speech unit discovery, and the disentanglement of speech factors in several speech-related fields.
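
    Below is a minimal sketch of phonetic-domain syllabification under the maximal-onset principle, one classic way to realize the automatic syllabification described above (the paper's multilingual, MFA-compatible system is far richer); the phone classes and legal-onset list are illustrative assumptions.

        VOWELS = {"a", "e", "i", "o", "u", "@"}
        LEGAL_ONSETS = {(), ("s",), ("t",), ("r",),
                        ("s", "t"), ("t", "r"), ("s", "t", "r")}

        def coda_length(cluster):
            """Length of the coda once the next syllable has taken the
            longest legal onset from the intervocalic cluster."""
            for k in range(len(cluster), -1, -1):
                if tuple(cluster[len(cluster) - k:]) in LEGAL_ONSETS:
                    return len(cluster) - k
            return len(cluster)

        def syllabify(phones):
            nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
            sylls, start = [], 0
            for idx, n in enumerate(nuclei):
                if idx + 1 == len(nuclei):     # last nucleus: take the rest
                    sylls.append(phones[start:])
                    break
                cluster = phones[n + 1:nuclei[idx + 1]]
                end = n + 1 + coda_length(cluster)
                sylls.append(phones[start:end])
                start = end
            return sylls

        print(syllabify(list("astro")))  # [['a'], ['s', 't', 'r', 'o']]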

    Developing a Text to Speech System for Dzongkha

    Text-to-speech plays a vital role in imparting information to people who have difficulty reading text but can understand spoken language. In Bhutan, many people fall into this category with respect to the national language, Dzongkha, and a system of this kind will benefit the community. In addition, it will advance the language's digital evolution and help narrow the digital gap. It is all the more important in helping people with visual impairment. Text-to-speech systems are widely used, from talking bots to news readers and announcement systems. This paper presents an attempt at developing a working model of a text-to-speech system for the Dzongkha language. It also presents the development of a transcription (grapheme) table for phonetic transcription from Dzongkha text to its equivalent phone set. The transcription tables for both consonants and vowels have been prepared so as to facilitate compatibility in computing. A total of 3,000 sentences have been manually transcribed and recorded with a single male voice. The speech synthesis is based on a statistical method with concatenative speech generation on the FESTIVAL platform. The model is generated using CLUSTERGEN and CLUNITS, two variants of the FESTIVAL speech tools FESTVOX. The system prototype is the first of its kind for the Dzongkha language.
    Keywords: Natural Language Processing (NLP), Dzongkha, Text-to-Speech (TTS) system, statistical speech synthesis, phoneme, corpus, transcription
    DOI: 10.7176/CEIS/12-1-04
    Publication date: January 31st 202
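
    As a sketch of how a grapheme (transcription) table of the kind described above can drive phonetic transcription, consider the snippet below. The tiny Tibetan-script table and the inherent-vowel rule are illustrative assumptions, not the paper's actual Dzongkha table.

        TABLE = {"ཀ": "k", "ཁ": "kh", "ག": "g", "ང": "ng",   # consonants
                 "ི": "i", "ུ": "u", "ེ": "e", "ོ": "o"}        # vowel signs

        def transcribe(word):
            """Map each grapheme to its phone; a bare consonant gets the
            inherent vowel 'a', which a following vowel sign replaces."""
            out = []
            for ch in word:
                p = TABLE.get(ch)
                if p is None:
                    continue                    # skip marks not in the table
                if p in {"i", "u", "e", "o"}:
                    if out:
                        out[-1] = out[-1][:-1] + p  # vowel sign replaces 'a'
                else:
                    out.append(p + "a")         # consonant + inherent vowel
            return " ".join(out)

        print(transcribe("ཀི"))   # -> "ki"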