244 research outputs found

    A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture

    Speech synthesis is the artificial production of human speech. A typical text-to-speech system converts language text into a waveform. Many English TTS systems produce mature, natural, and human-like speech. In contrast, other languages, including Arabic, have not been considered until recently. Existing Arabic speech synthesis solutions are slow and of low quality, and their synthesized speech is less natural than that of English synthesizers. They also lack key speech factors such as intonation, stress, and rhythm. Different works have been proposed to solve these issues, including concatenative methods such as unit selection, and parametric methods. However, they require a lot of laborious work and domain expertise. Another reason for the poor performance of Arabic speech synthesizers is the lack of speech corpora, unlike English, which has many publicly available corpora and audiobooks. This work describes how to generate high-quality, natural, and human-like Arabic speech using an end-to-end deep neural network architecture. It uses only ⟨text, audio⟩ pairs with a relatively small amount of recorded audio, totaling 2.41 hours. It illustrates how to use English character embedding despite using diacritized Arabic characters as input, and how to preprocess the audio samples to achieve the best results.

    An HCI Speech-Based Architecture for Man-To-Machine and Machine-To-Man Communication in Yorùbá Language

    Man communicates with man by natural language, sign language, and/or gesture, but communicates with machine via electromechanical devices such as the mouse and keyboard. Recent research has achieved a high level of success in Man-To-Machine (M2M) communication using natural language, sign language, and/or gesture under constrained conditions. However, machine communication with man, in the reverse direction, using natural language is still in its infancy. Machine usually communicates with man in textual form. To achieve acceptable quality of end-to-end M2M communication, there is a need for a robust architecture on which to develop a novel speech-to-text and text-to-speech system. In this paper, an HCI speech-based architecture for Man-To-Machine and Machine-To-Man communication in the Yorùbá language is proposed, to carry Yorùbá people along in the advancement taking place in the world of Information Technology. Dynamic Time Warping is specified in the model to measure the similarity between the voice utterances in the sound library. In addition, Vector Quantization, the Gaussian Mixture Model, and the Hidden Markov Model are incorporated in the proposed architecture for compression and observation. This approach will yield a robust Speech-To-Text and Text-To-Speech system. Keywords: Yorùbá Language, Speech Recognition, Text-To-Speech, Man-To-Machine, Machine-To-Man
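
    The Dynamic Time Warping similarity measure mentioned in the abstract above can be sketched as follows. This is a minimal, textbook version and not the paper's implementation: the function name `dtw_distance`, the use of scalar sequences, and the absolute-difference local cost are all assumptions made for illustration. A real speech system would more likely compare sequences of feature vectors frame by frame.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Hypothetical helper for illustration; the local cost |x - y| is an assumption.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays on the same element
                cost[i][j - 1],      # b advances, a stays on the same element
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

    Because the alignment may repeat elements, a sequence and a time-stretched copy of it get a distance of zero, which is the property that makes the measure useful for utterances spoken at different speeds.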

    Marathi Speech Synthesis: A Review

    This paper seeks to reveal the various aspects of Marathi speech synthesis. It reviews research developments in international languages as well as Indian languages, and then centers on developments in Marathi with regard to other Indian languages. It is anticipated that this work will encourage further exploration of the Marathi language. DOI: 10.17762/ijritcc2321-8169.15064

    SMaTTS: Standard Malay Text to Speech System

    This paper presents a rule-based text-to-speech (TTS) synthesis system for Standard Malay, namely SMaTTS. The proposed system uses a sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of computer memory used, making the system very light and embeddable. The overall system comprises two phases. The first is Natural Language Processing (NLP), which consists of the high-level processing of text analysis, phonetic analysis, text normalization, and a morphophonemic module. The module was designed specially for Standard Malay (SM) to overcome a few problems in defining the rules for the SM orthography system before text is passed to the DSP module. The second phase is Digital Signal Processing (DSP), which operates on the low-level process of speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesis system with a light and user-friendly Graphical User Interface (GUI) is introduced. An SM phoneme set and an inclusive phone database have been constructed carefully for this phone-based speech synthesizer. By applying generative phonology, a comprehensive set of letter-to-sound (LTS) rules and a pronunciation lexicon have been devised for SMaTTS. For the evaluation tests, a Diagnostic Rhyme Test (DRT) word list was compiled, and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system, as well as the room for improvement, is thoroughly discussed.
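
    The Mean Opinion Score used in the evaluation above is an arithmetic mean of listener ratings. The sketch below assumes the common five-point scale (1 = bad, 5 = excellent); the abstract does not state which scale the authors used, and the function name `mean_opinion_score` is hypothetical.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings into a single MOS value.

    Assumes a 1-to-5 rating scale, which the abstract does not confirm.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the assumed 1-5 scale")
    return mean(ratings)
```

    For example, four listeners giving 4, 5, 3, and 4 would yield a MOS of 4.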

    Speaker Clustering for Multilingual Synthesis


    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, in addition to being scalable and extensible. The latter is reflected in the method's ability to be incremental in both processing resources and generating lexicons. First, tokens are drawn from a corpus and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers having 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized, and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.

    Duration modeling using DNN for Arabic speech synthesis

    Duration modeling is a key task for every parametric speech synthesis system. Though such parametric systems have been adapted to many languages, no special attention has been paid to explicitly handling Arabic speech characteristics. In Arabic, phoneme duration plays a distinctive role because of consonant gemination and vowel quantity. Therefore, precise modeling of sound durations is critical. In this paper we compare several approaches to modeling phoneme durations (including duration modeling by the HTS and MERLIN toolkits), and we propose a new approach that relies on a set of models, each one optimal for a given phoneme class (e.g., simple consonants, geminated consonants, short vowels, and long vowels). An objective evaluation carried out on a set of test sentences shows that the proposed approach leads to more accurate modeling of phoneme durations.
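
    The idea of using one duration model per phoneme class can be illustrated with a deliberately simple stand-in. The sketch below replaces each class-specific DNN with a per-class mean predictor, which is an assumption made purely to keep the example short; the class name `PerClassDurationModel`, the class labels, and the millisecond unit are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


class PerClassDurationModel:
    """Keeps one tiny duration predictor per phoneme class.

    Each per-class predictor here is just the mean training duration,
    a stand-in for the class-specific DNNs described in the abstract.
    """

    def __init__(self):
        self.class_mean = {}

    def fit(self, samples):
        """samples: iterable of (phoneme_class, duration_ms) pairs."""
        by_class = defaultdict(list)
        for phoneme_class, duration_ms in samples:
            by_class[phoneme_class].append(duration_ms)
        # Train each class model independently of the others.
        self.class_mean = {c: mean(d) for c, d in by_class.items()}
        return self

    def predict(self, phoneme_class):
        """Route the query to the model of its own phoneme class."""
        return self.class_mean[phoneme_class]
```

    The design point is the routing step in `predict`: a geminated consonant is never scored by the model fitted on simple consonants, so each model only has to capture the duration behavior of its own class.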

    Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic

    This paper investigates the use of hidden Markov models (HMM) for Modern Standard Arabic speech synthesis. HMM-based speech synthesis systems require a description of each speech unit with a set of contextual features that specifies phonetic, phonological and linguistic aspects. To apply this method to the Arabic language, a study of its particularities was conducted to extract suitable contextual features. Two phenomena are highlighted: vowel quantity and gemination. This work focuses on how to model geminated consonants (resp. long vowels), either considering them as fully-fledged phonemes or as the same phonemes as their simple (resp. short) counterparts but with a different duration. Four modelling approaches have been proposed for this purpose. Results of subjective and objective evaluations show that there is no important difference between differentiating modelling units associated with geminated consonants (resp. long vowels) from modelling units associated with simple consonants (resp. short vowels) and merging them, as long as gemination and vowel quantity information is included in the set of features.

    Arabic Speech Corpus
