4,816 research outputs found

    Speech Synthesis Based on Hidden Markov Models


    Statistical parametric speech synthesis for Ibibio

    Ibibio is a Nigerian tone language, spoken in the south-east coastal region of Nigeria. Like most African languages, it is resource-limited. This presents a major challenge to conventional approaches to speech synthesis, which typically require the training of numerous predictive models of linguistic features such as the phoneme sequence (i.e., a pronunciation dictionary plus a letter-to-sound model) and prosodic structure (e.g., a phrase break predictor). This training is invariably supervised, requiring a corpus of training data labelled with the linguistic feature to be predicted. In this paper, we investigate what can be achieved in the absence of many of these expensive resources, and also with a limited amount of speech recordings. We employ a statistical parametric method, because this has been found to offer good performance even on small corpora, and because it is able to directly learn the relationship between acoustics and whatever linguistic features are available, potentially mitigating the absence of explicit representations of intermediate linguistic layers such as prosody. We present an evaluation that compares systems that have access to varying degrees of linguistic structure. The simplest system only uses phonetic context (quinphones), and this is compared to systems with access to a richer set of context features, with or without tone marking. It is found that the use of tone marking contributes significantly to the quality of synthetic speech. Future work should therefore address the problem of tone assignment using a dictionary and the building of a prediction module for out-of-vocabulary words. Key words: speech synthesis, Ibibio, low-resource languages, HT
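The simplest system in the comparison above conditions only on quinphone context. A minimal sketch of extracting quinphone context windows from a phone sequence (the `sil` padding symbol and the toy phone labels are illustrative assumptions, not taken from the Ibibio system):

```python
def quinphone_contexts(phones):
    """Return (ll, l, c, r, rr) context tuples, one per phone.

    Sequence edges are padded with a silence symbol "sil" so every
    phone gets a full five-phone window.
    """
    padded = ["sil", "sil"] + list(phones) + ["sil", "sil"]
    return [tuple(padded[i - 2:i + 3]) for i in range(2, len(padded) - 2)]

# Example: a toy phone string for the word "ibibio"
contexts = quinphone_contexts(["i", "b", "i", "b", "i", "o"])
```

Each tuple then serves as the context label for one synthesis unit; richer systems extend these tuples with further features such as tone marks.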

    Development of a Yoruba Text-to-Speech System Using Festival

    This paper presents a Text-to-Speech (TTS) synthesis system for the Yorùbá language using the open-source Festival TTS engine. Yorùbá, like most African languages, is resource-scarce, which presents a major challenge to conventional speech synthesis approaches that typically require large corpora for training such systems. Speech data were recorded in a quiet environment with a noise-cancelling microphone on a typical multimedia computer system using the Speech Filing System (SFS) software, then analysed and annotated using the PRAAT speech processing software. The system was evaluated for intelligibility and naturalness using mean opinion scores. The results show that the intelligibility and naturalness of the system at word level are 55.56% and 50% respectively, but the system performs poorly on both tests at sentence level. Hence, there is a need for further research to improve the quality of the synthesized speech. Keywords: Text-to-Speech, Festival, Yorùbá, Syllable
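The percentages above come from mean opinion score (MOS) listening tests. A minimal sketch of averaging 1-5 listener ratings into a MOS and rescaling it to a percentage; the rescaling convention and the sample ratings are assumptions for illustration, not the paper's data:

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average 1-5 listener ratings into a mean opinion score (MOS)."""
    return mean(ratings)

def mos_as_percent(ratings):
    """Rescale a 1-5 MOS onto 0-100%: 1 -> 0%, 5 -> 100%.

    This mapping is an assumed convention, not necessarily the one
    used in the paper.
    """
    return 100.0 * (mean_opinion_score(ratings) - 1) / 4

# Hypothetical word-level naturalness ratings from five listeners
ratings = [3, 3, 3, 3, 3]
```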

    Automatic Selection of Synthesis Units from a Large Speech Database


    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.

    An introduction to statistical parametric speech synthesis


    An HCI Speech-Based Architecture for Man-To-Machine and Machine-To-Man Communication in Yorùbá Language

    Man communicates with man by natural language, sign language, and/or gesture, but communicates with machine via electromechanical devices such as the mouse and keyboard. These media of effecting Man-To-Machine (M2M) communication are electromechanical in nature. Recent research, however, has achieved a high level of success in M2M communication using natural language, sign language, and/or gesture under constrained conditions. Machine communication with man in the reverse direction using natural language is still in its infancy; machine usually communicates with man in textual form. In order to achieve acceptable quality of end-to-end M2M communication, a robust architecture is needed for a novel speech-to-text and text-to-speech system. In this paper, an HCI speech-based architecture for Man-To-Machine and Machine-To-Man communication in the Yorùbá language is proposed, to carry the Yorùbá people along in the advancement taking place in the world of Information Technology. Dynamic Time Warping is specified in the model to measure the similarity between voice utterances in the sound library. In addition, Vector Quantization, Gaussian Mixture Models and Hidden Markov Models are incorporated in the proposed architecture for compression and observation. This approach will yield a robust Speech-To-Text and Text-To-Speech system. Keywords: Yorùbá Language, Speech Recognition, Text-To-Speech, Man-To-Machine, Machine-To-Man
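Dynamic Time Warping, named in the architecture above for comparing utterances, can be sketched as a textbook dynamic-programming alignment. This is a minimal sketch over scalar sequences; a real recogniser would compare per-frame feature vectors such as MFCCs:

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    Fills a cumulative-cost table where each cell adds the local
    distance to the cheapest of the three predecessor alignments.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because the alignment may stretch either sequence, a repeated sample costs nothing: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0, which is exactly why DTW suits utterances spoken at different speeds.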

    Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context

    Prosody and prosodic boundaries carry significant linguistic and paralinguistic information and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect of this lies in learning long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using updated probabilities as the contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. When using the auto-context algorithm with a support vector machine, we can improve detection accuracy by about 3% and F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.
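The auto-context loop described above can be sketched as follows. The paper uses a support vector machine, so the tiny centroid-based classifier here is only a stand-in, and the window size, round count, and toy data are illustrative assumptions:

```python
import math

class CentroidClassifier:
    """Tiny probabilistic classifier standing in for the paper's SVM."""

    def fit(self, X, y):
        def centroid(label):
            rows = [x for x, t in zip(X, y) if t == label]
            return [sum(col) / len(rows) for col in zip(*rows)]
        self.c0, self.c1 = centroid(0), centroid(1)
        return self

    def predict_proba(self, x):
        # Closer to the class-1 centroid -> higher P(event).
        d0, d1 = math.dist(x, self.c0), math.dist(x, self.c1)
        return d0 / (d0 + d1 + 1e-9)

def auto_context(X, y, rounds=3, window=2):
    """Auto-context training loop.

    X: per-frame local feature vectors (one utterance, in order);
    y: per-frame 0/1 prosodic-event labels.  Each round appends the
    previous round's event probabilities at +/-window neighbouring
    frames to the local features, then retrains on the enlarged
    feature vectors.
    """
    n = len(X)
    probs = [0.5] * n  # uninformative prior before round 1
    clf = None
    for _ in range(rounds):
        feats = [list(X[i]) + [probs[min(max(i + d, 0), n - 1)]
                               for d in range(-window, window + 1)]
                 for i in range(n)]
        clf = CentroidClassifier().fit(feats, y)
        probs = [clf.predict_proba(f) for f in feats]
    return clf, probs
```

The key design point is that the classifier in later rounds sees its own earlier predictions for neighbouring frames, which is how long-distance dependencies enter without an explicit sequence model.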