8 research outputs found

    Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion

    In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation; it is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method that retains the simplicity and text-independence of frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 in Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both mean/variance normalization and the baseline GMM conversion.
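The baseline mentioned in this abstract, frame-level mean and variance normalization, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the normalization is applied in the log-F0 domain and that unvoiced frames are marked with F0 = 0, neither of which the abstract specifies. The function names and the toy contours are hypothetical.

```python
import math
from statistics import fmean, pstdev


def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (f0 > 0)."""
    logs = [math.log(v) for v in f0 if v > 0]
    return fmean(logs), pstdev(logs)


def convert_f0(f0_src, src_stats, tgt_stats):
    """Map each voiced source frame to the target speaker's log-F0 statistics.

    Assumption: unvoiced frames are encoded as 0 and passed through unchanged.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = []
    for v in f0_src:
        if v <= 0:
            out.append(0.0)  # unvoiced frame: nothing to convert
        else:
            z = (math.log(v) - mu_s) / sd_s  # normalize with source statistics
            out.append(math.exp(z * sd_t + mu_t))  # rescale to target statistics
    return out


# Hypothetical toy contours in Hz; no parallel (time-aligned) data is needed,
# because only per-speaker statistics are used.
src_train = [100.0, 110.0, 0.0, 120.0, 130.0]
tgt_train = [200.0, 220.0, 240.0, 0.0, 260.0]

src_stats = log_f0_stats(src_train)
tgt_stats = log_f0_stats(tgt_train)
converted = convert_f0(src_train, src_stats, tgt_stats)
```

Because each frame is converted independently from global statistics, this baseline needs neither alignment nor text, which is the property the proposed method aims to keep.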

    Creation of HMM-based Speech Model for Estonian Text-to-Speech Synthesis

    The main purpose of this bachelor's thesis is to create hidden Markov model (HMM) based speech models of a male and a female voice for Estonian text-to-speech synthesis. It begins with a brief overview of the text-to-speech synthesis process, the components of a typical speech synthesis system, and the techniques in common use. It then focuses on statistical parametric speech synthesis and describes the hidden Markov model-based technique used in HTS (HMM-based Speech Synthesis System), which is employed to generate the voice models for this work. The advantages and drawbacks of HTS are discussed, along with solutions to some of its problems. The practical part analyzes the creation of the speech corpus compiled and recorded at the Institute of the Estonian Language. It presents the guidelines for creating the corpus, its connection with the text analysis module, and the resulting constraints, and describes the changes to the phonetic system that followed from the development of the text analysis modules. It covers aspects of recording the speech corpus and the guidelines for evaluating the quality of the recorded signal, and analyzes the unforeseen findings that affected the quality of the corpus, from which new guidelines for corpus construction were derived. Adapting HTS to Estonian essentially means compiling a phonetic and phonological specification and preparing the training data. The linguistic specification is made compatible with the available text analysis module so that the trained voice models can be used in Estonian speech synthesis applications. Experiments are carried out on data from different speakers, with different subcorpora and linguistic specifications. Examples of speech generated with the male and female voice models trained with HTS are presented and used to evaluate the models. The evaluation gave the expected findings: the most important factors affecting voice model quality are the size and quality of the training corpus, followed by the ability of the text analysis module to generate accurate pronunciation text and, third, the optimization of phonetic and phonological contextual factors. Finally, two possible courses of action are proposed to improve the quality of the trained HMM-based speech models: implementing the STRAIGHT vocoder to reduce the buzziness of synthesized speech, and optimizing the phonetic and phonological contextual factors.

    HMM Based Text-to-Speech Synthesis for Telugu

    This thesis describes a novel approach to building a general-purpose, working Telugu text-to-speech (TTS) synthesis system based on hidden Markov models (HMMs) that is reasonably intelligible, natural sounding and flexible. Several attempts have been made to use HMMs for constructing TTS systems. Most such systems are based on waveform concatenation techniques. To fully convey the information present in speech signals, text-to-speech synthesis systems must be able to generate natural-sounding speech with arbitrary speaker individualities and emotions (e.g., anger, sadness, joy). To represent all these factors, Mel-cepstral coefficients are extracted as spectral parameters. Excitation parameters are extracted using fundamental frequency (F0).

    Voice characteristics conversion for HMM-based speech synthesis system
