14 research outputs found

    A New Method of Speaker Adaptation in Parametric Speech Synthesis (Nova metoda adaptacije na govornika u parametarskoj sintezi govora)

    The thesis describes and compares several methods of speaker adaptation using deep neural networks: simple fine-tuning of the system, a method that uses shared and separate layers for different speakers, and adaptation in two phases. The last method starts from a multi-speaker model and a trained speaker embedding space. Adaptation to a new speaker takes place in two phases: 1) searching for the optimal point in the speaker embedding space; 2) adapting the parameters of the rest of the network. Comparisons of objective measures and listening tests show that the last approach yields the best results.
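The two-phase adaptation described in this abstract can be illustrated with a toy sketch. It is not the thesis's implementation: the "network" is reduced to a single shared weight `w` plus a scalar speaker embedding `e`, the data are synthetic, and keeping `e` fixed during the second phase is an assumption. All names (`adapt_two_phase`, `mse`, `lr`, `steps`) are hypothetical.

```python
def mse(w, e, data):
    """Mean squared error of the toy model y = w * x + e."""
    return sum((w * x + e - y) ** 2 for x, y in data) / len(data)


def adapt_two_phase(w, data, lr=0.05, steps=500):
    """Adapt a pretrained toy model to a new speaker in two phases."""
    n = len(data)
    e = 0.0  # starting point in the (here one-dimensional) speaker space

    # Phase 1: search for the best speaker embedding, shared weight frozen.
    for _ in range(steps):
        grad_e = sum(2 * (w * x + e - y) for x, y in data) / n
        e -= lr * grad_e
    loss_after_phase1 = mse(w, e, data)

    # Phase 2: adapt the rest of the model (assumption: embedding kept fixed).
    for _ in range(steps):
        grad_w = sum(2 * (w * x + e - y) * x for x, y in data) / n
        w -= lr * grad_w
    return w, e, loss_after_phase1, mse(w, e, data)


# Synthetic "new speaker": slightly different slope and an offset.
data = [(k / 10, 2.1 * (k / 10) + 0.5) for k in range(-10, 11)]
w, e, loss1, loss2 = adapt_two_phase(2.0, data)  # 2.0 stands in for the pretrained weight
print(round(w, 3), round(e, 3), loss2 <= loss1)
```

In this sketch the first phase alone recovers the speaker offset, and the second phase removes the residual error that no point in the speaker space could explain.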

    AUTOMATIC PROSODY GENERATION IN A TEXT-TO-SPEECH SYSTEM FOR HEBREW

    The paper presents the module for automatic prosody generation within a system for automatic synthesis of high-quality speech from arbitrary text in Hebrew. The high quality of synthesis is due to the high accuracy of automatic prosody generation, which enables the introduction of elements of natural Hebrew sentence prosody. Automatic morphological annotation of text is based on an expert algorithm relying on transformational rules. Syntactic-prosodic parsing is also rule-based, while the generation of the acoustic representation of prosodic features is based on classification and regression trees. A tree structure generated during the training phase enables accurate prediction of the acoustic representations of prosody, namely the durations of phonetic segments and the temporal evolution of fundamental frequency and energy. This approach to automatic prosody generation has led to an improvement in the quality of synthesized speech, as confirmed by listening tests.

    Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings

    The paper presents a novel architecture and method for speech synthesis in multiple languages, in the voices of multiple speakers and in multiple speaking styles, even in cases when speech from a particular speaker in the target language was not present in the training data. The method is based on the application of neural network embedding to combinations of speaker and style IDs, but also to phones in particular phonetic contexts, without any prior linguistic knowledge of their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but also to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a certain speaker in a language that he/she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e. that they are able to produce synthesized speech of good quality even in languages that they were not trained on.

    Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding

    The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e. mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the similarities and differences between speakers and speaking styles more efficiently. The initial model from which speaker/style adaptation was carried out was a multi-speaker/multi-style model based on 8.5 hours of American English speech data, corresponding to 16 different speaker/style combinations. The results of the experiments show that both versions of the obtained system, one using 10 minutes and the other as little as 30 seconds of target data, outperform the state of the art in parametric speaker/style-dependent speech synthesis. This opens up a wide range of applications of speaker/style-dependent speech synthesis based on small quantities of training data, in domains ranging from customer interaction in call centers to robot-assisted medical therapy.
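The embedding idea in this abstract, mapping each discrete speaker/style combination to a point in a low-dimensional continuous space, can be sketched as a plain lookup table. This is only an illustration, not the paper's architecture: the class `EmbeddingTable`, the dimension of 3, the random initialization, and the 4 x 4 split used to reach 16 combinations are all assumptions, and no training is performed.

```python
import random


class EmbeddingTable:
    """Lookup table from discrete IDs to continuous vectors (hypothetical helper)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.rows = {}

    def add(self, key):
        # Small random initial point; training would later move it.
        self.rows[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
        return self.rows[key]

    def lookup(self, key):
        return self.rows[key]


table = EmbeddingTable(dim=3)

# 16 speaker/style combinations (assumed here to be 4 speakers x 4 styles).
for speaker in range(4):
    for style in range(4):
        table.add((speaker, style))

# A new target speaker/style gets its own point in the same space.
new_vec = table.add(("target", "neutral"))
print(len(table.rows), len(new_vec))
```

Because every combination lives in the same space, adapting to a new voice only requires placing one additional point, which is one reason small amounts of target data can suffice.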