6 research outputs found

    Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis

    Deep learning techniques have brought great improvements to expressive audiovisual text-to-speech synthesis (EAVTTS). Generating realistic speech nevertheless remains an open issue, and recent work in this area has focused on controlling speech variability. In this paper, we use different neural architectures to synthesize emotional speech. We study the application of unsupervised learning techniques to emotional speech modeling, as well as methods for restructuring the representation of emotions so that it becomes continuous and more flexible; manipulating this representation should allow us to generate new styles of speech by mixing emotions. We first present our expressive audiovisual corpus and validate its emotional content with three perceptual experiments using acoustic-only, visual-only, and audiovisual stimuli. We then analyze how well a fully connected neural network learns the characteristics specific to each emotion, both for phone duration and for the acoustic and visual modalities, and we study how joint versus separate training of the acoustic and visual modalities affects the quality of the generated synthetic speech.

    In the second part of the paper, we use a conditional variational autoencoder (CVAE) architecture to learn a latent representation of emotions, applied in an unsupervised manner to generate features of expressive speech. We use a probabilistic metric that measures the degree of overlap between the latent clusters of the emotions to choose the best parameters for the CVAE. By manipulating the latent vectors, we are able to generate nuances of a given emotion, as well as new emotions that do not exist in our database and that retain coherent articulation. We conducted four perceptual experiments to evaluate these findings.
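    The abstract gives no implementation details, so the following is only a minimal sketch of a frame-level CVAE conditioned on a one-hot emotion label, written in PyTorch. Every name and dimension here (CVAE, feat_dim, emo_dim, latent_dim, cvae_loss) is an illustrative assumption, not the authors' configuration.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CVAE(nn.Module):
            """Conditional VAE over speech feature frames (sketch)."""
            def __init__(self, feat_dim=80, emo_dim=6, latent_dim=16, hidden=256):
                super().__init__()
                self.enc = nn.Sequential(
                    nn.Linear(feat_dim + emo_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU())
                self.mu = nn.Linear(hidden, latent_dim)
                self.logvar = nn.Linear(hidden, latent_dim)
                self.dec = nn.Sequential(
                    nn.Linear(latent_dim + emo_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, feat_dim))

            def forward(self, x, c):
                # encode the frame together with its emotion label
                h = self.enc(torch.cat([x, c], dim=-1))
                mu, logvar = self.mu(h), self.logvar(h)
                # reparameterization trick: z ~ N(mu, sigma^2)
                z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
                return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

        def cvae_loss(x_hat, x, mu, logvar):
            # reconstruction term plus KL divergence to the standard normal prior
            recon = F.mse_loss(x_hat, x)
            kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return recon + kld

    Under this sketch, mixing emotions amounts to interpolating latent vectors before decoding, e.g. z_mix = 0.5 * z_joy + 0.5 * z_sad. The abstract does not name its probabilistic overlap metric; one common choice for Gaussian latent clusters, shown purely as an assumption, is the Bhattacharyya distance with diagonal covariances:

        def bhattacharyya(mu1, var1, mu2, var2):
            # distance between two diagonal Gaussians; the overlap
            # coefficient exp(-distance) lies in (0, 1]
            var = 0.5 * (var1 + var2)
            t1 = 0.125 * torch.sum((mu1 - mu2) ** 2 / var)
            t2 = 0.5 * torch.sum(torch.log(var)
                                 - 0.5 * (torch.log(var1) + torch.log(var2)))
            return t1 + t2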

    Parametric synthesis of expressive speech

    This thesis presents methods for expressive speech synthesis based on parametric approaches. It shows that deep neural networks achieve better results than synthesis based on hidden Markov models. Three new methods for synthesizing expressive speech with deep neural networks are proposed: style codes, model re-training, and a shared-hidden-layer architecture; the style code method gives the best results. A new method for emotion/style transplantation based on the shared-hidden-layer architecture is also proposed, and it is rated better than the reference method from the literature.
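    No implementation details accompany this abstract, so the snippet below is only a minimal sketch of the style-code idea: a feed-forward acoustic model whose linguistic input is augmented with a learned per-style embedding. All names and dimensions (StyleCodeDNN, ling_dim, n_styles, style_dim, out_dim) are assumptions for illustration, not the thesis's exact setup.

        import torch
        import torch.nn as nn

        class StyleCodeDNN(nn.Module):
            """Feed-forward acoustic model with a style code appended to its input."""
            def __init__(self, ling_dim=300, n_styles=5, style_dim=8,
                         hidden=512, out_dim=187):
                super().__init__()
                self.style_emb = nn.Embedding(n_styles, style_dim)  # one code per style
                self.net = nn.Sequential(
                    nn.Linear(ling_dim + style_dim, hidden), nn.Tanh(),
                    nn.Linear(hidden, hidden), nn.Tanh(),
                    nn.Linear(hidden, out_dim))  # acoustic parameters per frame

            def forward(self, ling_feats, style_id):
                code = self.style_emb(style_id)             # (B, style_dim)
                x = torch.cat([ling_feats, code], dim=-1)   # append style code
                return self.net(x)

    The shared-hidden-layer architecture mentioned in the abstract would instead keep the hidden layers common to all styles and attach one style-specific output layer per style; transplantation then roughly corresponds to reusing the shared layers together with an output layer trained on other data.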

    Expressive Multilingual Speech Synthesizer

    The aim of this thesis is to investigate the possibility of synthesizing speech in the voice of a speaker in a language that speaker has never spoken. Multilingual models are created both for languages whose speech databases are annotated using the same conventions and for languages whose databases are annotated using different conventions, including Serbian. In terms of the quality of the synthesized speech, some of these models even surpass standard models trained on speech material in a single language. Besides the architecture for multilingual models, a method for adapting such a model to a new speaker is proposed. This adaptation enables fast and simple production of new voices while preserving the ability to synthesize speech in all languages supported by the model, regardless of the new speaker's original language.
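    As a reading aid only, here is a minimal sketch of one way such a multilingual model could be organized: per-language input projections (so that databases annotated with different conventions can coexist) feeding a shared core, plus a speaker embedding. Every name and dimension (MultilingualTTS, lang_in_dims, spk_dim, out_dim) is an assumption for illustration, not the thesis's architecture.

        import torch
        import torch.nn as nn

        class MultilingualTTS(nn.Module):
            def __init__(self, lang_in_dims, n_speakers, spk_dim=64,
                         hidden=512, out_dim=80):
                super().__init__()
                # one input projection per language, so each language may keep
                # its own annotation conventions and feature dimensionality
                self.lang_in = nn.ModuleDict(
                    {lang: nn.Linear(dim, hidden)
                     for lang, dim in lang_in_dims.items()})
                self.spk_emb = nn.Embedding(n_speakers, spk_dim)
                # shared core used by all languages and all speakers
                self.core = nn.Sequential(
                    nn.Linear(hidden + spk_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, out_dim))

            def forward(self, feats, lang, speaker_id):
                h = torch.relu(self.lang_in[lang](feats))   # language-specific front end
                s = self.spk_emb(speaker_id)                # speaker identity
                return self.core(torch.cat([h, s], dim=-1))

        # e.g. model = MultilingualTTS({"en": 420, "sr": 367}, n_speakers=10)

    Under this sketch, adapting to a new speaker would mean freezing the language front ends and the shared core and training only a fresh speaker-embedding row on the new speaker's data; the new voice then keeps synthesis in every language the core supports, which matches the adaptation property the abstract describes.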