VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
In this paper, we propose a non-parallel any-to-many voice conversion (VC)
method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel
waveform generation method, VoiceGrad is based upon the concepts of score
matching and Langevin dynamics. It uses weighted denoising score matching to
train a score approximator, a fully convolutional network with a U-Net
structure that predicts the gradient of the log density of the speech feature
sequences of multiple speakers. VC is then performed with annealed Langevin
dynamics, which iteratively updates the input feature sequence towards the
nearest stationary point of the target distribution under the trained score
approximator network. Owing to this formulation, VoiceGrad enables
any-to-many VC, a VC scenario in which the speaker of input speech can be
arbitrary, and allows for non-parallel training, which requires no parallel
utterances or transcriptions.
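As a rough sketch of the conversion step described above, the following code runs annealed Langevin dynamics with a trained score network. The score_net interface, the noise schedule sigmas, and the step-size rule are illustrative assumptions, not the authors' exact implementation.

```python
import torch

@torch.no_grad()
def annealed_langevin_vc(score_net, x, target_speaker, sigmas,
                         steps_per_level=50, step_scale=2e-5):
    """Push an input feature sequence x toward the target speaker's
    distribution with annealed Langevin dynamics.

    score_net(x, speaker_id, sigma) is assumed to return the estimated
    gradient of the log density (the score) of the features at noise level
    sigma; sigmas is a decreasing list of noise levels (annealing schedule).
    """
    for sigma in sigmas:                                 # large noise -> small noise
        alpha = step_scale * (sigma / sigmas[-1]) ** 2   # step size shrinks with sigma
        for _ in range(steps_per_level):
            score = score_net(x, target_speaker, sigma)
            noise = torch.randn_like(x)
            # Langevin update: ascend the log density, plus exploratory noise.
            x = x + 0.5 * alpha * score + (alpha ** 0.5) * noise
    return x  # feature sequence near a stationary point of the target density
```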
Model architectures to extrapolate emotional expressions in DNN-based text-to-speech
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, "extrapolating emotional expressions" means borrowing emotional expressions from other speakers, so that collecting emotional speech uttered by the target speakers is unnecessary. Although DNNs have the capacity to build TTS with emotional expressions, and some DNN-based TTS systems have demonstrated satisfactory performance in expressing the diversity of human speech, collecting emotional speech uttered by target speakers is both necessary and troublesome. To solve this issue, we propose architectures that train the speaker feature and the emotional feature separately and synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), serial model (SM), auxiliary input model (AIM), and hybrid models (PM&AIM and SM&AIM). These models are trained on emotional speech uttered by a few speakers and neutral speech uttered by many speakers. Objective evaluations indicate that performance in the open-emotion test is insufficient compared with that in the closed-emotion test, because each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models can convey emotional information to some extent. Notably, the PM can correctly convey sad and joyful emotions at a rate of >60%.
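The abstract does not spell out the network configurations, but the auxiliary-input idea can be sketched as follows: separately learned speaker and emotion embeddings are appended to the linguistic input, so a speaker seen only with neutral speech can be combined at synthesis time with an emotion learned from other speakers. All names, dimensions, and layer sizes below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class AuxiliaryInputAcousticModel(nn.Module):
    """Sketch of an auxiliary-input style DNN acoustic model: speaker and
    emotion are modeled as separate embeddings concatenated with the
    linguistic features, so unseen speaker-emotion combinations can be
    formed at synthesis time. (Sizes are illustrative assumptions.)
    """
    def __init__(self, n_linguistic=300, n_speakers=50, n_emotions=4,
                 emb_dim=32, hidden=512, n_acoustic=187):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, emb_dim)   # speaker feature
        self.emotion_emb = nn.Embedding(n_emotions, emb_dim)   # emotional feature
        self.net = nn.Sequential(
            nn.Linear(n_linguistic + 2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_acoustic),                      # acoustic features per frame
        )

    def forward(self, linguistic, speaker_id, emotion_id):
        spk = self.speaker_emb(speaker_id)                      # (batch, emb_dim)
        emo = self.emotion_emb(emotion_id)                      # (batch, emb_dim)
        frames = linguistic.size(1)
        aux = torch.cat([spk, emo], dim=-1).unsqueeze(1).expand(-1, frames, -1)
        return self.net(torch.cat([linguistic, aux], dim=-1))

# Example: combine a target speaker trained only on neutral speech with an
# emotion learned from other speakers (IDs here are hypothetical).
model = AuxiliaryInputAcousticModel()
linguistic = torch.randn(1, 120, 300)            # (batch, frames, linguistic dims)
acoustic = model(linguistic, torch.tensor([7]), torch.tensor([2]))
```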