Multimodal Speech Synthesis Architecture for Unsupervised Speaker Adaptation
This paper proposes a new architecture for speaker adaptation of
multi-speaker neural-network speech synthesis systems, in which an unseen
speaker's voice can be built using a relatively small amount of speech data
without transcriptions. This is sometimes called "unsupervised speaker
adaptation". More specifically, we concatenate the layers to the audio inputs
when performing unsupervised speaker adaptation while we concatenate them to
the text inputs when synthesizing speech from text. Two new training schemes
for the new architecture are also proposed in this paper. These training
schemes are not limited to speech synthesis; other applications are suggested as well.
Experimental results show that the proposed model not only enables adaptation
to unseen speakers using untranscribed speech but also improves the
performance of multi-speaker modeling and speaker adaptation using transcribed
audio files.

Comment: Accepted for Interspeech 2018, India
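The routing idea in the abstract — shared synthesis layers fed by an audio-side input layer during unsupervised adaptation and by a text-side input layer during synthesis — can be illustrated with a minimal sketch. This is not the authors' implementation; all dimensions, weight names, and the `forward` function are hypothetical, and real systems would use trained neural networks rather than random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
TEXT_DIM, AUDIO_DIM, HIDDEN, OUT_DIM = 8, 12, 16, 10

# Shared layers used by both modalities (stand-in for the shared
# synthesis network in the paper).
W_shared = rng.normal(scale=0.1, size=(HIDDEN, OUT_DIM))
# Modality-specific input layers: one attached to text inputs,
# one attached to audio inputs.
W_text = rng.normal(scale=0.1, size=(TEXT_DIM, HIDDEN))
W_audio = rng.normal(scale=0.1, size=(AUDIO_DIM, HIDDEN))

def forward(x, mode):
    """Route the input through its modality-specific layer, then the shared layers."""
    W_in = W_text if mode == "text" else W_audio
    h = np.tanh(x @ W_in)   # modality-specific input layer
    return h @ W_shared     # shared synthesis layers

text_feat = rng.normal(size=(1, TEXT_DIM))    # stand-in text features
audio_feat = rng.normal(size=(1, AUDIO_DIM))  # stand-in untranscribed audio features

y_tts = forward(text_feat, "text")      # text-to-speech path
y_adapt = forward(audio_feat, "audio")  # unsupervised-adaptation path
```

Because both paths end in the same shared layers, gradients computed on untranscribed audio (the adaptation path) update the same parameters later used when synthesizing from text — the property the abstract attributes to the multimodal design.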