2 research outputs found
DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis
This paper proposes novel algorithms for speaker embedding using subjective
inter-speaker similarity based on deep neural networks (DNNs). Although
conventional DNN-based speaker embedding such as a d-vector can be applied to
multi-speaker modeling in speech synthesis, it does not correlate with the
subjective inter-speaker similarity and is not necessarily an appropriate
speaker representation for open speakers whose speech utterances are not
included in the training data. We propose two training algorithms for a
DNN-based speaker embedding model using an inter-speaker similarity matrix
obtained by large-scale subjective scoring.
large-scale subjective scoring. One is based on similarity vector embedding and
trains the model to predict a vector of the similarity matrix as speaker
representation. The other is based on similarity matrix embedding and trains
the model to minimize the squared Frobenius norm between the similarity matrix
and the Gram matrix of d-vectors, i.e., the inter-speaker similarity derived
from the d-vectors. We crowdsourced the inter-speaker similarity scores of
153 Japanese female speakers, and the experimental results demonstrate that our
algorithms learn speaker embedding that is highly correlated with the
subjective similarity. We also apply the proposed speaker embedding to
multi-speaker modeling in DNN-based speech synthesis and reveal that the
proposed similarity vector embedding improves synthetic speech quality for open
speakers whose speech utterances are unseen during the training.
Comment: 6 pages, 7 figures, accepted for The 10th ISCA Speech Synthesis
Workshop (SSW10).
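The similarity-matrix embedding described above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it assumes a batch of d-vectors stacked row-wise into a matrix D, computes their Gram matrix D Dᵀ, and returns the squared Frobenius norm of its difference from the subjective similarity matrix S. The function name and toy data are hypothetical.

```python
import numpy as np

def similarity_matrix_loss(d_vectors, similarity_matrix):
    """Squared Frobenius norm ||S - D D^T||_F^2, where D D^T is the
    Gram matrix of the d-vectors (one row per speaker)."""
    gram = d_vectors @ d_vectors.T          # inter-speaker similarity derived from d-vectors
    diff = similarity_matrix - gram
    return float(np.sum(diff ** 2))         # squared Frobenius norm

# Toy usage: 3 speakers, 2-dimensional d-vectors (values are illustrative).
S = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
D = np.random.randn(3, 2)
loss = similarity_matrix_loss(D, S)
```

In training, this loss would be minimized with respect to the network producing the d-vectors, so that the learned Gram matrix approaches the subjectively scored similarity matrix.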
JVS-MuSiC: Japanese multispeaker singing-voice corpus
Thanks to developments in machine learning techniques, it has become possible
to synthesize high-quality singing voices of a single singer. An open
multispeaker singing-voice corpus would further accelerate the research in
singing-voice synthesis. However, conventional singing-voice corpora only
consist of the singing voices of a single singer. We designed a Japanese
multispeaker singing-voice corpus called "JVS-MuSiC" with the aim of analyzing
and synthesizing a variety of voices. The corpus consists of 100 singers'
recordings of the same song, Katatsumuri, which is a Japanese children's song.
It also includes another song that is different for each singer. In this paper,
we describe the design of the corpus and experimental analyses using JVS-MuSiC.
We investigated 1) the relationship between the similarity of singing voices
and the perceptual oneness of unison singing voices, and 2) the relationship
between singing-voice similarity and speech similarity. The results suggest that 1) there is a
positive and moderate correlation between singing-voice similarity and the
oneness of unison and that 2) the correlation between singing-voice similarity
and speech similarity is weak. This corpus is freely available online.
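The correlation analyses above can be illustrated with a short sketch. This assumes per-pair similarity scores as plain lists; the function and data below are hypothetical, not taken from the corpus.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

# Illustrative pair scores (made up, not corpus data).
singing_similarity = [0.2, 0.5, 0.7, 0.9]
unison_oneness     = [0.3, 0.4, 0.8, 0.85]
r = pearson(singing_similarity, unison_oneness)
```

A coefficient near +1 would indicate a strong positive relationship; the paper reports a moderate positive correlation for singing-voice similarity vs. unison oneness and a weak one vs. speech similarity.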