Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization
Many-to-many voice conversion with non-parallel training data has seen
significant progress in recent years, and StarGAN-based models have attracted
considerable interest for the task. However, most StarGAN-based methods have
only been evaluated in settings with few speakers and abundant training data.
In this work, we aim at
improving the data efficiency of the model and achieving a many-to-many
non-parallel StarGAN-based voice conversion for a relatively large number of
speakers with limited training samples. In order to improve data efficiency,
the proposed model uses a speaker encoder for extracting speaker embeddings and
conducts adaptive instance normalization (AdaIN) on convolutional weights.
Experiments are conducted with 109 speakers under two low-resource situations,
where the number of training samples is 20 and 5 per speaker. An objective
evaluation shows the proposed model is better than the baseline methods.
Furthermore, a subjective evaluation shows that the proposed model outperforms
the baseline method in both naturalness and similarity.
Comment: Accepted by ICASSP202
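The paper's model is not reproduced here, but the AdaIN operation it applies to convolutional weights can be illustrated with a minimal numpy sketch: normalize the kernel's statistics away, then re-scale and re-shift with statistics derived from a speaker embedding. The kernel shape and the speaker statistics below are placeholders, not values from the paper.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    # Classic AdaIN: strip the content's mean/std, impose the style's.
    mu = content.mean()
    sigma = content.std()
    normalized = (content - mu) / (sigma + eps)
    return style_std * normalized + style_mean

rng = np.random.default_rng(0)
kernel = rng.standard_normal((64, 32, 3))  # (out_ch, in_ch, width), assumed shape
spk_mean, spk_std = 0.1, 1.5               # in the paper these would come from a speaker encoder
modulated = adain(kernel, spk_mean, spk_std)
print(modulated.shape)  # shape is unchanged; only the statistics are re-targeted
```

The kernel's shape is untouched; after the operation its mean and standard deviation match the speaker-conditioned targets, which is how a single set of conv weights can be adapted per speaker.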
Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision
This paper presents methods of making use of text supervision to improve
the performance of sequence-to-sequence (seq2seq) voice conversion. Compared
with conventional frame-to-frame voice conversion approaches, the seq2seq
acoustic modeling method proposed in our previous work achieved higher
naturalness and similarity. In this paper, we further improve its performance
by utilizing the text transcriptions of parallel training data. First, a
multi-task learning structure is designed which adds auxiliary classifiers to
the middle layers of the seq2seq model and predicts linguistic labels as a
secondary task. Second, a data-augmentation method is proposed which utilizes
text alignment to produce extra parallel sequences for model training.
Experiments are conducted to evaluate our proposed method with training sets
of different sizes. Experimental results show that the multi-task learning with
linguistic labels is effective at reducing the errors of seq2seq voice
conversion. The data-augmentation method can further improve the performance of
seq2seq voice conversion when only 50 or 100 training utterances are available.
Comment: 5 pages, 4 figures, 2 tables. Submitted to IEEE ICASSP 201
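The multi-task structure described above combines a primary acoustic loss with an auxiliary linguistic-label classifier attached to a middle layer. A minimal numpy sketch of such a combined objective is given below; the L1 reconstruction loss, the auxiliary weight, and all tensor sizes are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for one frame's label prediction.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multitask_loss(pred_feats, target_feats, aux_logits, phone_labels, weight=0.1):
    # Primary task: acoustic-feature reconstruction (L1 here, as an assumption).
    recon = np.abs(pred_feats - target_feats).mean()
    # Secondary task: an auxiliary classifier on a middle layer predicts
    # linguistic labels; its loss is added with a small weight.
    aux = np.mean([cross_entropy(l, y) for l, y in zip(aux_logits, phone_labels)])
    return recon + weight * aux

rng = np.random.default_rng(1)
T, D, P = 8, 80, 40  # frames, feature dims, label classes (assumed sizes)
loss = multitask_loss(rng.standard_normal((T, D)),
                      rng.standard_normal((T, D)),
                      rng.standard_normal((T, P)),
                      rng.integers(0, P, size=T))
print(loss)  # single scalar objective driving both tasks
```

Training on the summed objective forces the middle layers to stay linguistically informative, which is the mechanism the abstract credits for reducing conversion errors.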