Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision
This paper presents methods for making use of text supervision to improve
the performance of sequence-to-sequence (seq2seq) voice conversion. Compared
with conventional frame-to-frame voice conversion approaches, the seq2seq
acoustic modeling method proposed in our previous work achieved higher
naturalness and similarity. In this paper, we further improve its performance
by utilizing the text transcriptions of parallel training data. First, a
multi-task learning structure is designed which adds auxiliary classifiers to
the middle layers of the seq2seq model and predicts linguistic labels as a
secondary task. Second, a data-augmentation method is proposed which utilizes
text alignment to produce extra parallel sequences for model training.
Experiments are conducted to evaluate our proposed method with training sets of
different sizes. Experimental results show that multi-task learning with
linguistic labels is effective at reducing the errors of seq2seq voice
conversion. The data-augmentation method can further improve the performance of
seq2seq voice conversion when only 50 or 100 training utterances are available.
Comment: 5 pages, 4 figures, 2 tables. Submitted to IEEE ICASSP 201
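
The abstract does not spell out how text alignment yields extra parallel sequences, so the following is only a minimal sketch of one plausible reading: given word-level alignments for a source and a target utterance with the same transcription, cut out matching contiguous word spans and treat each pair of fragments as an additional training pair. The alignment format `(word, start_frame, end_frame)`, the function name `augment_parallel_pair`, and the `min_words` parameter are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical alignment format: (word, start_frame, end_frame), end exclusive.
Alignment = List[Tuple[str, int, int]]


def augment_parallel_pair(
    src_frames: Sequence,
    tgt_frames: Sequence,
    src_align: Alignment,
    tgt_align: Alignment,
    min_words: int = 2,
) -> List[Tuple[Sequence, Sequence]]:
    """Return extra (source, target) fragment pairs cut at word boundaries."""
    src_words = [w for w, _, _ in src_align]
    tgt_words = [w for w, _, _ in tgt_align]
    if src_words != tgt_words:
        raise ValueError("source and target must share the same transcription")

    n = len(src_align)
    pairs = []
    for i in range(n):
        for j in range(i + min_words, n + 1):
            if i == 0 and j == n:
                continue  # the full utterance is already in the training set
            # Frame range covering words i .. j-1 on each side.
            s0, s1 = src_align[i][1], src_align[j - 1][2]
            t0, t1 = tgt_align[i][1], tgt_align[j - 1][2]
            pairs.append((src_frames[s0:s1], tgt_frames[t0:t1]))
    return pairs
```

In this sketch a three-word utterance with `min_words=2` yields two extra pairs (words 1 to 2 and words 2 to 3). Whether the paper samples fragments this way, or restricts their length, is not stated in the abstract.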