2 research outputs found

A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

Authors: Matsunaga Noriyuki, Ohtani Yamato, Tobing Patrick Lumban, Toda Tomoki, Wu Yi-Chiao, Yasuhara Kazuki
Publication date: 06/08/2020

Recently, the effectiveness of text-to-speech (TTS) systems combined with neural vocoders to generate high-fidelity speech has been shown. However, collecting the required training data and building these advanced systems from scratch are time- and resource-consuming. An economical approach is to develop a neural vocoder to enhance the speech generated by existing or low-cost TTS systems. Nonetheless, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the neural vocoders. Because of its generality, this framework can be applied to arbitrary TTS systems and neural vocoders. In this paper, we apply the proposed method with a state-of-the-art WaveNet vocoder for two different basic TTS systems, and both objective and subjective experimental results confirm the effectiveness of the proposed framework.

Comment: 5 pages, 8 figures, 1 table. Proc. Interspeech, 202

arXiv.org e-Print Archive

Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression

Authors: Hayashi Tomoki, Kobayashi Kazuhiro, Tobing Patrick Lumban, Toda Tomoki, Wu Yi-Chiao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 06/04/2020

In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when the WN vocoder is combined with a VC system, distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, causing WN to generate very noisy speech segments called collapsed speech. To tackle this problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) that avoids the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is applied only to the problematic segments, limiting the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to the Voice Conversion Challenge 2018.

Comment: 13 pages, 13 figures, 1 table, accepted to publish in IEEE Acces

arXiv.org e-Print Archive