A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems
Recently, the effectiveness of text-to-speech (TTS) systems combined with
neural vocoders to generate high-fidelity speech has been shown. However,
collecting the required training data and building these advanced systems from
scratch are time- and resource-consuming. An economical approach is to develop a
neural vocoder to enhance the speech generated by existing or low-cost TTS
systems. Nonetheless, this approach usually suffers from two issues: 1)
temporal mismatches between TTS and natural waveforms and 2) acoustic
mismatches between training and testing data. To address these issues, we adopt
a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC
data for training and acoustically matched enhanced data for testing the neural
vocoders. Because of its generality, this framework can be applied to arbitrary
TTS systems and neural vocoders. In this paper, we apply the proposed method
with a state-of-the-art WaveNet vocoder for two different basic TTS systems,
and both objective and subjective experimental results confirm the
effectiveness of the proposed framework.
Comment: 5 pages, 8 figures, 1 table. Proc. Interspeech, 202
Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression
In this paper, we integrate a simple non-parallel voice conversion (VC)
system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression
technique. The effectiveness of WN as a vocoder for generating high-fidelity
speech waveforms on the basis of acoustic features has been confirmed in recent
works. However, when combining the WN vocoder with a VC system, the distorted
acoustic features, acoustic and temporal mismatches, and exposure bias usually
lead to significant speech quality degradation, making WN generate some very
noisy speech segments called collapsed speech. To tackle the problem, we take
conventional-vocoder-generated speech as the reference speech to derive a
linear predictive coding distribution constraint (LPCDC) to avoid the collapsed
speech problem. Furthermore, to mitigate the negative effects introduced by the
LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the
LPCDC is applied only to the problematic segments, limiting the loss of quality
to short periods. Objective and subjective evaluations are conducted, and the
experimental results confirm the effectiveness of the proposed method, which
further improves the speech quality of our previous non-parallel VC system
submitted to Voice Conversion Challenge 2018.
Comment: 13 pages, 13 figures, 1 table, accepted to publish in IEEE Acces