VISinger, an end-to-end singing voice synthesis (SVS) model, achieves better
performance than typical two-stage models while using fewer parameters. However,
VISinger has several problems: (1) text-to-phase mapping: the end-to-end model
learns a meaningless mapping from text to phase; (2) glitches: the harmonic
components corresponding to the periodic signal of voiced segments change
abruptly, producing audible artefacts; (3) low sampling rate: its sampling rate
of 24 kHz does not meet the needs of high-fidelity, full-band generation
(44.1 kHz or higher). In this paper, we propose VISinger 2 to
address these issues by integrating digital signal processing (DSP) methods
with VISinger. Specifically, inspired by recent advances in differentiable
digital signal processing (DDSP), we incorporate a DSP synthesizer into the
decoder to solve the above issues. The DSP synthesizer consists of a harmonic
synthesizer and a noise synthesizer to generate periodic and aperiodic signals,
respectively, from the latent representation z in VISinger. It supervises the
posterior encoder to extract a latent representation free of phase
information and prevents the prior encoder from modelling a text-to-phase mapping. To
avoid glitch artefacts, HiFi-GAN is modified to accept the waveforms
generated by the DSP synthesizer as a condition for producing the singing voice.
Moreover, with the improved waveform decoder, VISinger 2 manages to generate
44.1 kHz singing audio with richer expression and better quality. Experiments on
the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and
RefineSinger in both subjective and objective metrics.

Comment: Submitted to ICASSP 202
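
The harmonic-plus-noise decomposition mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in VISinger 2 the synthesizer parameters are predicted from the latent representation z, whereas here they are supplied by hand, and the noise branch is simplified to gain-scaled white noise. The function names (harmonic_synth, noise_synth, dsp_synth) and all parameters are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math
import random


def harmonic_synth(f0, amps, sample_rate):
    """Periodic part: sum sinusoids at integer multiples of a per-sample f0 (Hz).

    f0   : per-sample fundamental frequencies; 0.0 marks unvoiced samples
    amps : per-harmonic amplitudes (held constant here; an assumption)
    """
    nyquist = sample_rate / 2.0
    phase = 0.0  # running phase of the fundamental, in radians
    out = []
    for f in f0:
        # Accumulate instantaneous frequency so f0 changes stay phase-continuous.
        phase = (phase + 2.0 * math.pi * f / sample_rate) % (2.0 * math.pi)
        sample = 0.0
        if f > 0.0:
            for k, a in enumerate(amps, start=1):
                if k * f < nyquist:  # skip harmonics that would alias
                    sample += a * math.sin(k * phase)
        out.append(sample)
    return out


def noise_synth(gains, seed=0):
    """Aperiodic part: white noise in [-1, 1] scaled by a per-sample gain.

    A simplification: no spectral shaping is applied to the noise.
    """
    rng = random.Random(seed)
    return [g * rng.uniform(-1.0, 1.0) for g in gains]


def dsp_synth(f0, amps, gains, sample_rate, seed=0):
    """Add the periodic and aperiodic signals sample by sample."""
    harmonic = harmonic_synth(f0, amps, sample_rate)
    noise = noise_synth(gains, seed)
    return [h + n for h, n in zip(harmonic, noise)]


if __name__ == "__main__":
    sr = 44100
    n = sr // 100  # 10 ms of audio
    wave = dsp_synth([220.0] * n, [0.5, 0.25], [0.01] * n, sr)
    print(len(wave), max(abs(x) for x in wave))
```

Because the phase is accumulated from f0 rather than predicted, a signal built this way carries no free phase parameters, which is the intuition behind conditioning the waveform decoder on it.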