Diffusion-based singing voice conversion (SVC) models achieve better
synthesis quality than traditional methods. However, in cross-domain SVC
scenarios, where the pitch of the source and target voice domains differs
significantly, these models tend to generate hoarse audio, which makes
high-quality vocal output difficult to achieve. Therefore, in this
paper, we propose a Self-supervised Pitch Augmentation method for Singing Voice
Conversion (SPA-SVC), which can enhance the voice quality in SVC tasks without
requiring additional data or increasing model parameters. We
introduce a cycle pitch shifting training strategy and a Structural Similarity
Index (SSIM) loss into our SVC model, both of which improve its performance.
Experimental results on the public singing dataset M4Singer indicate that our
proposed method significantly improves model performance in general SVC
scenarios and, in particular, in cross-domain SVC scenarios.

Comment: Accepted by Interspeech 202