Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism
This paper proposes a novel sequence-to-sequence (seq2seq) model with a
musical note position-aware attention mechanism for singing voice synthesis
(SVS). A seq2seq modeling approach that can simultaneously perform acoustic and
temporal modeling is attractive. However, due to the difficulty of the temporal
modeling of singing voices, many recent SVS systems with an
encoder-decoder-based model still rely explicitly on duration information
generated by additional modules. Although some studies perform simultaneous
modeling using seq2seq models with an attention mechanism, they lack
robustness in temporal modeling. The proposed attention
mechanism is designed to estimate the attention weights by considering the
rhythm given by the musical score. Furthermore, several techniques are also
introduced to improve the modeling performance of the singing voice.
Experimental results indicated that the proposed model is effective in terms of
both naturalness and robustness of timing.
Comment: 5 pages, 4 figures, 2 tables, submitted to ICASSP 202
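The core idea of conditioning attention on the musical score can be sketched as a rhythm prior added to the usual content-based attention energies. The function name, the Gaussian prior, and all parameters below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def note_position_aware_attention(energies, note_onsets, frame_times, sigma=0.05):
    """Bias content-based attention energies with a score-derived rhythm prior.

    energies: (T_dec, T_enc) raw attention scores from the decoder.
    note_onsets: (T_enc,) onset time (seconds) of each encoder step, taken
        from the musical score (an assumed representation).
    frame_times: (T_dec,) time stamp of each decoder frame.
    sigma: width of the Gaussian rhythm prior (hypothetical hyperparameter).
    """
    # Decoder frame t should attend near encoder positions whose note onset
    # matches its time; a Gaussian log-prior encodes that expectation.
    diff = frame_times[:, None] - note_onsets[None, :]      # (T_dec, T_enc)
    log_prior = -0.5 * (diff / sigma) ** 2
    biased = energies + log_prior
    # Softmax over encoder positions yields the attention weights.
    w = np.exp(biased - biased.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)
```

With zero content energies, the prior alone pulls each frame's attention toward the encoder position whose note onset is closest in time, which is the kind of timing robustness the abstract claims.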
BiSinger: Bilingual Singing Voice Synthesis
Although Singing Voice Synthesis (SVS) has made great strides with
Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains
relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system
for English and Chinese Mandarin. Current systems require separate models per
language and cannot accurately represent both Chinese and English, hindering
code-switch SVS. To address this gap, we design a shared representation between
Chinese and English singing voices, achieved by using the CMU dictionary with
mapping rules. We fuse monolingual singing datasets with open-source singing
voice conversion techniques to generate bilingual singing voices while also
exploring the potential use of bilingual speech data. Experiments affirm that
our language-independent representation and incorporation of related datasets
enable a single model with enhanced performance in English and code-switch SVS
while maintaining Chinese song performance. Audio samples are available at
https://bisinger-svs.github.io.
Comment: Accepted by ASRU202
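The shared representation between Chinese and English lyrics amounts to mapping Mandarin syllables onto CMU-dictionary phones via hand-written rules. The rule table and function below are a minimal illustrative sketch; BiSinger's actual mapping rules are not reproduced here:

```python
# Hypothetical pinyin-to-CMU mapping rules (illustrative only).
PINYIN_TO_CMU = {
    "zh": ["JH"], "ch": ["CH"], "sh": ["SH"],
    "ang": ["AA", "NG"], "ao": ["AW"], "i": ["IY"],
}

def pinyin_to_shared_phones(syllable_parts):
    """Map a decomposed Mandarin syllable (initial, final) to CMU phones,
    giving Chinese and English lyrics one shared phoneme inventory."""
    phones = []
    for part in syllable_parts:
        # Fall back to the uppercased symbol when no rule applies.
        phones.extend(PINYIN_TO_CMU.get(part, [part.upper()]))
    return phones
```

Because both languages then share one phone set, a single acoustic model can cover English, Mandarin, and code-switched lyrics without per-language front ends.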
Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information
This paper presents an end-to-end high-quality singing voice synthesis (SVS)
system that uses bidirectional encoder representation from Transformers (BERT)
derived semantic embeddings to improve the expressiveness of the synthesized
singing voice. Based on the main architecture of recently proposed VISinger, we
put forward several specific designs for expressive singing voice synthesis.
First, unlike previous SVS models, we use a text representation of the
lyrics extracted from pre-trained BERT as an additional input to the model.
This representation captures the semantics of the lyrics, which can help the
SVS system produce a more expressive and natural voice. Second, we further
introduce an energy predictor to stabilize the synthesized voice and to model
the wide range of energy variations that also contribute to the expressiveness
of the singing voice. Last but not least, to attenuate off-key issues, the
pitch predictor is redesigned to predict the ratio of the real pitch to the
note pitch. Both objective and subjective experimental results indicate that
the proposed SVS system produces higher-quality singing voices, outperforming
VISinger.
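Predicting the real-to-note pitch ratio rather than absolute pitch keeps the model's output anchored to the score. A minimal sketch of how such a predicted ratio could be applied at synthesis time, assuming a log-ratio parameterization and a deviation clamp (both hypothetical details not stated in the abstract):

```python
import numpy as np

def apply_pitch_ratio(note_f0, predicted_log_ratio, max_dev_semitones=2.0):
    """Recover frame-level F0 from the score pitch and a predicted ratio.

    note_f0: (T,) frame-wise pitch (Hz) implied by the musical score.
    predicted_log_ratio: (T,) model output, log2 of (real F0 / note F0).
    max_dev_semitones: clamp on deviations, an assumed safeguard against
        off-key drift.
    """
    max_log = max_dev_semitones / 12.0   # semitones -> log2 ratio
    clipped = np.clip(predicted_log_ratio, -max_log, max_log)
    # A log-ratio of 0 reproduces the score pitch exactly.
    return note_f0 * (2.0 ** clipped)
```

Since the predicted quantity is centred around 1 (log-ratio 0), even a poorly trained predictor degrades toward the in-tune score pitch instead of an arbitrary frequency, which is one plausible reading of how this design attenuates off-key issues.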