575 research outputs found
Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation
This paper proposes an expressive singing voice synthesis system that
introduces explicit vibrato modeling and a latent energy representation.
Vibrato is essential to the naturalness of synthesized singing, owing to the
inherent characteristics of the human voice. Hence, a deep learning-based
vibrato model is introduced to control the likeliness, rate, depth, and phase
of vibrato, where vibrato likeliness represents the probability that vibrato is
present and helps improve the naturalness of the singing voice. However,
existing singing corpora contain no annotated vibrato-likeliness labels, so we
adopt a novel labeling method that annotates vibrato likeliness automatically.
Meanwhile, the power spectrogram of audio contains rich information that can
improve the expressiveness of singing, and an autoencoder-based latent energy
bottleneck feature is proposed for expressive singing voice synthesis.
Experimental results on the open dataset NUS48E show that both the vibrato
modeling and the latent energy representation significantly improve the
expressiveness of the singing voice. Audio samples are available on the demo
website.
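The four vibrato controls named in the abstract (likeliness, rate, depth, phase) can be illustrated with a minimal sketch that superimposes a sinusoidal pitch modulation on an F0 contour. The function name, parameter names, and frame rate below are illustrative assumptions, not the paper's actual model (which is learned, not rule-based):

```python
import math

def apply_vibrato(f0, frame_rate=100, rate_hz=5.5, depth_cents=50.0,
                  phase=0.0, likeliness=1.0):
    """Superimpose sinusoidal vibrato on an F0 contour given in Hz.

    `rate_hz`, `depth_cents`, `phase`, and `likeliness` mirror the four
    controls described in the abstract; `likeliness` scales the modulation
    depth, acting as a soft on/off probability for vibrato. This is a
    hand-written sketch, not the deep learning model from the paper.
    """
    out = []
    for n, f in enumerate(f0):
        t = n / frame_rate
        cents = likeliness * depth_cents * math.sin(
            2 * math.pi * rate_hz * t + phase)
        out.append(f * 2 ** (cents / 1200))  # convert cents to a freq ratio
    return out

flat = [220.0] * 200                      # 2 s of steady A3 at 100 frames/s
wobbly = apply_vibrato(flat)              # modulated contour
still = apply_vibrato(flat, likeliness=0.0)  # likeliness 0 disables vibrato
```

With `likeliness=0.0` the contour is returned unchanged, which is the behavior the existence-probability control is meant to capture.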
Pan European Voice Conference - PEVOC 11
The Pan European VOice Conference (PEVOC) was born in 1995, and in 2015 it therefore celebrates the 20th anniversary of its establishment: an important milestone that clearly expresses the strength and interest of the scientific community in the topics of this conference. The most significant themes of PEVOC are singing pedagogy and art, but also occupational voice disorders, neurology, rehabilitation, and image and video analysis. PEVOC takes place in a different European city every two years (www.pevoc.org). The PEVOC 11 conference includes a symposium of the Collegium Medicorum Theatri (www.cometcollegium.com).
Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation
Singing voice detection is the task of identifying which frames of a recording
contain the singer's vocals. It has been one of the main components of music
information retrieval (MIR) and is applicable to melody extraction, artist
recognition, and music discovery in popular music. Although several methods
have been proposed, a more robust and more complete system is still desired to
improve detection performance. In this paper, our motivation is to provide an
extensive comparison of the different stages of singing voice detection. Based
on this analysis, a novel method is proposed to build a more efficient singing
voice detection system. The proposed system has three main parts. The first is
a singing voice separation pre-process that extracts the vocals from the
musical accompaniment; several singing voice separation methods were compared
to decide which one to integrate into the detection system. The second is a
deep neural network-based classifier that labels the given frames; different
deep models for classification were also compared. The last is a post-process
that filters out anomalous frames in the classifier's predictions, for which a
median filter and a Hidden Markov Model (HMM)-based filter were compared.
Through this step-by-step module extension, the different methods were compared
and analyzed. Finally, classification performance on two public datasets
indicates that the proposed approach, based on the Long-term Recurrent
Convolutional Networks (LRCN) model, is a promising alternative.
Comment: 15 pages
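The median-filter post-process compared in this abstract can be sketched in a few lines: a sliding median over the classifier's frame-wise vocal/non-vocal decisions removes isolated spurious frames. The window size and function name below are illustrative choices, not values from the paper:

```python
def median_smooth(frames, k=5):
    """Median-filter a binary vocal/non-vocal frame sequence.

    `frames` is a list of 0/1 frame decisions from the classifier; `k` is
    an odd window length (truncated at the sequence edges). Isolated
    anomalous frames shorter than the window's majority are flipped.
    """
    half = k // 2
    out = []
    for i in range(len(frames)):
        win = frames[max(0, i - half): i + half + 1]
        out.append(sorted(win)[len(win) // 2])  # median of the window
    return out

# A lone vocal frame at index 2 and a lone gap at index 9 get smoothed away.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
smoothed = median_smooth(raw)
```

An HMM-based post-process, the other option the paper compares, would instead decode the most likely state sequence under explicit transition probabilities, which penalizes rapid vocal/non-vocal switching more globally than a fixed window can.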
Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies
Automatic singing voice understanding tasks, such as singer identification,
singing voice transcription, and singing technique classification, benefit from
data-driven approaches that utilize deep learning techniques. These approaches
work well even under the rich diversity of vocal and noisy samples owing to
their representation ability. However, the limited availability of labeled data
remains a significant obstacle to achieving satisfactory performance. In recent
years, self-supervised learning (SSL) models have been trained on large amounts
of unlabeled data in the fields of speech processing and music classification;
by fine-tuning these models for a target task, performance comparable to
conventional supervised learning can be achieved with limited training data.
Therefore, in this paper, we investigate the effectiveness of SSL models for
various singing voice understanding tasks. As an initial exploration, we report
and discuss the results of experiments comparing SSL models on three different
tasks (i.e., singer identification, singing voice transcription, and singing
technique classification). Experimental results show that each SSL model
achieves performance comparable to, and sometimes better than, state-of-the-art
methods on each task. We also conducted a layer-wise analysis to further
understand the behavior of the SSL models.
Comment: Submitted to APSIPA 202
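A common recipe for the kind of layer-wise analysis mentioned at the end is to combine the per-layer features of an SSL frontend with a learned softmax-weighted sum and then inspect the learned weights. The sketch below shows only that mixing step on toy vectors; the function name and shapes are assumptions for illustration, not the paper's setup:

```python
import math

def weighted_layer_sum(layer_feats, weights):
    """Mix per-layer feature vectors with softmax-normalized scalar weights.

    `layer_feats` is a list of equal-length feature vectors, one per SSL
    layer; `weights` holds one learnable scalar per layer. Returns the
    mixed vector and the normalized layer weights, whose magnitudes show
    which layers the downstream task relies on.
    """
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    alphas = [e / z for e in exps]          # softmax over layer weights
    dim = len(layer_feats[0])
    mixed = [sum(a * feat[d] for a, feat in zip(alphas, layer_feats))
             for d in range(dim)]
    return mixed, alphas

# Two toy "layers" with equal weights: the mixture is simply their mean.
mixed, alphas = weighted_layer_sum([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

In practice the scalar weights are trained jointly with the downstream head, and plotting the resulting `alphas` per task is what reveals, for example, whether transcription prefers different layers than singer identification.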