DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System
Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker are much more difficult to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high-quality singing in the target speaker's voice using only his/her normal speech data. First, we integrate the training and conversion processes of speech and singing into one framework by unifying the features used in standard speech synthesis and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust, especially when the singing database is small. Moreover, to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identity information during conversion. Experiments indicate that the proposed singing voice conversion system can convert source singing into the target speaker's high-quality singing with only 20 seconds of the target speaker's enrollment speech data.
Comment: Accepted by Interspeech 202
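As a hedged illustration of the one-shot enrollment step, the sketch below averages per-frame vectors from a hypothetical pretrained speaker encoder into a single embedding; the paper does not specify this exact computation, and all names here are assumptions.

```python
import numpy as np

def enrollment_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average per-frame speaker vectors extracted from ~20 s of
    enrollment speech into one L2-normalized speaker embedding
    (a sketch, not the paper's specified module)."""
    emb = frame_embeddings.mean(axis=0)
    return emb / np.linalg.norm(emb)

# Hypothetical usage, assuming `encoder` maps audio to (n_frames, d) vectors:
# frames = encoder(enrollment_audio)
# spk = enrollment_embedding(frames)   # conditions the conversion model
```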
Modeling Singing F0 With Neural Network Driven Transition-Sustain Models
This study focuses on generating fundamental frequency (F0) curves of singing
voice from musical scores stored in a MIDI-like notation. Current statistical
parametric approaches to singing F0 modeling meet difficulties in reproducing
vibratos and the temporal details at note boundaries due to the oversmoothing
tendency of statistical models. This paper presents a neural network based
solution that models a pair of neighboring notes at a time (the transition
model) and uses a separate network for generating vibratos (the sustain model).
Predictions from the two models are combined by summation after proper
enveloping to enforce continuity. In the training phase, mild misalignment
between the scores and the target F0 is addressed by back-propagating the
gradients to the networks' inputs. Subjective listening tests on the NITech
singing database show that transition-sustain models are able to generate F0
trajectories close to the original performance.
Comment: 5 pages, 5 figures
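A minimal sketch of the combination step described above, assuming a raised-cosine envelope and a fade length chosen here for illustration:

```python
import numpy as np

def combine_f0(transition_f0, vibrato, fade_frames=20):
    """Sum the transition model's note-to-note F0 contour with the
    sustain model's vibrato after enveloping, so the combined curve
    stays continuous at note boundaries."""
    env = np.ones_like(vibrato)
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, fade_frames)))
    env[:fade_frames] *= ramp          # fade vibrato in at the onset
    env[-fade_frames:] *= ramp[::-1]   # fade vibrato out at the offset
    return transition_f0 + env * vibrato
```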
A Comparative Study of Pitch Extraction Algorithms on a Large Variety of Singing Sounds
The problem of pitch tracking has been extensively studied in the speech
research community. The goal of this paper is to investigate how these
techniques should be adapted to singing voice analysis, and to provide a
comparative evaluation of the most representative state-of-the-art approaches.
This study is carried out on a large database of annotated singing sounds with
aligned EGG recordings, comprising a variety of singer categories and singing
exercises. The algorithmic performance is assessed according to the ability to
detect voicing boundaries and to accurately estimate pitch contour. First, we
evaluate the usefulness of adapting existing methods to singing voice analysis.
Then we compare the accuracy of several pitch-extraction algorithms, depending
on singer category and laryngeal mechanism. Finally, we analyze their
robustness to reverberation.
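For context, voicing and pitch accuracy are commonly scored with metrics like the ones sketched below; the paper's exact criteria may differ, and the 20% gross-error tolerance is a conventional choice, not taken from the study.

```python
import numpy as np

def pitch_metrics(ref_f0, est_f0, gross_tol=0.2):
    """Voicing decision error (VDE) and gross pitch error (GPE) for
    frame-wise F0 arrays, where 0 marks an unvoiced frame."""
    ref_v, est_v = ref_f0 > 0, est_f0 > 0
    vde = np.mean(ref_v != est_v)            # voicing boundary mistakes
    both = ref_v & est_v
    rel_err = np.abs(est_f0[both] - ref_f0[both]) / ref_f0[both]
    gpe = np.mean(rel_err > gross_tol)       # badly estimated voiced frames
    return vde, gpe
```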
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting due to data scarcity. However, we often encounter data limitations in building SVS systems because of high data acquisition and annotation costs. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in objective and subjective evaluations.
Comment: Accepted by ICASSP202
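The paper derives its PE loss from a full psycho-acoustic hearing model; the toy regularizer below only illustrates the underlying idea of weighting spectral errors by audibility, and the weighting scheme is an assumption, not the paper's formula.

```python
import torch

def perceptually_weighted_loss(pred_spec, target_spec, masking_threshold, eps=1e-8):
    """Down-weight spectral errors that fall under a high masking
    threshold (inaudible) and emphasize audible ones (a sketch)."""
    weight = 1.0 / (masking_threshold + eps)
    return torch.mean(weight * (pred_spec - target_spec) ** 2)
```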
Toward Faultless Content-Based Playlists Generation for Instrumentals
This study deals with content-based musical playlist generation focused on Songs and Instrumentals. Automatic playlist generation relies on collaborative filtering and autotagging algorithms. Autotagging can solve the cold-start issue and popularity bias that are critical in music recommender systems. However, autotagging remains to be improved and cannot yet generate satisfying music playlists. In this paper, we suggest improvements toward better autotagging-generated playlists compared to the state of the art. To assess our method, we focus on the Song and Instrumental tags. Song and Instrumental are two objective and opposite tags that are under-studied compared to genres or moods, which are subjective and multi-modal tags. We consider an industrial real-world musical database that is unevenly distributed between Songs and Instrumentals and bigger than the databases used in previous studies. We set up three incremental experiments to enhance automatic playlist generation. Our suggested approach generates an Instrumental playlist with up to three times fewer false positives than cutting-edge methods. Moreover, we provide a design-of-experiments framework to foster research on Songs and Instrumentals. We give insight into how to further improve the quality of generated playlists and how to extend our methods to other musical tags. Furthermore, we provide the source code to guarantee reproducible research.
Comment: single-column 20 pages, 3 figures, 6 tables
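One simple way such an autotagging-based playlist can be generated, sketched under the assumption of a classifier returning P(Instrumental) per track (the paper's actual models and threshold are not reproduced here):

```python
def instrumental_playlist(tracks, p_instrumental, threshold=0.95):
    """Keep only tracks the tagger is highly confident are instrumental;
    a high threshold trades recall for fewer false positives, i.e.
    fewer sung tracks slipping into the Instrumental playlist."""
    return [t for t in tracks if p_instrumental(t) >= threshold]
```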
Deep Learning for Singing Processing: Achievements, Challenges and Impact on Singers and Listeners
This paper summarizes some recent advances on a set of tasks related to the
processing of singing using state-of-the-art deep learning techniques. We
discuss their achievements in terms of accuracy and sound quality, and the
current challenges, such as availability of data and computing resources. We
also discuss the impact that these advances have, and will have, on listeners and singers as they are integrated into commercial applications.
Comment: Keynote speech, 2018 Joint Workshop on Machine Learning for Music; The Federated Artificial Intelligence Meeting (FAIM), a joint workshop program of ICML, IJCAI/ECAI, and AAMA
Towards Fine-Grained Prosody Control for Voice Conversion
In a typical voice conversion system, prior works utilize various acoustic features (e.g., pitch, voiced/unvoiced flags, aperiodicity) of the source speech to control the prosody of the generated waveform. However, prosody is related to many factors, such as intonation, stress, and rhythm, and it is challenging to describe it perfectly through acoustic features. To deal with this problem, we propose prosody embeddings to model prosody. These embeddings are learned from the source speech in an unsupervised manner. We conduct experiments on our Mandarin corpus recorded by professional speakers. Experimental results demonstrate that the proposed method enables fine-grained control of the prosody. In challenging situations (such as when the source speech is sung), our proposed method also achieves promising results.
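A minimal sketch of the prosody-embedding idea, assuming a recurrent encoder with a narrow bottleneck; the layer types and sizes are illustrative, not the paper's architecture:

```python
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Encode source-speech mel frames into low-dimensional per-frame
    prosody embeddings; the narrow bottleneck is meant to keep prosodic
    detail (intonation, stress, rhythm) while discouraging timbre leakage."""
    def __init__(self, n_mels=80, hidden=128, prosody_dim=3):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, prosody_dim)

    def forward(self, mel):            # mel: (batch, frames, n_mels)
        h, _ = self.rnn(mel)
        return self.bottleneck(h)      # (batch, frames, prosody_dim)
```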
An Extensive Analysis of Query by Singing/Humming System Through Query Proportion
Query by Singing/Humming (QBSH) is a Music Information Retrieval (MIR) approach that takes a small audio excerpt as the query. The rising availability of digital music calls for effective music retrieval methods. Further, MIR systems support content-based searching for music and require no musical expertise from the user. Current work on QBSH focuses mainly on melody features such as pitch, rhythm, and notes, as well as database size, response time, score matching, and search algorithms. Even though a variety of QBSH techniques have been proposed, there is a dearth of work analyzing QBSH through query excerpts. Here, we present such an analysis. To substantiate it, a series of experiments is conducted with Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), and Linear Predictive Cepstral Coefficients (LPCC) to portray the robustness of the knowledge representation. The experiments reveal that retrieval performance, as well as precision, diminishes only gradually as the database size grows.
Comment: 14 pages, 11 figures; The International Journal of Multimedia & Its Applications (IJMA) Vol. 4, No. 6, December 2012. arXiv admin note: text overlap with arXiv:1003.4083 by other authors
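As a rough illustration of one front end from the study, the sketch below scores a query against a candidate with MFCC features and dynamic time warping (the LPC/LPCC variants and the paper's actual matcher are omitted):

```python
import librosa

def qbsh_score(query_path, candidate_path, sr=16000, n_mfcc=13):
    """Lower DTW cost means a better melodic match (a generic sketch,
    not the paper's retrieval pipeline)."""
    q, _ = librosa.load(query_path, sr=sr)
    c, _ = librosa.load(candidate_path, sr=sr)
    q_mfcc = librosa.feature.mfcc(y=q, sr=sr, n_mfcc=n_mfcc)
    c_mfcc = librosa.feature.mfcc(y=c, sr=sr, n_mfcc=n_mfcc)
    cost, path = librosa.sequence.dtw(q_mfcc, c_mfcc)
    return cost[-1, -1] / len(path)    # path-length-normalized match cost
```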
Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation
State-of-the-art singing voice separation is based on deep learning, making use of CNN structures with skip connections (such as the U-Net, Wave-U-Net, or MMDenseLSTM models). A key to the success of these models is the availability of a large amount of training data. In the following study, we are interested in singing voice separation for mono signals and compare the U-Net and the Wave-U-Net, which are structurally similar but work on different input representations. First, we report a few results on variations of the U-Net model. Second, we discuss the potential of state-of-the-art speech and music transformation algorithms for augmenting existing data sets and demonstrate that the effect of these augmentations depends on the signal representation used by the model. The results demonstrate a considerable improvement due to the augmentation for both models, but pitch transposition is the most effective augmentation strategy for the U-Net model, while transposition, time stretching, and formant shifting have a much more balanced effect on the Wave-U-Net model. Finally, we compare the two models on the same dataset.
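Two of the augmentation strategies compared in the study can be sketched with off-the-shelf tools; the shift and stretch amounts below are illustrative, and formant shifting (also studied) would require a vocoder-style transformation not shown here:

```python
import librosa

def augment(y, sr, n_steps=2.0, rate=1.1):
    """Generate pitch-transposed and time-stretched copies of a
    training signal for singing voice separation."""
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    stretched = librosa.effects.time_stretch(y, rate=rate)
    return shifted, stretched
```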
Singing voice synthesis based on convolutional neural networks
The present paper describes a singing voice synthesis technique based on convolutional
neural networks (CNNs). Singing voice synthesis systems based on deep neural
networks (DNNs) are currently being proposed and are improving the naturalness
of synthesized singing voices. In these systems, the relationship between
musical score feature sequences and acoustic feature sequences extracted from
singing voices is modeled by DNNs. Then, an acoustic feature sequence of an
arbitrary musical score is output in units of frames by the trained DNNs, and a
natural trajectory of a singing voice is obtained by using a parameter
generation algorithm. As singing voices contain rich expression, a powerful
technique to model them accurately is required. In the proposed technique,
long-term dependencies of singing voices are modeled by CNNs. An acoustic
feature sequence is generated in units of segments that consist of long-term
frames, and a natural trajectory is obtained without the parameter generation
algorithm. Experimental results from a subjective listening test show that the proposed architecture can synthesize natural-sounding singing voices.
Comment: Singing voice samples (Japanese, English, Chinese): https://www.techno-speech.com/news-20181214a-e
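A minimal sketch of the segment-level CNN idea, assuming arbitrary channel sizes and feature dimensions (the paper's exact architecture is not reproduced): a whole segment of score features is mapped to the corresponding segment of acoustic features in one pass, so no frame-wise parameter generation algorithm is needed.

```python
import torch.nn as nn

class SegmentCNN(nn.Module):
    """Map a segment of musical-score features to a segment of acoustic
    features with a stack of 1-D convolutions whose receptive field
    spans many frames, capturing long-term dependencies."""
    def __init__(self, score_dim=64, acoustic_dim=180, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(score_dim, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, acoustic_dim, kernel_size=7, padding=3),
        )

    def forward(self, score):          # score: (batch, score_dim, frames)
        return self.net(score)         # (batch, acoustic_dim, frames)
```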