DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System
Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker are much more difficult to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high-quality singing in the target speaker's voice using only his/her normal speech data. First, we integrate the training and conversion processes of speech and singing into one framework by unifying the features used in standard speech synthesis and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust, especially when the singing database is small. Moreover, to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identity information during conversion. Experiments indicate that the proposed singing voice conversion system can convert source singing into the target speaker's high-quality singing with only 20 seconds of the target speaker's enrollment speech data.
Comment: Accepted by Interspeech 202
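As a hedged illustration of the one-shot enrollment step, the sketch below averages per-frame vectors from a hypothetical pretrained speaker encoder into a single embedding; the paper does not specify this exact computation, and all names here are assumptions.

```python
import numpy as np

def enrollment_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average per-frame speaker vectors extracted from ~20 s of
    enrollment speech into one L2-normalized speaker embedding
    (a sketch, not the paper's specified module)."""
    emb = frame_embeddings.mean(axis=0)
    return emb / np.linalg.norm(emb)

# Hypothetical usage, assuming `encoder` maps audio to (n_frames, d) vectors:
# frames = encoder(enrollment_audio)
# spk = enrollment_embedding(frames)   # conditions the conversion model
```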
Modeling Singing F0 With Neural Network Driven Transition-Sustain Models
This study focuses on generating fundamental frequency (F0) curves of singing
voice from musical scores stored in a MIDI-like notation. Current statistical
parametric approaches to singing F0 modeling meet difficulties in reproducing
vibratos and the temporal details at note boundaries due to the oversmoothing
tendency of statistical models. This paper presents a neural network based
solution that models a pair of neighboring notes at a time (the transition
model) and uses a separate network for generating vibratos (the sustain model).
Predictions from the two models are combined by summation after proper
enveloping to enforce continuity. In the training phase, mild misalignment
between the scores and the target F0 is addressed by back-propagating the
gradients to the networks' inputs. Subjective listening tests on the NITech
singing database show that transition-sustain models are able to generate F0
trajectories close to the original performance.
Comment: 5 pages, 5 figures
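A minimal sketch of the combination step described above, assuming a raised-cosine envelope and a fade length chosen here for illustration:

```python
import numpy as np

def combine_f0(transition_f0, vibrato, fade_frames=20):
    """Sum the transition model's note-to-note F0 contour with the
    sustain model's vibrato after enveloping, so the combined curve
    stays continuous at note boundaries."""
    env = np.ones_like(vibrato)
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, fade_frames)))
    env[:fade_frames] *= ramp          # fade vibrato in at the onset
    env[-fade_frames:] *= ramp[::-1]   # fade vibrato out at the offset
    return transition_f0 + env * vibrato
```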
A Comparative Study of Pitch Extraction Algorithms on a Large Variety of Singing Sounds
The problem of pitch tracking has been extensively studied in the speech
research community. The goal of this paper is to investigate how these
techniques should be adapted to singing voice analysis, and to provide a
comparative evaluation of the most representative state-of-the-art approaches.
This study is carried out on a large database of annotated singing sounds with
aligned EGG recordings, comprising a variety of singer categories and singing
exercises. The algorithmic performance is assessed according to the ability to
detect voicing boundaries and to accurately estimate pitch contour. First, we
evaluate the usefulness of adapting existing methods to singing voice analysis.
Then we compare the accuracy of several pitch-extraction algorithms, depending
on singer category and laryngeal mechanism. Finally, we analyze their
robustness to reverberation.
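For context, voicing and pitch accuracy are commonly scored with metrics like the ones sketched below; the paper's exact criteria may differ, and the 20% gross-error tolerance is a conventional choice, not taken from the study.

```python
import numpy as np

def pitch_metrics(ref_f0, est_f0, gross_tol=0.2):
    """Voicing decision error (VDE) and gross pitch error (GPE) for
    frame-wise F0 arrays, where 0 marks an unvoiced frame."""
    ref_v, est_v = ref_f0 > 0, est_f0 > 0
    vde = np.mean(ref_v != est_v)            # voicing boundary mistakes
    both = ref_v & est_v
    rel_err = np.abs(est_f0[both] - ref_f0[both]) / ref_f0[both]
    gpe = np.mean(rel_err > gross_tol)       # badly estimated voiced frames
    return vde, gpe
```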
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting due to data scarcity. However, we often encounter data limitations in building SVS systems because of high data acquisition and annotation costs. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in objective and subjective evaluations.
Comment: Accepted by ICASSP202
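The paper derives its PE loss from a full psycho-acoustic hearing model; the toy regularizer below only illustrates the underlying idea of weighting spectral errors by audibility, and the weighting scheme is an assumption, not the paper's formula.

```python
import torch

def perceptually_weighted_loss(pred_spec, target_spec, masking_threshold, eps=1e-8):
    """Down-weight spectral errors that fall under a high masking
    threshold (inaudible) and emphasize audible ones (a sketch)."""
    weight = 1.0 / (masking_threshold + eps)
    return torch.mean(weight * (pred_spec - target_spec) ** 2)
```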
Toward Faultless Content-Based Playlists Generation for Instrumentals
This study deals with content-based musical playlist generation focused on Songs and Instrumentals. Automatic playlist generation relies on collaborative filtering and autotagging algorithms. Autotagging can solve the cold-start issue and popularity bias that are critical in music recommender systems. However, autotagging remains to be improved and cannot yet generate satisfying music playlists. In this paper, we suggest improvements toward better autotagging-generated playlists compared to the state of the art. To assess our method, we focus on the Song and Instrumental tags. Song and Instrumental are two objective and opposite tags that are under-studied compared to genres or moods, which are subjective and multi-modal tags. We consider an industrial real-world musical database that is unevenly distributed between Songs and Instrumentals and bigger than the databases used in previous studies. We set up three incremental experiments to enhance automatic playlist generation. Our suggested approach generates an Instrumental playlist with up to three times fewer false positives than cutting-edge methods. Moreover, we provide a design-of-experiments framework to foster research on Songs and Instrumentals. We give insight into how to further improve the quality of generated playlists and how to extend our methods to other musical tags. Furthermore, we provide the source code to guarantee reproducible research.
Comment: single-column 20 pages, 3 figures, 6 tables
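One simple way such an autotagging-based playlist can be generated, sketched under the assumption of a classifier returning P(Instrumental) per track (the paper's actual models and threshold are not reproduced here):

```python
def instrumental_playlist(tracks, p_instrumental, threshold=0.95):
    """Keep only tracks the tagger is highly confident are instrumental;
    a high threshold trades recall for fewer false positives, i.e.
    fewer sung tracks slipping into the Instrumental playlist."""
    return [t for t in tracks if p_instrumental(t) >= threshold]
```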
Deep Learning for Singing Processing: Achievements, Challenges and Impact on Singers and Listeners
This paper summarizes some recent advances on a set of tasks related to the
processing of singing using state-of-the-art deep learning techniques. We
discuss their achievements in terms of accuracy and sound quality, and the
current challenges, such as availability of data and computing resources. We
also discuss the impact that these advances have, and will have, on listeners and singers as they are integrated into commercial applications.
Comment: Keynote speech, 2018 Joint Workshop on Machine Learning for Music; The Federated Artificial Intelligence Meeting (FAIM), a joint workshop program of ICML, IJCAI/ECAI, and AAMA
Towards Fine-Grained Prosody Control for Voice Conversion
In a typical voice conversion system, prior works utilize various acoustic features (e.g., pitch, voiced/unvoiced flags, aperiodicity) of the source speech to control the prosody of the generated waveform. However, prosody is related to many factors, such as intonation, stress, and rhythm, and it is challenging to describe it perfectly through acoustic features. To deal with this problem, we propose prosody embeddings to model prosody. These embeddings are learned from the source speech in an unsupervised manner. We conduct experiments on our Mandarin corpus recorded by professional speakers. Experimental results demonstrate that the proposed method enables fine-grained control of the prosody. In challenging situations (such as when the source speech is sung), our proposed method also achieves promising results.
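A minimal sketch of the prosody-embedding idea, assuming a recurrent encoder with a narrow bottleneck; the layer types and sizes are illustrative, not the paper's architecture:

```python
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Encode source-speech mel frames into low-dimensional per-frame
    prosody embeddings; the narrow bottleneck is meant to keep prosodic
    detail (intonation, stress, rhythm) while discouraging timbre leakage."""
    def __init__(self, n_mels=80, hidden=128, prosody_dim=3):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, prosody_dim)

    def forward(self, mel):            # mel: (batch, frames, n_mels)
        h, _ = self.rnn(mel)
        return self.bottleneck(h)      # (batch, frames, prosody_dim)
```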
An Extensive Analysis of Query by Singing/Humming System Through Query Proportion
Query by Singing/Humming (QBSH) is a Music Information Retrieval (MIR) approach that takes a small audio excerpt as the query. The rising availability of digital music calls for effective music retrieval methods. Further, MIR systems support content-based searching for music and require no musical expertise from the user. Current work on QBSH focuses mainly on melody features such as pitch, rhythm, and notes, as well as database size, response time, score matching, and search algorithms. Even though a variety of QBSH techniques have been proposed, there is a dearth of work analyzing QBSH through query excerpts. Here, we present such an analysis. To substantiate it, a series of experiments is conducted with Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), and Linear Predictive Cepstral Coefficients (LPCC) to portray the robustness of the knowledge representation. The experiments reveal that retrieval performance, as well as precision, diminishes only gradually as the database size grows.
Comment: 14 pages, 11 figures; The International Journal of Multimedia & Its Applications (IJMA) Vol. 4, No. 6, December 2012. arXiv admin note: text overlap with arXiv:1003.4083 by other authors
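As a rough illustration of one front end from the study, the sketch below scores a query against a candidate with MFCC features and dynamic time warping (the LPC/LPCC variants and the paper's actual matcher are omitted):

```python
import librosa

def qbsh_score(query_path, candidate_path, sr=16000, n_mfcc=13):
    """Lower DTW cost means a better melodic match (a generic sketch,
    not the paper's retrieval pipeline)."""
    q, _ = librosa.load(query_path, sr=sr)
    c, _ = librosa.load(candidate_path, sr=sr)
    q_mfcc = librosa.feature.mfcc(y=q, sr=sr, n_mfcc=n_mfcc)
    c_mfcc = librosa.feature.mfcc(y=c, sr=sr, n_mfcc=n_mfcc)
    cost, path = librosa.sequence.dtw(q_mfcc, c_mfcc)
    return cost[-1, -1] / len(path)    # path-length-normalized match cost
```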
Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation
State-of-the-art singing voice separation is based on deep learning, making use of CNN structures with skip connections (such as the U-Net, Wave-U-Net, or MMDenseLSTM models). A key to the success of these models is the availability of a large amount of training data. In the following study, we are interested in singing voice separation for mono signals and compare the U-Net and the Wave-U-Net, which are structurally similar but work on different input representations. First, we report a few results on variations of the U-Net model. Second, we discuss the potential of state-of-the-art speech and music transformation algorithms for augmenting existing data sets and demonstrate that the effect of these augmentations depends on the signal representation used by the model. The results demonstrate a considerable improvement due to the augmentation for both models, but pitch transposition is the most effective augmentation strategy for the U-Net model, while transposition, time stretching, and formant shifting have a much more balanced effect on the Wave-U-Net model. Finally, we compare the two models on the same dataset.
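Two of the augmentation strategies compared in the study can be sketched with off-the-shelf tools; the shift and stretch amounts below are illustrative, and formant shifting (also studied) would require a vocoder-style transformation not shown here:

```python
import librosa

def augment(y, sr, n_steps=2.0, rate=1.1):
    """Generate pitch-transposed and time-stretched copies of a
    training signal for singing voice separation."""
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    stretched = librosa.effects.time_stretch(y, rate=rate)
    return shifted, stretched
```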
Singing voice synthesis based on convolutional neural networks
The present paper describes a singing voice synthesis technique based on convolutional
neural networks (CNNs). Singing voice synthesis systems based on deep neural
networks (DNNs) are currently being proposed and are improving the naturalness
of synthesized singing voices. In these systems, the relationship between
musical score feature sequences and acoustic feature sequences extracted from
singing voices is modeled by DNNs. Then, an acoustic feature sequence of an
arbitrary musical score is output in units of frames by the trained DNNs, and a
natural trajectory of a singing voice is obtained by using a parameter
generation algorithm. As singing voices contain rich expression, a powerful
technique to model them accurately is required. In the proposed technique,
long-term dependencies of singing voices are modeled by CNNs. An acoustic
feature sequence is generated in units of segments that consist of long-term
frames, and a natural trajectory is obtained without the parameter generation
algorithm. Experimental results from a subjective listening test show that the proposed architecture can synthesize natural-sounding singing voices.
Comment: Singing voice samples (Japanese, English, Chinese): https://www.techno-speech.com/news-20181214a-e
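A minimal sketch of the segment-level CNN idea, assuming arbitrary channel sizes and feature dimensions (the paper's exact architecture is not reproduced): a whole segment of score features is mapped to the corresponding segment of acoustic features in one pass, so no frame-wise parameter generation algorithm is needed.

```python
import torch.nn as nn

class SegmentCNN(nn.Module):
    """Map a segment of musical-score features to a segment of acoustic
    features with a stack of 1-D convolutions whose receptive field
    spans many frames, capturing long-term dependencies."""
    def __init__(self, score_dim=64, acoustic_dim=180, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(score_dim, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, acoustic_dim, kernel_size=7, padding=3),
        )

    def forward(self, score):          # score: (batch, score_dim, frames)
        return self.net(score)         # (batch, acoustic_dim, frames)
```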