hf0: A hybrid pitch extraction method for multimodal voice
Pitch or fundamental frequency (f0) extraction is a fundamental problem that has been studied extensively for its applications in speech processing and clinical settings. In the literature, mode-specific (modal speech, singing voice, emotional/expressive speech, or noisy speech) signal processing and deep learning f0 extraction methods have been developed that exploit the quasi-periodic nature of the signal in time, its harmonic structure in the spectral domain, or a combination of both. Hence, there is no single unified method that can reliably extract the pitch across the various modes of the acoustic signal. In this work, we propose a hybrid f0 extraction method that seamlessly extracts the pitch across modes of speech production with the very high accuracy required by many applications. The proposed hybrid model exploits the advantages of deep learning and signal processing methods to minimize the pitch detection error and adapts to the various modes of the acoustic signal. Specifically, we propose an ordinal-regression convolutional neural network that maps a periodicity-rich input representation to nominal pitch classes, which drastically reduces the number of classes required for pitch detection compared with other deep learning approaches. The accurate f0 is then estimated from the nominal pitch class labels by filtering and autocorrelation. We show that the proposed method generalizes to unseen modes of voice production and various noise conditions on large-scale datasets. The proposed hybrid model also significantly reduces the number of learnable parameters required to train the deep model compared with other methods. Furthermore, the evaluation measures show that the proposed method is significantly better than state-of-the-art signal processing and deep learning approaches.
Comment: Pitch Extraction, F0 extraction, harmonic signals, speech, monophonic songs, Convolutional Neural Network, 5 pages, 5 figures
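To make the two-stage idea concrete, the sketch below is a minimal illustration, not the authors' implementation: the class-to-band mapping, filter order, sample rate, and function names are assumptions. It shows how a coarse pitch class predicted by the network could be refined into an exact f0 by band-pass filtering and autocorrelation within that class's frequency range.

```python
# Minimal sketch of the hybrid refinement step: a coarse pitch class from the
# ordinal-regression CNN defines a frequency band [f_lo, f_hi]; the exact f0 is
# then recovered inside that band by filtering and autocorrelation.
# The band layout, filter order, and sample rate are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def refine_f0(frame, f_lo, f_hi, sr=16000):
    """Refine f0 inside a coarse pitch band via band-pass filtering + autocorrelation."""
    sos = butter(4, [f_lo, f_hi], btype="band", fs=sr, output="sos")
    x = sosfiltfilt(sos, frame)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]      # non-negative lags only
    lag_min, lag_max = int(sr / f_hi), int(sr / f_lo)      # search lags spanning the band
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])     # strongest periodicity in the band
    return sr / lag
```

A 30–60 ms analysis frame would be typical here, with the band edges taken from the boundaries of the predicted nominal pitch class.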
Deep Learning for Singing Processing: Achievements, Challenges and Impact on Singers and Listeners
This paper summarizes recent advances in a set of tasks related to the processing of singing using state-of-the-art deep learning techniques. We discuss their achievements in terms of accuracy and sound quality, as well as the current challenges, such as the availability of data and computing resources. We also discuss the impact that these advances have, and will have, on listeners and singers as they are integrated into commercial applications.
Comment: Keynote speech, 2018 Joint Workshop on Machine Learning for Music. The Federated Artificial Intelligence Meeting (FAIM), a joint workshop program of ICML, IJCAI/ECAI, and AAMA
Speech-to-Singing Conversion based on Boundary Equilibrium GAN
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input and, optionally, the F0 contour of the target singing, the proposed model generates the output singing signal using a progressive-growing encoder/decoder architecture and boundary equilibrium GAN loss functions. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline. For reproducibility, the code will be made publicly available in a GitHub repository upon paper publication.
Comment: Accepted for publication at INTERSPEECH 202
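As a hedged sketch of the boundary equilibrium GAN objective mentioned above, the losses could be computed as follows, with the discriminator acting as an autoencoder over spectrograms. The update follows the standard BEGAN formulation; the gamma and lambda values and tensor names are assumptions, not details from the paper.

```python
import torch

def began_losses(real, d_real_rec, fake, d_fake_rec, k, gamma=0.5, lambda_k=1e-3):
    """Boundary equilibrium GAN losses: D is an autoencoder, and its
    reconstruction error on real vs. generated spectrograms drives both players."""
    l_real = torch.mean(torch.abs(d_real_rec - real))    # D's error on real singing spectrograms
    l_fake = torch.mean(torch.abs(d_fake_rec - fake))    # D's error on generated spectrograms
    loss_d = l_real - k * l_fake                         # discriminator objective
    loss_g = l_fake                                      # generator objective
    # Equilibrium control: keep l_fake near gamma * l_real by adapting k
    k_next = min(max(k + lambda_k * (gamma * l_real - l_fake).item(), 0.0), 1.0)
    return loss_d, loss_g, k_next
```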
Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System
In this study, we define the identity of a singer in terms of two independent concepts, timbre and singing style, and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in the following ways: first, we design a singer identity encoder that can adequately reflect the identity of a singer; second, we use the encoded singer identity to condition two independent decoders that model timbre and singing style, respectively. Through a user study with listening tests, we experimentally verify that the proposed framework is capable of generating a natural, high-quality singing voice while independently controlling the timbre and singing style. Also, by changing the singing style while keeping the timbre fixed, we show that the proposed network can produce a more expressive singing voice.
Comment: 4 pages, Submitted to ICASSP202
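The conditioning structure described above could look roughly like the following sketch; the module sizes, GRU choices, and names are assumptions for illustration, not the paper's architecture. One identity encoder produces a singer embedding that conditions two independent decoders, one for timbre and one for singing style.

```python
import torch
import torch.nn as nn

class MultiSingerSketch(nn.Module):
    """Illustrative structure only: a singer-identity encoder conditions two
    independent decoders, one modeling timbre and one modeling singing style."""
    def __init__(self, feat_dim=80, id_dim=64, hid=256):
        super().__init__()
        self.identity_enc = nn.GRU(feat_dim, id_dim, batch_first=True)
        self.timbre_dec = nn.GRU(feat_dim + id_dim, hid, batch_first=True)
        self.style_dec = nn.GRU(feat_dim + id_dim, hid, batch_first=True)

    def forward(self, ref_mel, dec_in):
        _, h = self.identity_enc(ref_mel)                      # singer embedding from a reference clip
        sid = h[-1].unsqueeze(1).expand(-1, dec_in.size(1), -1)
        timbre, _ = self.timbre_dec(torch.cat([dec_in, sid], dim=-1))
        style, _ = self.style_dec(torch.cat([dec_in, sid], dim=-1))
        return timbre, style
```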
PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network
Singing voice conversion converts one singer's voice into another's without changing the singing content. Recent work shows that unsupervised singing voice conversion can be achieved with an autoencoder-based approach [1]. However, the converted singing voice can easily be out of key, showing that the existing approach cannot model the pitch information precisely. In this paper, we propose to advance the existing unsupervised singing voice conversion method of [1] to achieve more accurate pitch translation and flexible pitch manipulation. Specifically, the proposed PitchNet adds an adversarially trained pitch regression network that forces the encoder to learn a pitch-invariant phoneme representation, and a separate module that feeds the pitch extracted from the source audio to the decoder network. Our evaluation shows that the proposed method greatly improves the quality of the converted singing voice (2.92 vs. 3.75 in MOS). We also demonstrate that the pitch of the converted singing can easily be controlled during generation by changing the level of the extracted pitch before passing it to the decoder network.
Comment: Accepted by ICASSP 202
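One common way to realize such an adversarially trained pitch regressor is a gradient reversal layer, sketched below. The gradient-reversal trick and the layer sizes are assumptions for illustration; the paper may use a different adversarial training scheme. The regressor learns to predict pitch from the encoder output, while the reversed gradient pushes the encoder toward a pitch-invariant representation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad

class PitchAdversary(nn.Module):
    """Regresses f0 from the encoder output; the reversed gradient encourages
    the encoder to produce a pitch-invariant phoneme representation."""
    def __init__(self, enc_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(enc_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, enc_out, target_f0):
        pred = self.net(GradReverse.apply(enc_out)).squeeze(-1)
        return nn.functional.mse_loss(pred, target_f0)
```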
XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0, and duration modeling. We follow the main architecture of FastSpeech while proposing several singing-specific designs: 1) besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added; 2) to attenuate off-key issues, we add a residual connection in F0 prediction; 3) in addition to the duration loss of each phoneme, the durations of all the phonemes in a musical note are accumulated to calculate a syllable duration loss for rhythm enhancement. Experimental results show that XiaoiceSing outperforms a convolutional neural network baseline by 1.44 MOS in sound quality, 1.18 in pronunciation accuracy, and 1.38 in naturalness. In two A/B tests, the proposed F0 and duration modeling methods achieve 97.3% and 84.3% preference rates over the baseline, respectively, which demonstrates the overwhelming advantages of XiaoiceSing.
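The residual F0 connection and the syllable duration loss from points 2) and 3) could be sketched as follows; the tensor shapes, the choice of MSE, and the function names are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def residual_f0(f0_residual, note_pitch):
    """Residual connection for F0: the network predicts a deviation around the
    score's note pitch, which limits how far the output can drift off key."""
    return note_pitch + f0_residual

def syllable_duration_loss(pred_phone_dur, note_dur, note_ids):
    """Accumulate predicted phoneme durations within each musical note and
    compare against the note's duration, as a rhythm-enhancing auxiliary loss."""
    num_notes = int(note_ids.max().item()) + 1
    syl_dur = torch.zeros(num_notes, device=pred_phone_dur.device)
    syl_dur = syl_dur.index_add(0, note_ids, pred_phone_dur)   # sum phoneme durations per note
    return F.mse_loss(syl_dur, note_dur)
```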
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss
Neural network (NN)-based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting when data are scarce. However, data are often limited when building SVS systems because of the high costs of data acquisition and annotation. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in objective and subjective evaluations.
Comment: Accepted by ICASSP202
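The paper's PE loss comes from a psycho-acoustic hearing model whose exact formulation is not reproduced here. Purely as an illustration of attaching such a regularizer to the usual reconstruction objective, the sketch below uses a simple spectral-entropy stand-in; the stand-in term, the weight lam, and the L1 reconstruction are assumptions, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def spectral_entropy(mag, eps=1e-8):
    """Entropy of the normalized magnitude spectrum per frame (illustrative
    stand-in, not the psycho-acoustic PE term used in the paper)."""
    p = mag / (mag.sum(dim=-1, keepdim=True) + eps)
    return -(p * torch.log2(p + eps)).sum(dim=-1).mean()

def regularized_svs_loss(pred_mag, target_mag, lam=0.1):
    """Reconstruction loss plus a weighted perceptual-style regularizer."""
    recon = F.l1_loss(pred_mag, target_mag)
    return recon + lam * spectral_entropy(pred_mag.clamp(min=0.0))
```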
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration-allocated Tacotron-like acoustic models and WaveRNN neural vocoders. Unlike conventional SVS models, the proposed ByteSing employs Tacotron-like encoder-decoder structures as the acoustic models, in which CBHG modules and recurrent neural networks (RNNs) are explored as encoders and decoders, respectively. Meanwhile, an auxiliary phoneme duration prediction model is used to expand the input sequence, which enhances the model's controllability, stability, and tempo prediction accuracy. WaveRNN neural vocoders are also adopted to further improve the voice quality of the synthesized songs. Both objective and subjective experimental results show that the proposed SVS method can produce natural, expressive, and high-fidelity songs by improving pitch and spectrogram prediction accuracy, and that the models using the attention mechanism achieve the best performance.
Comment: Accepted by ISCSLP202
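The duration-allocation step, where the encoder output is expanded to frame level using the predicted phoneme durations, could be as simple as the following sketch; the shapes and the use of repeat_interleave are illustrative assumptions rather than the paper's implementation.

```python
import torch

def expand_by_duration(encoder_out, durations):
    """Repeat each phoneme-level encoder state for its predicted number of
    frames, so the decoder sees a frame-level sequence.

    encoder_out: (num_phones, dim) float tensor
    durations:   (num_phones,) integer frame counts
    """
    return torch.repeat_interleave(encoder_out, durations, dim=0)

# Example: three phonemes lasting 2, 4, and 3 frames become a 9-frame sequence.
frames = expand_by_duration(torch.randn(3, 256), torch.tensor([2, 4, 3]))
assert frames.shape == (9, 256)
```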
Synchronising speech segments with musical beats in Mandarin and English singing
Generating synthesised singing voice with models trained on speech data has many advantages owing to the models' flexibility and controllability. However, since information about the temporal relationship between segments and beats is lacking in speech training data, the synthesised singing may sound off-beat at times. Therefore, information on the temporal relationship between speech segments and musical beats is crucial. The current study investigated segment-beat synchronisation in singing data, with hypotheses formed based on the linguistic theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an English corpus of professional singing were manually annotated and analysed. The results showed that the presence of musical beats depended more on segment duration than on sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variation despite exhibiting common patterns.
Comment: To be published in the Proceedings of Interspeech 202
Unsupervised Cross-Domain Singing Voice Conversion
We present a wav-to-wav generative model for the task of singing voice conversion from any identity. Our method combines an acoustic model trained for automatic speech recognition with melody-derived features to drive a waveform-based generator. The proposed generative architecture is invariant to the speaker's identity and can be trained to generate target singers from unlabeled training data, using either speech or singing sources. The model is optimized in an end-to-end fashion without any manual supervision, such as lyrics, musical notes, or parallel samples. The proposed approach is fully convolutional and can generate audio in real time. Experiments show that our method significantly outperforms the baseline methods while generating convincingly better audio samples than alternative attempts.
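As a rough sketch of the conditioning scheme described above, speaker-invariant ASR features and the extracted melody (F0) can be fused and upsampled to drive a convolutional waveform generator. The layer types, dimensions, and hop size below are assumptions for illustration, not the paper's generator.

```python
import torch
import torch.nn as nn

class ConditionedWaveGenerator(nn.Module):
    """Illustrative conditioning only: ASR-derived features plus an F0 track are
    fused and upsampled to waveform rate by a toy convolutional generator."""
    def __init__(self, asr_dim=256, cond_dim=64, hop=256):
        super().__init__()
        self.fuse = nn.Conv1d(asr_dim + 1, cond_dim, kernel_size=3, padding=1)
        self.upsample = nn.ConvTranspose1d(cond_dim, 1, kernel_size=hop * 2,
                                           stride=hop, padding=hop // 2)

    def forward(self, asr_feats, f0):
        # asr_feats: (B, asr_dim, T_frames); f0: (B, 1, T_frames)
        cond = torch.relu(self.fuse(torch.cat([asr_feats, f0], dim=1)))
        return torch.tanh(self.upsample(cond))   # (B, 1, T_frames * hop) waveform
```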