Transferring neural speech waveform synthesizers to musical instrument sounds generation
Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the
neural-source-filter (NSF) model have shown good performance in speech
synthesis despite their different methods of waveform generation. The
similarity between speech and music audio synthesis suggests exploring how
best to apply speech synthesizers in the music domain. This work compares three
neural synthesizers for musical instrument sound generation under three
scenarios: training
from scratch on music data, zero-shot learning from the speech domain, and
fine-tuning-based adaptation from the speech to the music domain. The results
of a large-scale perceptual test demonstrated that the performance of all three
synthesizers improved when they were pre-trained on speech data and fine-tuned
on music data, which indicates the usefulness of knowledge from speech data for
music audio generation. Among the synthesizers, WaveGlow showed the best
potential in zero-shot learning while NSF performed best in the other scenarios
and could generate samples that were perceptually close to natural audio.
Comment: Submitted to ICASSP 202
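To make the three scenarios concrete, the following is a minimal PyTorch sketch of the fine-tuning case; the synthesizer class, data loader, loss, and hyperparameters are placeholder assumptions, not the authors' training setup.

```python
# Hedged sketch: adapting a speech-pretrained waveform synthesizer to music.
# `model` and `music_loader` are hypothetical stand-ins for any of the three
# synthesizers and a dataset of (acoustic features, target waveform) pairs.
import torch
from torch.utils.data import DataLoader

def finetune(model: torch.nn.Module, music_loader: DataLoader,
             epochs: int = 10, lr: float = 1e-4) -> torch.nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # small LR for adaptation
    model.train()
    for _ in range(epochs):
        for features, waveform in music_loader:
            pred = model(features)                              # generated waveform
            loss = torch.nn.functional.l1_loss(pred, waveform)  # placeholder loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Zero-shot learning skips this loop and runs the speech-trained model directly
# on music features; training from scratch starts from random weights instead
# of a speech-pretrained checkpoint.
```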
Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders
In our previous work, we have proposed a neural vocoder called HiNet which
recovers speech waveforms by predicting amplitude and phase spectra
hierarchically from input acoustic features. In HiNet, the amplitude spectrum
predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic
features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP)
to improve the conventional one. First, acoustic features (i.e., F0 and
mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain
approximate LAS (ALAS). This module is designed based on the combination of
STFT and source-filter theory, in which the source part and the filter part are
designed based on input F0 and mel-cepstra, respectively. Then, the recovered
ALAS are processed by a data-driven LAS refinement module which consists of
multiple trainable convolutional layers to get the final LAS. Experimental
results show that the HiNet vocoder using KDD-ASP achieves higher-quality
synthetic speech than both the HiNet vocoder using the conventional ASP and the
WaveRNN vocoder on a text-to-speech (TTS) task.
Comment: Submitted to Interspeech 202
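The two-stage ASP design can be sketched roughly as follows; the plain (non-mel) cepstrum simplification, the harmonic comb used for the source part, and the layer sizes are our own assumptions for illustration, not the paper's actual architecture.

```python
# Hedged sketch of knowledge-driven LAS recovery plus data-driven refinement.
import numpy as np
import torch
import torch.nn as nn

def approximate_las(f0: float, cep: np.ndarray, n_fft: int = 512,
                    sr: int = 16000) -> np.ndarray:
    """Knowledge-driven ALAS for one frame (plain cepstrum used for brevity)."""
    freqs = np.arange(n_fft // 2 + 1) / n_fft               # normalized bins
    # Filter part: log spectral envelope from cepstral coefficients.
    envelope = sum(c * np.cos(2 * np.pi * freqs * n) for n, c in enumerate(cep))
    # Source part: crude harmonic comb peaking at multiples of F0 when voiced.
    if f0 > 0:
        source = np.log(1e-3 + np.cos(np.pi * freqs * sr / f0) ** 2)
    else:
        source = np.zeros_like(freqs)
    return envelope + source                                 # log-domain sum

class LASRefiner(nn.Module):
    """Data-driven refinement: trainable conv layers over ALAS frames."""
    def __init__(self, n_bins: int = 257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_bins, n_bins, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(n_bins, n_bins, kernel_size=5, padding=2),
        )

    def forward(self, alas: torch.Tensor) -> torch.Tensor:   # (B, bins, frames)
        return self.net(alas)                                # refined LAS
```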
CycleDRUMS: Automatic Drum Arrangement For Bass Lines Using CycleGAN
The two main research threads in computer-based music generation are: the
construction of autonomous music-making systems, and the design of
computer-based environments to assist musicians. In the symbolic domain, the
key problem of automatically arranging a piece of music has been extensively
studied, while relatively few systems have tackled this challenge in the audio
domain. In
this contribution, we propose CycleDRUMS, a novel method for generating drums
given a bass line. After converting the waveform of the bass into a
mel-spectrogram, we are able to automatically generate original drums that
follow the beat, sound credible and can be directly mixed with the input bass.
We formulated this task as an unpaired image-to-image translation problem, and
we addressed it with CycleGAN, a well-established unsupervised style transfer
framework originally designed for images. The choice to work with raw
audio and mel-spectrograms enabled us to better represent how humans perceive
music, and to potentially draw sounds for new arrangements from the vast
collection of music recordings accumulated in the last century. In the absence
an objective way of evaluating the output of both generative adversarial
networks and music generative systems, we further defined a possible metric for
the proposed task, partially based on human (and expert) judgement. Finally, as
a comparison, we replicated our results with Pix2Pix, a paired image-to-image
translation network, and we showed that our approach outperforms it.
Comment: 9 pages, 5 figures, submitted to IEEE Transactions on Multimedia; the
authors contributed equally to this work
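A minimal sketch of the preprocessing step described above, converting a bass waveform into the mel-spectrogram "image" that a CycleGAN-style network can translate; the librosa parameter values are illustrative, not those used in CycleDRUMS.

```python
# Hedged sketch: bass waveform -> log-mel-spectrogram image.
import librosa
import numpy as np

def bass_to_mel(path: str, sr: int = 22050, n_mels: int = 128) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)                          # load the bass track
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)               # image-like 2-D array

# The unpaired translation itself follows the standard CycleGAN recipe: two
# generators G (bass -> drums) and F (drums -> bass), two discriminators, an
# adversarial loss for each direction, and a cycle-consistency loss
# ||F(G(x)) - x||_1 that preserves the underlying rhythmic content.
```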
Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model
Neural source-filter (NSF) waveform models generate speech waveforms by
morphing sine-based source signals through dilated convolution in the time
domain. Although the sine-based source signals help the NSF models to produce
voiced sounds with specified pitch, the sine shape may constrain the generated
waveform when the target voiced sounds are less periodic. In this paper, we
propose a more flexible source signal called cyclic noise, a quasi-periodic
noise sequence given by the convolution of a pulse train and static random
noise with a trainable decay rate that controls the signal shape. We further
propose a masked spectral loss to guide the NSF models to produce periodic
voiced sounds from the cyclic noise-based source signal. Results from a
large-scale listening test demonstrated the effectiveness of the cyclic noise
and the masked spectral loss on speaker-independent NSF models in
copy-synthesis experiments on the CMU ARCTIC database.
Comment: Submitted to Interspeech 202
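The cyclic noise construction itself is compact; below is a hedged NumPy sketch in which the exponential decay rate is fixed for clarity, whereas the paper makes it trainable, and all constants are our own choices.

```python
# Hedged sketch: cyclic noise = pulse train convolved with decaying static noise.
import numpy as np

def cyclic_noise(f0: float, sr: int = 16000, duration: float = 0.1,
                 decay: float = 200.0, noise_len: int = 160) -> np.ndarray:
    """Quasi-periodic source signal for a voiced segment (requires f0 > 0)."""
    n = int(sr * duration)
    pulses = np.zeros(n)
    period = max(1, int(sr / f0))            # samples per pitch period
    pulses[::period] = 1.0                   # impulse at each period start
    t = np.arange(noise_len) / sr
    kernel = np.random.randn(noise_len) * np.exp(-decay * t)  # decaying noise
    return np.convolve(pulses, kernel)[:n]   # quasi-periodic noise sequence

# Example: a 100 ms cyclic-noise source at 220 Hz.
source = cyclic_noise(220.0)
```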