Audio style transfer
'Style transfer' among images has recently emerged as a very active research topic, fuelled by the power of convolutional neural networks (CNNs), and has quickly become a very popular technology in social media. This paper investigates the analogous problem in the audio domain: how to transfer the style of a reference audio signal to a target audio content? We propose a flexible framework for the task, which uses a sound texture model to extract statistics characterizing the reference audio style, followed by an optimization-based audio texture synthesis to modify the target content. In contrast to mainstream optimization-based visual style transfer methods, the proposed process is initialized with the target content instead of random noise, and the optimized loss covers only texture, not structure. These differences proved key for audio style transfer in our experiments. In order to extract features of interest, we investigate different architectures, whether pre-trained on other tasks, as done in image style transfer, or engineered based on the human auditory system. Experimental results on different types of audio signals confirm the potential of the proposed approach.

Comment: ICASSP 2018 - 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada. IEEE
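As a rough illustration of the optimization recipe this abstract describes, the sketch below matches Gram-matrix texture statistics of a reference spectrogram while starting the optimization from the content spectrogram rather than noise. The random 1-D convolutional feature extractor, shapes, and optimizer settings are placeholder assumptions, not the paper's actual architectures.

```python
# Minimal sketch: texture-only loss, optimization initialized from the content spectrogram.
import torch
import torch.nn.functional as F

def gram(features):
    # features: (channels, time) -> channel correlation matrix, normalized by time
    return features @ features.t() / features.shape[1]

def texture_loss(x_spec, ref_spec, filters):
    # a single random 1-D convolution over frequency bins stands in for the feature extractor
    fx = F.relu(F.conv1d(x_spec.unsqueeze(0), filters)).squeeze(0)
    fr = F.relu(F.conv1d(ref_spec.unsqueeze(0), filters)).squeeze(0)
    return F.mse_loss(gram(fx), gram(fr))

freq_bins = 257
filters = torch.randn(512, freq_bins, 11) * 0.01   # (out_channels, in_channels, kernel_size)
content = torch.randn(freq_bins, 400)               # placeholder log-magnitude content spectrogram
style = torch.randn(freq_bins, 400)                 # placeholder reference (style) spectrogram

x = content.clone().requires_grad_(True)            # key difference: initialize from content, not noise
opt = torch.optim.LBFGS([x], max_iter=100)

def closure():
    opt.zero_grad()
    loss = texture_loss(x, style, filters)           # loss is texture-only; no content term
    loss.backward()
    return loss

opt.step(closure)
# x now holds the stylized spectrogram; invert it (e.g. with Griffin-Lim) to obtain audio.
```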
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the
goal is to manipulate the timbre of a sound sample from one instrument to match
another instrument while preserving other musical content, such as pitch,
rhythm, and loudness. In principle, one could apply image-based style transfer
techniques to a time-frequency representation of an audio signal, but this
depends on having a representation that allows independent manipulation of
timbre as well as high-quality waveform generation. We introduce TimbreTron, a
method for musical timbre transfer which applies "image" domain style transfer
to a time-frequency representation of the audio signal, and then produces a
high-quality waveform using a conditional WaveNet synthesizer. We show that the
Constant Q Transform (CQT) representation is particularly well-suited to
convolutional architectures due to its approximate pitch equivariance. Based on
human perceptual evaluations, we confirmed that TimbreTron recognizably
transferred the timbre while otherwise preserving the musical content, for both
monophonic and polyphonic samples.

Comment: 17 pages, published as a conference paper at ICLR 2019
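A small sketch of the pitch-equivariance argument behind the CQT choice: on a log-frequency axis a transposition becomes (approximately) a vertical shift, which is exactly the kind of structure convolutions exploit. The file name and transposition amount below are illustrative assumptions; the CycleGAN translation and conditional WaveNet stages of the actual pipeline are not reproduced here.

```python
# Why the CQT suits convolutional timbre transfer: pitch shift ~ vertical translation.
import librosa
import numpy as np

y, sr = librosa.load("note.wav", sr=22050)             # hypothetical input clip
C = np.abs(librosa.cqt(y, sr=sr, hop_length=256,
                       bins_per_octave=24, n_bins=24 * 7))

# Transposing by one octave (12 semitones) shifts the CQT by bins_per_octave rows:
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=12)
C_up = np.abs(librosa.cqt(y_up, sr=sr, hop_length=256,
                          bins_per_octave=24, n_bins=24 * 7))
# C_up is (approximately) C shifted up by 24 bins. TimbreTron applies an image-domain
# translation to such a representation, then reconstructs the waveform with a
# conditional WaveNet rather than inverting the CQT directly.
```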
Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects
We propose an end-to-end music mixing style transfer system that converts the
mixing style of an input multitrack to that of a reference song. This is
achieved with an encoder pre-trained with a contrastive objective to extract
only audio effects related information from a reference music recording. All
our models are trained in a self-supervised manner from an already-processed
wet multitrack dataset with an effective data preprocessing method that
alleviates the data scarcity of obtaining unprocessed dry data. We analyze the
proposed encoder for the disentanglement capability of audio effects and also
validate its performance for mixing style transfer through both objective and
subjective evaluations. The results show that the proposed system not only converts the mixing style of multitrack audio to closely match that of the reference, but is also robust to mixture-wise style transfer when combined with a music source separation model.
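To make the contrastive pre-training idea concrete, here is a minimal sketch in which two segments rendered with the same effects chain form a positive pair under an NT-Xent (InfoNCE) objective. The toy encoder, batch construction, and tensor shapes are assumptions for illustration, not the system's actual components.

```python
# Contrastive objective: embeddings of segments sharing an effects chain are pulled together.
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, temperature=0.1):
    """NT-Xent / InfoNCE loss: row i of z_a and row i of z_b share the same effects chain."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z_a.shape[0])          # matching rows are the positives
    return F.cross_entropy(logits, targets)

# Placeholder encoder; in practice this would be the audio-effects encoder being pre-trained.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(128 * 100, 256))
seg_a = torch.randn(16, 128, 100)                 # e.g. mel-spectrogram patches, same effects chain
seg_b = torch.randn(16, 128, 100)                 # different song content, same processing
loss = nt_xent(encoder(seg_a), encoder(seg_b))
loss.backward()
```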
NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, which faces the following challenges: (1) the highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise; (2) the TTS system should generalize well to unseen speaking styles. In this paper, we present a noise-robust expressive TTS model (NoreSpeech), which can robustly transfer the speaking style of a noisy reference utterance to
synthesized speech. Specifically, our NoreSpeech includes several components:
(1) a novel DiffStyle module, which leverages powerful probabilistic denoising
diffusion models to learn noise-agnostic speaking style features from a teacher
model by knowledge distillation; (2) a VQ-VAE block, which maps the style
features into a controllable quantized latent space for improving the
generalization of style transfer; and (3) a straightforward but effective
parameter-free text-style alignment module, which enables NoreSpeech to
transfer style to a textual input from a length-mismatched reference utterance.
Experiments demonstrate that NoreSpeech is more effective than previous expressive TTS models in noisy environments. Audio samples and code are available at: http://dongchaoyang.top/NoreSpeech_demo/

Comment: Submitted to ICASSP202
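The VQ-VAE block (component 2) can be illustrated with a standard vector-quantization step using the straight-through estimator. The codebook size, feature dimension, and commitment loss below are illustrative assumptions, not NoreSpeech's actual configuration.

```python
# Standard VQ step: map continuous style features onto a discrete, controllable codebook.
import torch
import torch.nn.functional as F

def vector_quantize(features, codebook):
    # features: (batch, dim), codebook: (num_codes, dim)
    dists = torch.cdist(features, codebook)               # pairwise L2 distances
    indices = dists.argmin(dim=1)                          # nearest code per feature vector
    quantized = codebook[indices]
    # straight-through estimator: gradients flow to the encoder as if quantization were identity
    quantized_st = features + (quantized - features).detach()
    commit_loss = F.mse_loss(features, quantized.detach())
    return quantized_st, indices, commit_loss

codebook = torch.nn.Parameter(torch.randn(512, 256))       # 512 codes, 256-dim style features (assumed)
style_feats = torch.randn(8, 256, requires_grad=True)      # e.g. distilled, noise-agnostic style embeddings
quantized, codes, commit = vector_quantize(style_feats, codebook)
```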
Sound Transformation: Applying Image Neural Style Transfer Networks to Audio Spectrograms
Image style transfer networks are used to blend images, producing images that are a mix of source images. The process is based on controlled extraction of style and content aspects of images, using pre-trained Convolutional Neural Networks (CNNs). Our interest lies in adapting these image style transfer networks for the purpose of transforming sounds. Audio signals can be represented as grey-scale images of audio spectrograms. The purpose of our work is to investigate whether audio spectrogram inputs can be used with image neural style transfer networks to produce new sounds. Using musical instrument sounds as source sounds, we apply and compare three existing image neural style transfer networks for the task of sound mixing. Our evaluation shows that all three networks are successful in producing consistent, new sounds based on the two source sounds. We use classification models to demonstrate that the new audio signals are consistent and distinguishable from the source instrument sounds. We further apply t-SNE cluster visualisation to the feature maps of the new sounds and original source sounds, confirming that they form sound groups distinct from the source sounds. Our work paves the way to using CNNs for creative and targeted production of new sounds from source sounds, with specified source qualities, including pitch and timbre.
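A minimal sketch of the preprocessing step this abstract relies on: rendering an audio clip as a grey-scale spectrogram image that an off-the-shelf image style transfer network can consume. The file names, STFT parameters, and Griffin-Lim reconstruction hint are illustrative assumptions rather than the authors' exact settings.

```python
# Turn audio into a grey-scale spectrogram image for use with image style transfer networks.
import librosa
import numpy as np
from PIL import Image

y, sr = librosa.load("instrument.wav", sr=22050)             # hypothetical source sound
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
S_db = librosa.amplitude_to_db(S, ref=np.max)                # log-magnitude spectrogram

# scale to 0-255 and save as a grey-scale image the network can read
img = 255 * (S_db - S_db.min()) / (S_db.max() - S_db.min())
Image.fromarray(img.astype(np.uint8)).save("spectrogram.png")

# After style transfer produces a blended spectrogram image, the magnitudes can be recovered
# by inverting these scaling steps and the waveform approximated with Griffin-Lim,
# e.g. librosa.griffinlim(recovered_magnitudes, hop_length=256).
```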