Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music
This paper presents a new input format, channel-wise subband input (CWS), for
convolutional neural network (CNN)-based music source separation (MSS) models
in the frequency domain. We aim to address the major issues in CNN-based
high-resolution MSS models: high computational cost and weight sharing between
distinctly different bands. Specifically, in this paper, we decompose the input
mixture spectra into several bands and concatenate them channel-wise as the
model input. The proposed approach enables effective weight sharing in each
subband and introduces more flexibility between channels. For comparison
purposes, we perform voice and accompaniment separation (VAS) on models with
different scales, architectures, and CWS settings. Experiments show that the
CWS input is beneficial in many aspects. We evaluate our method on the
musdb18hq test set, focusing on the SDR, SIR, and SAR metrics. Among all our
experiments, CWS enables models to obtain a 6.9% performance gain on the
average metrics. With an even smaller number of parameters, less training data,
and shorter training time, our MDenseNet with 8-band CWS input still surpasses
the original MMDenseNet by a large margin. Moreover, CWS also reduces
computational cost and training time to a large extent.
Comment: Accepted in INTERSPEECH 202
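To make the channel-wise subband idea above concrete, here is a minimal sketch of the input transform: the frequency axis of the mixture spectrogram is split into equal bands that are then stacked along the channel axis. The band count, tensor shapes, and PyTorch framing are illustrative assumptions rather than the paper's exact configuration.

```python
# Channel-wise subband (CWS) input sketch: split the frequency axis into
# n_bands contiguous subbands and fold the band axis into the channel axis.
import torch

def cws_input(spec: torch.Tensor, n_bands: int = 8) -> torch.Tensor:
    """spec: (batch, channels, freq_bins, frames) ->
    (batch, channels * n_bands, freq_bins // n_bands, frames)."""
    b, c, f, t = spec.shape
    assert f % n_bands == 0, "frequency bins must divide evenly into bands"
    spec = spec.reshape(b, c, n_bands, f // n_bands, t)   # split frequency axis
    return spec.reshape(b, c * n_bands, f // n_bands, t)  # merge bands into channels

# Example: a stereo spectrogram with 2048 bins becomes a 16-channel input with
# 256 bins per band, so each subband can get its own convolutional filters.
x = torch.randn(1, 2, 2048, 128)
print(cws_input(x, n_bands=8).shape)  # torch.Size([1, 16, 256, 128])
```

Because each subband now occupies its own channels, the early convolutional layers no longer have to share one set of weights across distinctly different frequency regions, which is the flexibility the abstract refers to.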
Leveraging Pre-trained AudioLDM for Text to Sound Generation: A Benchmark Study
Deep neural networks have recently achieved breakthroughs in sound generation
with text prompts. Despite their promising performance, current text-to-sound
generation models face issues on small-scale datasets (e.g., overfitting),
significantly limiting their performance. In this paper, we investigate the use
of pre-trained AudioLDM, the state-of-the-art model for text-to-audio
generation, as the backbone for sound generation. Our study demonstrates the
advantages of using pre-trained models for text-to-sound generation, especially
in data-scarcity scenarios. In addition, experiments show that different
training strategies (e.g., training conditions) may affect the performance of
AudioLDM on datasets of different scales. To facilitate future studies, we also
evaluate various text-to-sound generation systems on several frequently used
datasets under the same evaluation protocols, which allow fair comparisons and
benchmarking of these methods on common ground.
Comment: EUSIPCO 202
Text-Driven Foley Sound Generation With Latent Diffusion Model
Foley sound generation aims to synthesise the background sound for multimedia
content. Previous models usually employ a large development set with labels as
input (e.g., single numbers or one-hot vectors). In this work, we propose a
diffusion model based system for Foley sound generation with text conditions.
To alleviate the data scarcity issue, our model is initially pre-trained with
large-scale datasets and fine-tuned to this task via transfer learning using
the contrastive language-audio pretraining (CLAP) technique. We have observed
that the feature embedding extracted by the text encoder can significantly
affect the performance of the generation model. Hence, we introduce a trainable
layer after the encoder to improve the text embedding produced by the encoder.
In addition, we further refine the generated waveform by generating multiple
candidate audio clips simultaneously and selecting the best one, which is
determined in terms of the similarity score between the embedding of the
candidate clips and the embedding of the target text label. Using the proposed
method, our system ranks among the systems submitted to DCASE
Challenge 2023 Task 7. The results of the ablation studies illustrate that the
proposed techniques significantly improve sound generation performance. The
codes for implementing the proposed system are available online.
Comment: Submitted to DCASE-workshop 2023. arXiv admin note: text overlap with
arXiv:2305.1590
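The candidate-selection step described above can be sketched as a simple reranking: embed each generated clip and the target text with a CLAP-style encoder and keep the clip with the highest cosine similarity. The embed_audio and embed_text callables below are hypothetical stand-ins, not the authors' released code.

```python
# Rank candidate clips by cosine similarity to the target text embedding.
import numpy as np

def select_best_clip(candidates, text_label, embed_audio, embed_text):
    """candidates: list of waveforms; embed_*: callables returning 1-D vectors."""
    text_emb = embed_text(text_label)
    text_emb = text_emb / np.linalg.norm(text_emb)
    best_idx, best_score = 0, -np.inf
    for i, clip in enumerate(candidates):
        audio_emb = embed_audio(clip)
        audio_emb = audio_emb / np.linalg.norm(audio_emb)
        score = float(np.dot(audio_emb, text_emb))  # cosine similarity
        if score > best_score:
            best_idx, best_score = i, score
    return candidates[best_idx], best_score
```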
AudioSR: Versatile Audio Super-resolution at Scale
Audio super-resolution is a fundamental task that predicts high-frequency
components for low-resolution audio, enhancing audio quality in digital
applications. Previous methods have limitations such as the limited scope of
audio types (e.g., music, speech) and specific bandwidth settings they can
handle (e.g., 4kHz to 8kHz). In this paper, we introduce a diffusion-based
generative model, AudioSR, that is capable of performing robust audio
super-resolution on versatile audio types, including sound effects, music, and
speech. Specifically, AudioSR can upsample any input audio signal within the
bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz
bandwidth with a sampling rate of 48kHz. Extensive objective evaluation on
various audio super-resolution benchmarks demonstrates the strong results
achieved by the proposed model. In addition, our subjective evaluation shows
that AudioSR can act as a plug-and-play module to enhance the generation
quality of a wide range of audio generative models, including AudioLDM,
Fastspeech2, and MusicGen. Our code and demo are available at
https://audioldm.github.io/audiosr.
Comment: Under review. Demo and code: https://audioldm.github.io/audiosr
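As a rough illustration of the bandwidth and sampling-rate figures quoted above, the helper below estimates an input clip's effective bandwidth (to check that it falls in the supported 2 kHz to 16 kHz range) and resamples it to the 48 kHz output rate. The power threshold and SciPy-based resampling are assumptions made for illustration, not part of the AudioSR implementation.

```python
# Estimate effective bandwidth and resample to the model's 48 kHz output rate.
from math import gcd

import numpy as np
from scipy import signal

def effective_bandwidth(wav: np.ndarray, sr: int, rel_db: float = -50.0) -> float:
    """Highest frequency whose power stays within rel_db of the spectral peak."""
    freqs, psd = signal.welch(wav, fs=sr, nperseg=2048)
    psd_db = 10.0 * np.log10(psd + 1e-12)
    above = np.where(psd_db > psd_db.max() + rel_db)[0]
    return float(freqs[above[-1]]) if above.size else 0.0

def prepare_for_sr(wav: np.ndarray, sr: int, target_sr: int = 48000) -> np.ndarray:
    """Check the clip's bandwidth and resample it to target_sr."""
    bw = effective_bandwidth(wav, sr)
    assert 2000.0 <= bw <= 16000.0, f"bandwidth {bw:.0f} Hz outside the supported range"
    g = gcd(target_sr, sr)
    return signal.resample_poly(wav, target_sr // g, sr // g)
```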
Adapting Language-Audio Models as Few-Shot Audio Learners
We presented the Treff adapter, a training-efficient adapter for CLAP, to
boost zero-shot classification performance by making use of a small set of
labelled data. Specifically, we designed CALM to retrieve the probability
distribution of text-audio clips over classes using a set of audio-label pairs
and combined it with CLAP's zero-shot classification results. Furthermore, we
designed a training-free version of the Treff adapter by using CALM as a cosine
similarity measure. Experiments showed that the proposed Treff adapter is
comparable to and even better than fully-supervised methods and adaptation methods
in low-shot and data-abundant scenarios. While the Treff adapter shows that
combining large-scale pretraining and rapid learning of domain-specific
knowledge is non-trivial for obtaining generic representations for few-shot
learning, it is still limited to audio classification tasks. In the future, we
will explore how to use audio-language models in diverse audio domains.
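The training-free variant mentioned above can be sketched as a simple blend of zero-shot and few-shot scores computed over L2-normalised, CLAP-style embeddings. The softmax temperature and blend weight below are illustrative assumptions rather than the Treff adapter's actual parameters.

```python
# Blend zero-shot (text) scores with few-shot (labelled support audio) scores.
import numpy as np

def few_shot_scores(query_emb, support_embs, support_labels, n_classes, temp=10.0):
    """Aggregate query/support cosine similarities into per-class scores."""
    sims = support_embs @ query_emb            # (n_support,) cosine similarities
    weights = np.exp(temp * sims)
    scores = np.zeros(n_classes)
    for w, y in zip(weights, support_labels):
        scores[y] += w
    return scores / scores.sum()

def combined_prediction(query_emb, text_embs, support_embs, support_labels, alpha=0.5):
    """Return the class index maximising the blended zero-/few-shot score."""
    zero_shot = np.exp(text_embs @ query_emb)
    zero_shot = zero_shot / zero_shot.sum()    # softmax over class-text similarities
    few_shot = few_shot_scores(query_emb, support_embs, support_labels, len(text_embs))
    return int(np.argmax(alpha * zero_shot + (1 - alpha) * few_shot))
```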
VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
Speech restoration aims to remove distortions in speech signals. Prior
methods mainly focus on a single type of distortion, such as speech denoising
or dereverberation. However, speech signals can be degraded by several
different distortions simultaneously in the real world. It is thus important to
extend speech restoration models to deal with multiple distortions. In this
paper, we introduce VoiceFixer, a unified framework for high-fidelity speech
restoration. VoiceFixer restores speech from multiple distortions (e.g., noise,
reverberation, and clipping) and can expand degraded speech (e.g., noisy
speech) with a low bandwidth to 44.1 kHz full-bandwidth high-fidelity speech.
We design VoiceFixer based on (1) an analysis stage that predicts
intermediate-level features from the degraded speech, and (2) a synthesis stage
that generates the waveform using a neural vocoder. Both objective and subjective
evaluations show that VoiceFixer is effective on severely degraded speech, such
as real-world historical speech recordings. Samples of VoiceFixer are available
at https://haoheliu.github.io/voicefixer.
Comment: Submitted to INTERSPEECH 202
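A structural sketch of the two-stage design described above, with placeholder modules standing in for the released VoiceFixer networks: an analysis model maps the degraded speech to intermediate-level features (assumed here to be a restored mel spectrogram), and a neural vocoder synthesises the waveform.

```python
# Two-stage restoration: analysis (degraded mel -> restored mel) followed by
# synthesis (restored mel -> waveform via a neural vocoder).
import torch
import torch.nn as nn

class TwoStageRestorer(nn.Module):
    def __init__(self, analysis: nn.Module, vocoder: nn.Module):
        super().__init__()
        self.analysis = analysis  # stage 1: predicts intermediate-level features
        self.vocoder = vocoder    # stage 2: generates the waveform

    def forward(self, degraded_mel: torch.Tensor) -> torch.Tensor:
        restored_mel = self.analysis(degraded_mel)
        return self.vocoder(restored_mel)
```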