VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance
Restoring degraded music signals is essential to enhance audio quality for
downstream music manipulation. Recent diffusion-based music restoration methods
have demonstrated impressive performance, and among them, diffusion posterior
sampling (DPS) stands out given its intrinsic properties, making it versatile
across various restoration tasks. In this paper, we identify potential issues
that degrade the performance of current DPS-based methods and introduce ways to
mitigate them, inspired by diverse diffusion guidance techniques including the
RePaint (RP) strategy and Pseudoinverse-Guided Diffusion Models (ΠGDM). We
demonstrate our methods for the vocal
declipping and bandwidth extension tasks under various levels of distortion and
cutoff frequency, respectively. In both tasks, our methods outperform the
current DPS-based music restoration benchmarks. We refer to
\url{http://carlosholivan.github.io/demos/audio-restoration-2023.html} for
examples of the restored audio samples.
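The DPS guidance that such methods build on can be sketched as a gradient step pulling the diffusion sample toward consistency with the degraded observation. The toy linear operator, step size, and identity-Jacobian denoiser stub below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dps_guidance_step(x_t, y, A, denoise, step_size=0.5):
    """One toy DPS-style guidance step (illustrative only).

    x_t     : current noisy diffusion sample
    y       : degraded observation
    A       : linear degradation operator (matrix)
    denoise : function estimating the clean signal x0 from x_t
    """
    x0_hat = denoise(x_t)                    # posterior-mean estimate
    residual = y - A @ x0_hat                # data-consistency error
    # Gradient of ||y - A x0_hat||^2 w.r.t. x_t, assuming an identity
    # denoiser Jacobian (a common simplification in practice).
    grad = -2.0 * A.T @ residual
    return x_t - step_size * grad            # pull sample toward consistency

# Tiny demo: "bandwidth extension" with a lowpass mask as A.
rng = np.random.default_rng(0)
n = 8
A = np.diag([1.0] * 4 + [0.0] * 4)           # keep low "frequencies" only
x_true = rng.standard_normal(n)
y = A @ x_true
x_t = x_true + 0.1 * rng.standard_normal(n)  # stand-in noisy diffusion state
identity_denoise = lambda x: x               # stub denoiser for the demo
x_next = dps_guidance_step(x_t, y, A, identity_denoise, step_size=0.25)
# The observed (low-frequency) part moves closer to y after the step.
assert np.linalg.norm(A @ x_next - y) < np.linalg.norm(A @ x_t - y)
```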
VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
Speech restoration aims to remove distortions in speech signals. Prior
methods mainly focus on a single type of distortion, such as speech denoising
or dereverberation. However, speech signals can be degraded by several
different distortions simultaneously in the real world. It is thus important to
extend speech restoration models to deal with multiple distortions. In this
paper, we introduce VoiceFixer, a unified framework for high-fidelity speech
restoration. VoiceFixer restores speech from multiple distortions (e.g., noise,
reverberation, and clipping) and can expand degraded speech (e.g., noisy
speech) with a low bandwidth to 44.1 kHz full-bandwidth high-fidelity speech.
We design VoiceFixer based on (1) an analysis stage that predicts
intermediate-level features from the degraded speech, and (2) a synthesis stage
that generates waveform using a neural vocoder. Both objective and subjective
evaluations show that VoiceFixer is effective on severely degraded speech, such
as real-world historical speech recordings. Samples of VoiceFixer are available
at https://haoheliu.github.io/voicefixer.
Comment: Submitted to INTERSPEECH 202
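The two-stage analysis/synthesis design can be sketched structurally. The magnitude-STFT features and zero-phase overlap-add below are toy stand-ins for VoiceFixer's learned analysis network and neural vocoder, with made-up frame sizes:

```python
import numpy as np

def analysis_stage(wave, n_fft=1024, hop=256):
    # Toy analysis: frame the degraded waveform and take magnitude
    # spectra as the intermediate-level features (the real analysis
    # network is learned; this only shows the pipeline shape).
    frames = [wave[i:i + n_fft] for i in range(0, len(wave) - n_fft + 1, hop)]
    window = np.hanning(n_fft)
    return np.abs(np.fft.rfft(np.stack(frames) * window, axis=1))

def synthesis_stage(features, hop=256):
    # Toy "vocoder": zero-phase inverse FFT with overlap-add, standing
    # in for the neural vocoder of the synthesis stage.
    n_fft = 2 * (features.shape[1] - 1)
    out = np.zeros(hop * (len(features) - 1) + n_fft)
    for i, mag in enumerate(features):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(mag)
    return out

wave = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)  # 1 s test tone
feats = analysis_stage(wave)
restored = synthesis_stage(feats)
```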
Solving Audio Inverse Problems with a Diffusion Model
This paper presents CQT-Diff, a data-driven generative audio model that can,
once trained, be used for solving various audio inverse problems in a
problem-agnostic setting. CQT-Diff is a neural diffusion model with an
architecture that is carefully constructed to exploit pitch-equivariant
symmetries in music. This is achieved by preconditioning the model with an
invertible Constant-Q Transform (CQT), whose logarithmically-spaced frequency
axis represents pitch equivariance as translation equivariance. The proposed
method is evaluated with objective and subjective metrics in three different
and varied tasks: audio bandwidth extension, inpainting, and declipping. The
results show that CQT-Diff outperforms the compared baselines and ablations in
audio bandwidth extension and, without retraining, delivers competitive
performance against modern baselines in audio inpainting and declipping. This
work represents the first diffusion-based general framework for solving inverse
problems in audio processing.
Comment: Submitted to ICASSP 202
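The pitch-equivariance idea can be made concrete: on a logarithmically spaced frequency axis, multiplying every frequency by a constant (a pitch shift) is the same as translating by a fixed number of bins. The bin count and lowest frequency below are illustrative values, not the paper's configuration:

```python
import numpy as np

# CQT-style logarithmically spaced center frequencies: a pitch shift by
# k semitones multiplies every frequency by 2**(k/12), which on this
# axis is just a translation by k bins.
bins_per_octave = 12
f_min = 32.70            # C1, an assumed lowest bin
n_bins = 48
freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

# Shifting up by k bins equals multiplying all frequencies by 2**(k/12):
k = 7                    # a perfect fifth
np.testing.assert_allclose(freqs[k:], freqs[:-k] * 2.0 ** (k / bins_per_octave))
```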
ARMAS: Active Reconstruction of Missing Audio Segments
Digital audio signal reconstruction of a lost or corrupt segment using deep
learning algorithms has been explored intensively in recent years.
Nevertheless, traditional methods based on linear interpolation, phase coding,
and tone insertion are still in vogue. However, we found no research work on
reconstructing audio signals with the fusion of dithering, steganography, and
machine learning regressors. Therefore, this paper proposes the combination of
steganography, halftoning (dithering), and state-of-the-art shallow (Random
Forest regression, RF) and deep learning (Long Short-Term Memory, LSTM)
methods. The results, compared against SPAIN, autoregressive, deep
learning-based, graph-based, and other methods, are evaluated with three
different metrics. They show that the proposed solution is effective and can
enhance the reconstruction of audio signals using the side information (e.g.,
latent representations for audio inpainting) that steganography provides.
Moreover, this paper proposes a novel
framework for reconstruction from heavily compressed embedded audio data using
halftoning (i.e., dithering) and machine learning, which we termed the HCR
(halftone-based compression and reconstruction). This work may trigger interest
in optimising this approach and/or transferring it to different domains (i.e.,
image reconstruction). Compared to existing methods, we show improvement in
inpainting performance in terms of the signal-to-noise ratio (SNR), the
objective difference grade (ODG), and Hansen's audio quality metric.
Comment: 9 pages, 2 Tables, 8 Figure
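Of the metrics named, the SNR between a reference signal and its reconstruction is simple to state. A minimal sketch, with made-up test signals:

```python
import numpy as np

def snr_db(reference, estimate):
    # Signal-to-noise ratio in dB between a reference signal and a
    # reconstructed/inpainted estimate (one of the metrics above).
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

ref = np.sin(np.linspace(0, 10, 1000))
est = ref + 0.01 * np.random.default_rng(1).standard_normal(1000)
print(snr_db(ref, est))  # higher is better; small added noise -> high SNR
```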
Proceedings of the second "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'14)
The implicit objective of the biennial "international - Traveling Workshop on
Interactions between Sparse models and Technology" (iTWIST) is to foster
collaboration between international scientific teams by disseminating ideas
through both specific oral/poster presentations and free discussions. For its
second edition, the iTWIST workshop took place in the medieval and picturesque
town of Namur in Belgium, from Wednesday August 27th till Friday August 29th,
2014. The workshop was conveniently located in "The Arsenal" building within
walking distance of both hotels and town center. iTWIST'14 gathered about
70 international participants and featured 9 invited talks, 10 oral
presentations, and 14 posters on the following themes, all related to the
theory, application and generalization of the "sparsity paradigm":
Sparsity-driven data sensing and processing; Union of low dimensional
subspaces; Beyond linear and convex inverse problems; Matrix/manifold/graph
sensing/processing; Blind inverse problems and dictionary learning; Sparsity
and computational neuroscience; Information theory, geometry and randomness;
Complexity/accuracy tradeoffs in numerical methods; Sparsity? What's next?;
Sparse machine learning and inference.
Comment: 69 pages, 24 extended abstracts, iTWIST'14 website:
http://sites.google.com/site/itwist1
Music De-limiter Networks via Sample-wise Gain Inversion
The loudness war, an ongoing phenomenon in the music industry characterized
by the increasing final loudness of music while reducing its dynamic range, has
been a controversial topic for decades. Music mastering engineers have used
limiters to heavily compress and make music louder, which can induce ear
fatigue and hearing loss in listeners. In this paper, we introduce music
de-limiter networks that estimate uncompressed music from heavily compressed
signals. Inspired by the principle of a limiter, which performs sample-wise
gain reduction of a given signal, we propose the framework of sample-wise gain
inversion (SGI). We also present the musdb-XL-train dataset, consisting of 300k
segments created by applying a commercial limiter plug-in for training
real-world friendly de-limiter networks. Our proposed de-limiter network
achieves excellent performance with a scale-invariant source-to-distortion
ratio (SI-SDR) of 23.8 dB in reconstructing musdb-HQ from musdb-XL data, a
limiter-applied version of musdb-HQ. The training data, codes, and model
weights are available in our repository
(https://github.com/jeonchangbin49/De-limiter).
Comment: Accepted to IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA) 202