Zero-Shot Blind Audio Bandwidth Extension
Audio bandwidth extension involves the realistic reconstruction of
high-frequency spectra from bandlimited observations. In cases where the
lowpass degradation is unknown, such as in restoring historical audio
recordings, this becomes a blind problem. This paper introduces a novel method
called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem
in a zero-shot setting, leveraging the generative priors of a pre-trained
unconditional diffusion model. During the inference process, BABE utilizes a
generalized version of diffusion posterior sampling, where the degradation
operator is unknown but parametrized and inferred iteratively. The performance
of the proposed method is evaluated using objective and subjective metrics, and
the results show that BABE surpasses state-of-the-art blind bandwidth extension
baselines and achieves competitive performance compared to non-blind
filter-informed methods when tested with synthetic data. Moreover, BABE
exhibits robust generalization capabilities when enhancing real historical
recordings, effectively reconstructing the missing high-frequency content while
maintaining coherence with the original recording. Subjective preference tests
confirm that BABE significantly improves the audio quality of historical music
recordings. Examples of historical recordings restored with the proposed method
are available on the companion webpage:
(http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language
Processin
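The idea of a degradation operator that is "unknown but parametrized" can be illustrated with a toy example. The sketch below is not the BABE algorithm and involves no diffusion model: it assumes a hypothetical lowpass filter described by a cutoff frequency and a roll-off slope in dB per octave, and recovers the cutoff by a plain grid search against an observed magnitude spectrum. All function names, the filter shape, and the grid search itself are assumptions made for illustration only.

```python
import math


def lowpass_gain_db(freq_hz, cutoff_hz, slope_db_per_octave):
    """Gain in dB of a hypothetical lowpass: flat up to the cutoff,
    then falling linearly in dB per octave above it."""
    if freq_hz <= cutoff_hz:
        return 0.0
    return -slope_db_per_octave * math.log2(freq_hz / cutoff_hz)


def fit_cutoff(freqs_hz, clean_db, observed_db, slope_db_per_octave, candidates_hz):
    """Pick the candidate cutoff whose filtered version of the current
    clean-spectrum estimate best matches the observation (squared error in dB)."""
    best_cutoff, best_err = None, float("inf")
    for fc in candidates_hz:
        err = sum(
            (c + lowpass_gain_db(f, fc, slope_db_per_octave) - o) ** 2
            for f, c, o in zip(freqs_hz, clean_db, observed_db)
        )
        if err < best_err:
            best_cutoff, best_err = fc, err
    return best_cutoff


# Toy data: a flat "clean" spectrum degraded by a 3 kHz, 24 dB/octave lowpass.
freqs = [250.0 * i for i in range(1, 33)]  # 250 Hz to 8 kHz
clean = [0.0 for _ in freqs]
observed = [c + lowpass_gain_db(f, 3000.0, 24.0) for f, c in zip(freqs, clean)]
estimated_cutoff = fit_cutoff(freqs, clean, observed, 24.0,
                              [1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
```

In an iterative scheme, the clean-spectrum estimate would itself be refined between filter updates; here it is held fixed to keep the example small.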
AERO: Audio Super Resolution in the Spectral Domain
We present AERO, an audio super-resolution model that processes speech and
music signals in the spectral domain. AERO is based on an encoder-decoder
architecture with U-Net-like skip connections. We optimize the model using both
time and frequency domain loss functions. Specifically, we consider a set of
reconstruction losses together with perceptual ones in the form of adversarial
and feature discriminator loss functions. To better handle phase information,
the proposed method operates over the complex-valued spectrogram using two
separate channels. Unlike prior work which mainly considers low and high
frequency concatenation for audio super-resolution, the proposed method
directly predicts the full frequency range. We demonstrate high performance
across a wide range of sample rates considering both speech and music. AERO
outperforms the evaluated baselines considering Log-Spectral Distance, ViSQOL,
and the subjective MUSHRA test. Audio samples and code are available at
https://pages.cs.huji.ac.il/adiyoss-lab/aer
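Log-Spectral Distance, one of the metrics named above, can be sketched in a few lines. The code below is not AERO's evaluation code. It assumes one commonly used definition (the per-frame root-mean-square difference of log10 power spectra, averaged over frames), a naive DFT with no windowing, and a small `eps` to avoid taking the log of zero; the function names are hypothetical.

```python
import cmath
import math


def dft_power(frame):
    """Power spectrum |X[k]|^2 of one real-valued frame for k = 0..N/2,
    computed with a naive O(N^2) DFT (fine for a small illustration)."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power


def log_spectral_distance(ref_frames, est_frames, eps=1e-10):
    """Mean over frames of the RMS difference between log10 power spectra."""
    per_frame = []
    for ref, est in zip(ref_frames, est_frames):
        p_ref = dft_power(ref)
        p_est = dft_power(est)
        sq = [(math.log10(a + eps) - math.log10(b + eps)) ** 2
              for a, b in zip(p_ref, p_est)]
        per_frame.append(math.sqrt(sum(sq) / len(sq)))
    return sum(per_frame) / len(per_frame)


# An impulse has a flat spectrum, so scaling it by 10 shifts every bin's
# log10 power by the same amount.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scaled = [10.0 * x for x in impulse]
lsd_same = log_spectral_distance([impulse], [impulse])
lsd_scaled = log_spectral_distance([impulse], [scaled])
```

Lower values indicate that the estimated spectrum is closer to the reference; identical signals give a distance of zero.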
Perceptually smooth timbral guides by state-space analysis of phase-vocoder parameters
Sculptor is a phase-vocoder-based package of programs
that allows users to explore timbral manipulation
of sound in real time. It is the product
of a research program seeking ultimately to perform
gestural capture by analysis of the sound a
performer makes using a conventional instrument.
Since the phase-vocoder output is of high dimensionality
(typically more than 1,000 channels per
analysis frame), mapping phase-vocoder output to
appropriate input parameters for a synthesizer is
only feasible in theory
- …