84 research outputs found
Neural Fourier Shift for Binaural Speech Rendering
We present a neural network that renders binaural speech from monaural audio
given the position and orientation of the source. Most previous works have
synthesized binaural speech by conditioning on source positions and
orientations in the feature space of convolutional neural networks. These
synthesis approaches estimate the target binaural speech well, even for
in-the-wild data, but are difficult to generalize to audio from
out-of-distribution domains. To alleviate this, we propose Neural Fourier
Shift (NFS), a novel network architecture that renders binaural speech in the
Fourier space. Specifically, using a geometric time delay based on the
distance between the source and the receiver, NFS is trained to predict the
delays and scales of various early reflections. By design, NFS is efficient
in both memory and computation, interpretable, and independent of the source
domain. Experimental results show that NFS outperforms previous studies on
the benchmark dataset while using up to 25 times less memory and 6 times
fewer calculations.
Comment: Submitted to ICASSP 202
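The geometric time delay the abstract refers to can be realized as a linear phase term in the Fourier domain, i.e. the classical Fourier shift property: delaying a signal by d samples multiplies its spectrum by exp(-2j*pi*f*d). The sketch below is illustrative only, not the NFS architecture itself; the function names and the speed-of-sound constant are our own assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumption)

def fourier_shift(signal, delay_samples):
    """Delay a signal by a (possibly fractional) number of samples by
    multiplying its spectrum with a linear phase term (circular shift)."""
    n = len(signal)
    freqs = np.fft.rfftfreq(n)  # normalized frequency, cycles per sample
    spectrum = np.fft.rfft(signal)
    spectrum *= np.exp(-2j * np.pi * freqs * delay_samples)
    return np.fft.irfft(spectrum, n=n)

def geometric_delay(src, recv, sample_rate):
    """Propagation delay, in samples, between source and receiver
    positions given in metres."""
    dist = np.linalg.norm(np.asarray(src) - np.asarray(recv))
    return dist * sample_rate / SPEED_OF_SOUND
```

Predicting per-reflection delays and scales and applying them this way keeps the operation differentiable and cheap, which is consistent with the efficiency claims above.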
Pop2Piano: Pop Audio-based Piano Cover Generation
Piano covers of pop music are widely enjoyed, yet the task of generating
them automatically is still understudied. This is partly due to the lack of
synchronized {Pop, Piano Cover} data pairs, which makes it challenging to
apply the latest data-intensive deep learning methods. To leverage the power
of the data-driven approach, we build a large amount of paired and
synchronized {pop, piano cover} data using an automated pipeline. In this
paper, we present Pop2Piano, a Transformer network that generates piano
covers given waveforms of pop music. To the best of our knowledge, this is
the first model to directly generate a piano cover from pop audio without
melody and chord extraction modules. We show that Pop2Piano trained with our
dataset can generate plausible piano covers.
Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features
Estimating Head-Related Transfer Functions (HRTFs) at arbitrary source
positions is essential in immersive binaural audio rendering. Computing each
individual's HRTFs is challenging: traditional approaches require expensive
time and computational resources, while modern data-driven approaches are
data-hungry. For the data-driven approaches in particular, existing HRTF
datasets differ in the spatial sampling distributions of their source
positions, posing a major problem when generalizing a method across multiple
datasets. To alleviate this, we propose a deep learning method based on a
novel conditioning architecture. The proposed method can predict an HRTF at
any position by interpolating from HRTFs measured on known spatial
distributions. Experimental results show that the proposed architecture
improves the model's generalizability across datasets with various coordinate
systems. Additional experiments with coarsened HRTFs show that the model
robustly reconstructs the target HRTFs from the coarsened data.
Comment: Submitted to Interspeech 202
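The "learned affine transformation of hyper-conditioned features" in the title can be read as FiLM-style conditioning: a small hyper-network maps the source position to a per-channel scale and shift that modulate intermediate features. A minimal numpy sketch, assuming hypothetical weight names and a single-layer hyper-network (the paper's actual parameterization may differ):

```python
import numpy as np

def hyper_affine(features, position, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM-style conditioning: the source position is mapped through a
    (here, single linear layer) hyper-network to a per-channel scale
    gamma and shift beta, which are applied to the feature vector."""
    gamma = position @ w_gamma + b_gamma  # shape (C,)
    beta = position @ w_beta + b_beta     # shape (C,)
    return gamma * features + beta
```

Because the conditioning input is a continuous position rather than a fixed grid index, the same network can be queried at any coordinate, which is what allows interpolation across datasets with different spatial sampling.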
Music De-limiter Networks via Sample-wise Gain Inversion
The loudness war, an ongoing phenomenon in the music industry characterized
by increasing the final loudness of music while reducing its dynamic range,
has been a controversial topic for decades. Music mastering engineers have
used limiters to heavily compress music and make it louder, which can induce
ear fatigue and hearing loss in listeners. In this paper, we introduce music
de-limiter networks that estimate uncompressed music from heavily compressed
signals. Inspired by the principle of a limiter, which performs sample-wise
gain reduction of a given signal, we propose the framework of sample-wise
gain inversion (SGI). We also present the musdb-XL-train dataset, consisting
of 300k segments created by applying a commercial limiter plug-in, for
training real-world-friendly de-limiter networks. Our proposed de-limiter
network achieves excellent performance with a scale-invariant
source-to-distortion ratio (SI-SDR) of 23.8 dB in reconstructing musdb-HQ
from musdb-XL, a limiter-applied version of musdb-HQ. The training data,
code, and model weights are available in our repository
(https://github.com/jeonchangbin49/De-limiter).
Comment: Accepted to IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA) 202
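Both sample-wise gain inversion and the SI-SDR metric quoted above can be sketched compactly. The function names below are our own, and the actual de-limiter predicts the per-sample gain with a neural network rather than receiving it as an input:

```python
import numpy as np

def apply_gain_inversion(limited, predicted_gain, eps=1e-8):
    """SGI: a limiter multiplies each sample by a gain 0 < g <= 1, so an
    estimate of the uncompressed signal is the limited signal divided by
    the predicted per-sample gain (clamped to avoid division by zero)."""
    return limited / np.maximum(predicted_gain, eps)

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant source-to-distortion ratio in dB: project the
    estimate onto the reference, then compare target and residual energy."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(noise, noise) + eps))
```

Because SI-SDR is invariant to overall scale, it rewards recovering the dynamic shape of the signal rather than its absolute loudness, which suits the de-limiting task.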