Neural Fourier Shift for Binaural Speech Rendering
We present a neural network for rendering binaural speech from monaural audio, given the position and orientation of the source. Most previous work has focused on synthesizing binaural speech by conditioning on positions and orientations in the feature space of convolutional neural networks. These synthesis approaches are powerful at estimating the target binaural speech even for in-the-wild data, but they are difficult to generalize to rendering audio from out-of-distribution domains. To alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space. Specifically, starting from a geometric time delay based on the distance between the source and the receiver, NFS is trained to predict the delays and scales of various early reflections. By design, NFS is efficient in both memory and computation, interpretable, and independent of the source domain. Experimental results show that NFS outperforms previous studies on the benchmark dataset with up to 25 times less memory and 6 times fewer computations.
Comment: Submitted to ICASSP 202
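The building block behind applying a geometric time delay in the Fourier space is the Fourier shift theorem: delaying a signal by tau seconds multiplies its spectrum by exp(-j 2 pi f tau), so a delay derived from the source-receiver distance becomes a cheap element-wise product. The sketch below illustrates only this building block, not NFS itself; the function name, sampling conventions, and the use of a single direct-path delay (rather than learned delays and scales for multiple reflections) are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def apply_geometric_delay(mono, sample_rate, src_pos, ear_pos):
    """Delay a monaural signal by the source-to-ear propagation time,
    applied in the Fourier domain via the shift theorem:
        x(t - tau)  <->  X(f) * exp(-2j * pi * f * tau)
    Note the shift is circular; in practice the signal would be
    zero-padded before the FFT.
    """
    tau = np.linalg.norm(np.asarray(src_pos) - np.asarray(ear_pos)) / SPEED_OF_SOUND
    spec = np.fft.rfft(mono)
    freqs = np.fft.rfftfreq(len(mono), d=1.0 / sample_rate)
    delayed = spec * np.exp(-2j * np.pi * freqs * tau)
    return np.fft.irfft(delayed, n=len(mono))
```

Applying this once per ear with the respective ear positions yields the interaural time difference directly; delays and scales for early reflections compose in the same way, as element-wise products and sums in the Fourier domain.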
Pop2Piano: Pop Audio-based Piano Cover Generation
Piano covers of pop music are widely enjoyed, yet the task of automatically generating them remains understudied. This is partly due to the lack of synchronized {pop, piano cover} data pairs, which makes it challenging to apply the latest data-intensive deep learning methods. To leverage the power of the data-driven approach, we build a large amount of paired and synchronized {pop, piano cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to directly generate a piano cover from pop audio without melody and chord extraction modules. We show that Pop2Piano trained with our dataset can generate plausible piano covers.
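At a high level, a model of this kind is a sequence-to-sequence Transformer whose encoder consumes audio features and whose decoder autoregressively emits MIDI-like tokens. The minimal sketch below is a generic illustration under assumed choices (log-mel input features, a 512-token vocabulary, small layer sizes, positional encodings omitted for brevity); it is not the paper's exact architecture or tokenization.

```python
import torch
import torch.nn as nn

class AudioToMidiTransformer(nn.Module):
    """A minimal sketch of a pop-audio-to-piano-cover seq2seq model:
    an encoder over audio frames and a decoder over MIDI-like tokens.
    """
    def __init__(self, n_mels=128, d_model=256, vocab_size=512):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)      # log-mel frame -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (B, T_audio, n_mels); tokens: (B, T_tok) MIDI-like token ids
        src = self.audio_proj(mel)
        tgt = self.token_emb(tokens)
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (B, T_tok, vocab_size) next-token logits
```

At inference time the decoder would run autoregressively from a start token, and the resulting token sequence would be decoded back into a MIDI piano performance.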
Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features
Estimating Head-Related Transfer Functions (HRTFs) at arbitrary source positions is essential for immersive binaural audio rendering. Computing each individual's HRTFs is challenging: traditional approaches demand substantial time and computational resources, while modern data-driven approaches are data-hungry. For the data-driven approaches in particular, existing HRTF datasets differ in the spatial sampling distributions of their source positions, posing a major problem when generalizing a method across multiple datasets. To alleviate this, we propose a deep learning method based on a novel conditioning architecture. The proposed method can predict the HRTF at any position by interpolating the HRTFs measured at known positions. Experimental results show that the proposed architecture improves the model's generalizability across datasets with various coordinate systems. Additional experiments with coarsened HRTFs show that the model robustly reconstructs the target HRTFs from the coarsened data.
Comment: Submitted to Interspeech 202
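A "learned affine transformation of hyper-conditioned features" can be read as FiLM-style modulation: a small hyper-network maps the query source position to per-channel scale and shift parameters that modulate intermediate features of the main network. The sketch below shows one plausible form of such a conditioning block; the coordinate parameterization, layer sizes, and names are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HyperFiLM(nn.Module):
    """A minimal sketch of position-conditioned affine modulation:
    a hyper-network turns the source position into per-channel
    (scale, shift) parameters applied to intermediate features.
    """
    def __init__(self, cond_dim=3, feat_dim=64):
        super().__init__()
        # cond = (azimuth, elevation, distance) of the query position
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, 64), nn.ReLU(),
            nn.Linear(64, 2 * feat_dim),   # -> concatenated (scale, shift)
        )

    def forward(self, feats, cond):
        # feats: (B, feat_dim) intermediate features; cond: (B, cond_dim)
        scale, shift = self.hyper(cond).chunk(2, dim=-1)
        return feats * (1.0 + scale) + shift   # affine modulation
```

Because the conditioning enters only through a continuous position vector, the same block can be queried at any source position, which is what makes interpolation across differently sampled datasets possible.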
Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling
Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, which is composed of discrete units such as phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling that extracts event-based representations instead of fixed-rate ones. The boundary predictor outputs a boundary probability between 0 and 1, making the pooling soft. The model is trained to minimize the difference between its pooled representation and that of the same data augmented by time-stretching and pitch-shifting. To confirm that the learned representation contains content information but is independent of speaker information, the model was evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
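One plausible way to realize such soft, variable-length pooling is to turn the per-frame boundary probabilities into soft segment indices via a cumulative sum, then average frames into segments with triangular weights around each integer index. The abstract does not specify the exact formulation, so the sketch below is an assumed scheme for illustration only.

```python
import torch

def soft_variable_length_pool(feats, boundary_probs):
    """A minimal sketch of soft variable-length pooling.

    feats:          (T, D) frame-level features
    boundary_probs: (T,)   per-frame boundary probability in (0, 1)
    Returns (K, D) segment-level features, where K is the number of
    soft segments implied by the boundary probabilities.
    """
    # Soft segment index of each frame: rises by ~1 at each boundary.
    soft_ids = torch.cumsum(boundary_probs, dim=0)                   # (T,)
    num_segments = int(torch.ceil(soft_ids[-1]).item()) + 1

    # Triangular weights: frame t contributes to segment k in
    # proportion to how close its soft index is to k.
    ks = torch.arange(num_segments, dtype=feats.dtype)               # (K,)
    w = torch.clamp(1.0 - (soft_ids[:, None] - ks[None, :]).abs(), min=0.0)  # (T, K)

    # Weighted average of frames per segment (eps avoids divide-by-zero).
    return (w.T @ feats) / (w.sum(dim=0)[:, None] + 1e-8)            # (K, D)

# Usage: pool 100 frames of 256-dim features into soft segments.
feats = torch.randn(100, 256)
probs = torch.sigmoid(torch.randn(100))
segments = soft_variable_length_pool(feats, probs)
```

Every step is differentiable in the boundary probabilities, so a boundary predictor of this form could be trained end-to-end with the augmentation-consistency objective described above.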