Introducing SPAIN (SParse Audio INpainter)
A novel sparsity-based algorithm for audio inpainting is proposed. It is an
adaptation of the SPADE algorithm by Kitić et al., originally developed for
audio declipping, to the task of audio inpainting. The new SPAIN (SParse Audio
INpainter) comes in synthesis and analysis variants. Experiments show that both
A-SPAIN and S-SPAIN outperform other sparsity-based inpainting algorithms.
Moreover, A-SPAIN performs on a par with the state-of-the-art method based on
linear prediction in terms of the SNR, and, for larger gaps, SPAIN is even
slightly better in terms of the PEMO-Q psychoacoustic criterion.
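As a rough illustration of the sparsity-driven idea behind SPAIN, the sketch below performs iterative hard thresholding of DFT coefficients combined with a consistency projection onto the reliable samples, gradually relaxing the sparsity level. This is a deliberately simplified stand-in for the ADMM-based SPADE/SPAIN iterations and their overlapping-frame processing; the function name, step size, and toy signal are illustrative assumptions, not the published algorithm.

```python
# Simplified sketch of sparsity-driven audio inpainting (SPAIN-like, synthesis view):
# iterative hard thresholding in the DFT domain plus a projection that keeps the
# reliable samples fixed. Frame overlap and the ADMM relaxation of SPADE/SPAIN
# are intentionally omitted.
import numpy as np

def inpaint_frame_sparse(frame, reliable_mask, k_step=2, max_iter=200):
    """Fill the unreliable samples of one frame by promoting DFT-domain sparsity.

    frame         -- 1-D array; values at unreliable positions are arbitrary
    reliable_mask -- boolean array, True where the original samples are known
    """
    x = frame.copy()
    k = k_step
    for _ in range(max_iter):
        # Keep only the k largest-magnitude DFT coefficients (hard thresholding).
        coeffs = np.fft.fft(x)
        smallest = np.argsort(np.abs(coeffs))[:-k]
        coeffs[smallest] = 0.0
        x = np.real(np.fft.ifft(coeffs))
        # Consistency projection: reliable samples must keep their observed values.
        x[reliable_mask] = frame[reliable_mask]
        k += k_step  # gradually relax the sparsity constraint
        if k >= len(frame):
            break
    return x

# Toy usage: remove 200 samples from a harmonic signal and fill the gap.
t = np.arange(2048) / 16000.0
clean = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
mask = np.ones_like(clean, dtype=bool)
mask[900:1100] = False                      # the gap to be inpainted
observed = np.where(mask, clean, 0.0)
restored = inpaint_frame_sparse(observed, mask)
gap_snr = 10 * np.log10(np.sum(clean[~mask] ** 2) /
                        np.sum((clean[~mask] - restored[~mask]) ** 2))
print("gap SNR (dB):", gap_snr)
```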
Deep speech inpainting of time-frequency masks
Transient loud intrusions, often occurring in noisy environments, can
completely overpower the speech signal and lead to an inevitable loss of
information. While existing algorithms for noise suppression can yield
impressive results, their efficacy remains limited for very low signal-to-noise
ratios or when parts of the signal are missing. To address these limitations,
here we propose an end-to-end framework for speech inpainting, the
context-based retrieval of missing or severely distorted parts of the
time-frequency representation of speech. The framework is based on a
convolutional U-Net trained via deep feature losses, obtained using speechVGG,
a deep speech feature extractor pre-trained on an auxiliary word classification
task. Our evaluation results demonstrate that the proposed framework can
recover large portions of the missing or distorted time-frequency representation
of speech, up to 400 ms in duration and 3.2 kHz in bandwidth. In particular, our
approach provided a substantial increase in the STOI and PESQ objective metrics of the
initially corrupted speech samples. Notably, using deep feature losses to train
the framework led to the best results, as compared to conventional approaches.
Comment: Accepted to InterSpeech202
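A minimal sketch of the deep-feature-loss training idea is given below: a small encoder-decoder stands in for the convolutional U-Net, and a tiny frozen convolutional stack stands in for speechVGG; the loss is the L1 distance between their intermediate activations for the restored versus clean spectrograms. All module names, layer sizes, and the random tensors are placeholder assumptions, not the authors' architecture.

```python
# Sketch: training a spectrogram-inpainting network with a deep feature loss.
# The frozen "feature extractor" plays the role that speechVGG plays in the paper.
import torch
import torch.nn as nn

class TinyInpainter(nn.Module):
    """Placeholder for the convolutional U-Net: maps a masked magnitude
    spectrogram (1 x F x T) to a restored one of the same shape."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class TinyFeatureExtractor(nn.Module):
    """Stand-in for speechVGG: a frozen conv stack whose intermediate
    activations define the deep feature loss."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        for p in self.parameters():
            p.requires_grad = False
    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        return [f1, f2]

def deep_feature_loss(extractor, restored, clean):
    # L1 distance between the extractor's activations for restored and clean inputs.
    return sum(nn.functional.l1_loss(fr, fc)
               for fr, fc in zip(extractor(restored), extractor(clean)))

# One illustrative training step on random tensors standing in for spectrograms.
inpainter, extractor = TinyInpainter(), TinyFeatureExtractor()
optimizer = torch.optim.Adam(inpainter.parameters(), lr=1e-4)
clean = torch.rand(4, 1, 128, 64)                 # "clean" magnitude spectrograms
masked = clean * (torch.rand_like(clean) > 0.2).float()  # corrupted T-F regions
loss = deep_feature_loss(extractor, inpainter(masked), clean)
loss.backward()
optimizer.step()
print("deep feature loss:", loss.item())
```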
Speech inpainting: Context-based speech synthesis guided by video
Audio and visual modalities are inherently connected in speech signals: lip
movements and facial expressions are correlated with speech sounds. This
motivates studies that incorporate the visual modality to enhance an acoustic
speech signal or even restore missing audio information. Specifically, this
paper focuses on the problem of audio-visual speech inpainting, which is the
task of synthesizing the speech in a corrupted audio segment so that it
is consistent with the corresponding visual content and the uncorrupted audio
context. We present an audio-visual transformer-based deep learning model that
leverages visual cues that provide information about the content of the
corrupted audio. It outperforms the previous state-of-the-art audio-visual
model and audio-only baselines. We also show how visual features extracted with
AV-HuBERT, a large audio-visual transformer for speech recognition, are
suitable for synthesizing speech.
Comment: Accepted in Interspeech2
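The sketch below illustrates one plausible shape of such an audio-visual inpainting model under simple assumptions: per-frame visual features (in the paper these would come from an AV-HuBERT-style encoder, not random tensors) are fused additively with the uncorrupted audio context in a transformer encoder, which then predicts the acoustic frames inside the gap. The class name, dimensions, and fusion scheme are illustrative, not the authors' exact model.

```python
# Sketch of an audio-visual speech-inpainting transformer: the encoder attends
# over the whole utterance and predicts mel frames for the corrupted segment,
# guided by per-frame visual features assumed to be time-aligned with the audio.
import torch
import torch.nn as nn

class AVInpainter(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=256, model_dim=256):
        super().__init__()
        self.audio_in = nn.Linear(audio_dim, model_dim)
        self.visual_in = nn.Linear(visual_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.audio_out = nn.Linear(model_dim, audio_dim)

    def forward(self, audio, visual, corrupted):
        # audio:     (B, T, audio_dim)  mel frames, zeroed inside the gap
        # visual:    (B, T, visual_dim) per-frame visual features
        # corrupted: (B, T) boolean mask, True inside the gap
        a = self.audio_in(audio * (~corrupted).unsqueeze(-1))
        v = self.visual_in(visual)
        h = self.encoder(a + v)          # simple additive audio-visual fusion
        return self.audio_out(h)         # predicted mel frames at every position

# Toy forward pass: 100 frames, gap between frames 40 and 60.
model = AVInpainter()
audio = torch.rand(2, 100, 80)
visual = torch.rand(2, 100, 256)
gap = torch.zeros(2, 100, dtype=torch.bool)
gap[:, 40:60] = True
pred = model(audio, visual, gap)
loss = nn.functional.l1_loss(pred[gap], audio[gap])  # reconstruction loss on the gap
print(pred.shape, loss.item())
```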