Introducing SPAIN (SParse Audio INpainter)
A novel sparsity-based algorithm for audio inpainting is proposed. It is an
adaptation of the SPADE algorithm by Kitić et al., originally developed for
audio declipping, to the task of audio inpainting. The new SPAIN (SParse Audio
INpainter) comes in synthesis and analysis variants. Experiments show that both
A-SPAIN and S-SPAIN outperform other sparsity-based inpainting algorithms.
Moreover, A-SPAIN performs on a par with the state-of-the-art method based on
linear prediction in terms of the SNR, and, for larger gaps, SPAIN is even
slightly better in terms of the PEMO-Q psychoacoustic criterion.
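As a rough illustration of the sparsity-driven idea behind SPAIN, the sketch below performs iterative hard thresholding of DFT coefficients combined with a consistency projection onto the reliable samples, gradually relaxing the sparsity level. This is a deliberately simplified stand-in for the ADMM-based SPADE/SPAIN iterations and their overlapping-frame processing; the function name, step size, and toy signal are illustrative assumptions, not the published algorithm.

```python
# Simplified sketch of sparsity-driven audio inpainting (SPAIN-like, synthesis view):
# iterative hard thresholding in the DFT domain plus a projection that keeps the
# reliable samples fixed. Frame overlap and the ADMM relaxation of SPADE/SPAIN
# are intentionally omitted.
import numpy as np

def inpaint_frame_sparse(frame, reliable_mask, k_step=2, max_iter=200):
    """Fill the unreliable samples of one frame by promoting DFT-domain sparsity.

    frame         -- 1-D array; values at unreliable positions are arbitrary
    reliable_mask -- boolean array, True where the original samples are known
    """
    x = frame.copy()
    k = k_step
    for _ in range(max_iter):
        # Keep only the k largest-magnitude DFT coefficients (hard thresholding).
        coeffs = np.fft.fft(x)
        smallest = np.argsort(np.abs(coeffs))[:-k]
        coeffs[smallest] = 0.0
        x = np.real(np.fft.ifft(coeffs))
        # Consistency projection: reliable samples must keep their observed values.
        x[reliable_mask] = frame[reliable_mask]
        k += k_step  # gradually relax the sparsity constraint
        if k >= len(frame):
            break
    return x

# Toy usage: remove 200 samples from a harmonic signal and fill the gap.
t = np.arange(2048) / 16000.0
clean = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
mask = np.ones_like(clean, dtype=bool)
mask[900:1100] = False                      # the gap to be inpainted
observed = np.where(mask, clean, 0.0)
restored = inpaint_frame_sparse(observed, mask)
gap_snr = 10 * np.log10(np.sum(clean[~mask] ** 2) /
                        np.sum((clean[~mask] - restored[~mask]) ** 2))
print("gap SNR (dB):", gap_snr)
```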
Deep speech inpainting of time-frequency masks
Transient loud intrusions, often occurring in noisy environments, can
completely overpower the speech signal and lead to an inevitable loss of
information. While existing algorithms for noise suppression can yield
impressive results, their efficacy remains limited for very low signal-to-noise
ratios or when parts of the signal are missing. To address these limitations,
here we propose an end-to-end framework for speech inpainting, the
context-based retrieval of missing or severely distorted parts of the
time-frequency representation of speech. The framework is based on a
convolutional U-Net trained via deep feature losses, obtained using speechVGG,
a deep speech feature extractor pre-trained on an auxiliary word classification
task. Our evaluation results demonstrate that the proposed framework can
recover large portions of the missing or distorted time-frequency representation
of speech, up to 400 ms in duration and 3.2 kHz in bandwidth. In particular, our
approach provided a substantial increase in the STOI and PESQ objective metrics of the
initially corrupted speech samples. Notably, using deep feature losses to train
the framework led to the best results, as compared to conventional approaches.
Comment: Accepted to InterSpeech202
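A minimal sketch of the deep-feature-loss training idea is given below: a small encoder-decoder stands in for the convolutional U-Net, and a tiny frozen convolutional stack stands in for speechVGG; the loss is the L1 distance between their intermediate activations for the restored versus clean spectrograms. All module names, layer sizes, and the random tensors are placeholder assumptions, not the authors' architecture.

```python
# Sketch: training a spectrogram-inpainting network with a deep feature loss.
# The frozen "feature extractor" plays the role that speechVGG plays in the paper.
import torch
import torch.nn as nn

class TinyInpainter(nn.Module):
    """Placeholder for the convolutional U-Net: maps a masked magnitude
    spectrogram (1 x F x T) to a restored one of the same shape."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class TinyFeatureExtractor(nn.Module):
    """Stand-in for speechVGG: a frozen conv stack whose intermediate
    activations define the deep feature loss."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        for p in self.parameters():
            p.requires_grad = False
    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        return [f1, f2]

def deep_feature_loss(extractor, restored, clean):
    # L1 distance between the extractor's activations for restored and clean inputs.
    return sum(nn.functional.l1_loss(fr, fc)
               for fr, fc in zip(extractor(restored), extractor(clean)))

# One illustrative training step on random tensors standing in for spectrograms.
inpainter, extractor = TinyInpainter(), TinyFeatureExtractor()
optimizer = torch.optim.Adam(inpainter.parameters(), lr=1e-4)
clean = torch.rand(4, 1, 128, 64)                 # "clean" magnitude spectrograms
masked = clean * (torch.rand_like(clean) > 0.2).float()  # corrupted T-F regions
loss = deep_feature_loss(extractor, inpainter(masked), clean)
loss.backward()
optimizer.step()
print("deep feature loss:", loss.item())
```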
Speech inpainting: Context-based speech synthesis guided by video
Audio and visual modalities are inherently connected in speech signals: lip
movements and facial expressions are correlated with speech sounds. This
motivates studies that incorporate the visual modality to enhance an acoustic
speech signal or even restore missing audio information. Specifically, this
paper focuses on the problem of audio-visual speech inpainting, which is the
task of synthesizing the speech in a corrupted audio segment so that it
is consistent with the corresponding visual content and the uncorrupted audio
context. We present an audio-visual transformer-based deep learning model that
leverages visual cues that provide information about the content of the
corrupted audio. It outperforms the previous state-of-the-art audio-visual
model and audio-only baselines. We also show how visual features extracted with
AV-HuBERT, a large audio-visual transformer for speech recognition, are
suitable for synthesizing speech.
Comment: Accepted in Interspeech2
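The sketch below illustrates one plausible shape of such an audio-visual inpainting model under simple assumptions: per-frame visual features (in the paper these would come from an AV-HuBERT-style encoder, not random tensors) are fused additively with the uncorrupted audio context in a transformer encoder, which then predicts the acoustic frames inside the gap. The class name, dimensions, and fusion scheme are illustrative, not the authors' exact model.

```python
# Sketch of an audio-visual speech-inpainting transformer: the encoder attends
# over the whole utterance and predicts mel frames for the corrupted segment,
# guided by per-frame visual features assumed to be time-aligned with the audio.
import torch
import torch.nn as nn

class AVInpainter(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=256, model_dim=256):
        super().__init__()
        self.audio_in = nn.Linear(audio_dim, model_dim)
        self.visual_in = nn.Linear(visual_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.audio_out = nn.Linear(model_dim, audio_dim)

    def forward(self, audio, visual, corrupted):
        # audio:     (B, T, audio_dim)  mel frames, zeroed inside the gap
        # visual:    (B, T, visual_dim) per-frame visual features
        # corrupted: (B, T) boolean mask, True inside the gap
        a = self.audio_in(audio * (~corrupted).unsqueeze(-1))
        v = self.visual_in(visual)
        h = self.encoder(a + v)          # simple additive audio-visual fusion
        return self.audio_out(h)         # predicted mel frames at every position

# Toy forward pass: 100 frames, gap between frames 40 and 60.
model = AVInpainter()
audio = torch.rand(2, 100, 80)
visual = torch.rand(2, 100, 256)
gap = torch.zeros(2, 100, dtype=torch.bool)
gap[:, 40:60] = True
pred = model(audio, visual, gap)
loss = nn.functional.l1_loss(pred[gap], audio[gap])  # reconstruction loss on the gap
print(pred.shape, loss.item())
```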