Introducing SPAIN (SParse Audio INpainter)
A novel sparsity-based algorithm for audio inpainting is proposed. It is an
adaptation of the SPADE algorithm by Kitić et al., originally developed for
audio declipping, to the task of audio inpainting. The new SPAIN (SParse Audio
INpainter) comes in synthesis and analysis variants. Experiments show that both
A-SPAIN and S-SPAIN outperform other sparsity-based inpainting algorithms.
Moreover, A-SPAIN performs on a par with the state-of-the-art method based on
linear prediction in terms of SNR and, for larger gaps, SPAIN is even slightly
better in terms of the PEMO-Q psychoacoustic criterion.
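To make the sparsity idea concrete, below is a minimal sketch (not the authors' implementation) of the alternating scheme that SPADE-style inpainters such as SPAIN build on: hard-thresholding in the DFT domain with a sparsity budget k that grows each iteration, alternated with a projection onto the reliable samples. The segment length, iteration count, and plain-DFT formulation are assumptions for illustration.

```python
# Minimal sketch of SPADE/SPAIN-style sparse inpainting (illustrative only):
# alternate k-sparse hard-thresholding in the DFT domain with projection onto
# the observed samples, relaxing the sparsity budget k each iteration.
import numpy as np

def sparse_inpaint(x, reliable, n_iters=100):
    """Fill the unreliable samples of segment x; reliable[i] is True where
    the sample is trusted."""
    y = x.copy()
    y[~reliable] = 0.0                       # initialise the gap with zeros
    for k in range(1, n_iters + 1):
        X = np.fft.fft(y)                    # analysis: DFT of the estimate
        X[np.argsort(np.abs(X))[:-k]] = 0.0  # keep only the k largest coefficients
        y = np.real(np.fft.ifft(X))          # synthesis back to the time domain
        y[reliable] = x[reliable]            # project onto the observed samples
    return y

# Toy usage: restore a 30-sample gap in a sinusoid.
t = np.arange(256)
clean = np.sin(2 * np.pi * 0.05 * t)
reliable = np.ones(256, dtype=bool)
reliable[100:130] = False
restored = sparse_inpaint(clean, reliable)
print("max error in gap:", np.max(np.abs(restored[~reliable] - clean[~reliable])))
```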
Deep speech inpainting of time-frequency masks
Transient loud intrusions, often occurring in noisy environments, can
completely overpower the speech signal and lead to an irreversible loss of
information. While existing algorithms for noise suppression can yield
impressive results, their efficacy remains limited for very low signal-to-noise
ratios or when parts of the signal are missing. To address these limitations,
here we propose an end-to-end framework for speech inpainting, the
context-based retrieval of missing or severely distorted parts of the
time-frequency representation of speech. The framework is based on a
convolutional U-Net trained via deep feature losses, obtained using speechVGG,
a deep speech feature extractor pre-trained on an auxiliary word classification
task. Our evaluation results demonstrate that the proposed framework can
recover large portions of the missing or distorted time-frequency
representation of speech, up to 400 ms in duration and 3.2 kHz in bandwidth. In
particular, our approach yielded a substantial increase in the STOI and PESQ
objective metrics of the initially corrupted speech samples. Notably, training
the framework with deep feature losses led to the best results compared to
conventional approaches.

Comment: Accepted to InterSpeech 2020
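As an illustration of the deep-feature-loss idea described above (a sketch, not the authors' code), the snippet below compares hidden activations of a frozen pre-trained extractor instead of raw spectrogram values. The stand-in extractor, the layer names, and the L1 distance are assumptions; a real setup would load speechVGG and select its convolutional layers.

```python
# Sketch of a deep feature loss: L1 distances between hidden activations of a
# frozen, pre-trained feature extractor (speechVGG in the paper; a stand-in here).
import torch
import torch.nn as nn

class DeepFeatureLoss(nn.Module):
    def __init__(self, extractor, layers):
        super().__init__()
        self.extractor = extractor.eval()        # frozen pre-trained network
        for p in self.extractor.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)

    def _features(self, x):
        feats, h = [], x
        for name, module in self.extractor.named_children():
            h = module(h)
            if name in self.layers:
                feats.append(h)
        return feats

    def forward(self, inpainted, target):
        # Gradients flow through `inpainted` only; the extractor stays fixed.
        return sum(torch.mean(torch.abs(f_hat - f))
                   for f_hat, f in zip(self._features(inpainted),
                                       self._features(target)))

# Toy usage with a stand-in extractor and random spectrogram batches.
extractor = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 16, 3, padding=1))
criterion = DeepFeatureLoss(extractor, layers={"0", "2"})
print(criterion(torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)))
```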
ARMAS: Active Reconstruction of Missing Audio Segments
Digital audio signal reconstruction of a lost or corrupt segment using deep
learning algorithms has been explored intensively in recent years.
Nevertheless, traditional methods based on linear interpolation, phase coding,
and tone insertion are still in use. However, we found no prior work on
reconstructing audio signals through the fusion of dithering, steganography,
and machine learning regressors. Therefore, this paper proposes a combination
of steganography, halftoning (dithering), and state-of-the-art shallow (Random
Forest regression, RF) and deep learning (Long Short-Term Memory, LSTM)
methods. The results, including comparisons with SPAIN, autoregressive,
deep-learning-based, graph-based, and other methods, are evaluated with three
different metrics. They show that the proposed solution is effective and can
enhance the reconstruction of audio signals using the side information that
steganography provides (e.g., a latent representation for audio inpainting).
Moreover, this paper proposes a novel
framework for reconstruction from heavily compressed embedded audio data using
halftoning (i.e., dithering) and machine learning, which we term HCR
(halftone-based compression and reconstruction). This work may trigger interest
in optimising this approach and/or transferring it to other domains (e.g.,
image reconstruction). Compared to existing methods, we show improved
inpainting performance in terms of the signal-to-noise ratio (SNR), the
objective difference grade (ODG), and Hansen's audio quality metric.

Comment: 9 pages, 2 tables, 8 figures
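As a sketch of the shallow-learning component only (the steganographic side information and halftoning steps are omitted), the snippet below trains a Random Forest regressor to predict each missing sample from its preceding context window and fills the gap autoregressively. The window length and forest size are assumed values.

```python
# Sketch of gap reconstruction with a Random Forest regressor: learn a
# (context window -> next sample) mapping from the intact prefix, then
# predict across the gap one sample at a time.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fill_gap_rf(signal, gap_start, gap_len, window=32, n_trees=50):
    # Training pairs from the intact samples before the gap.
    X = np.stack([signal[i:i + window] for i in range(gap_start - window)])
    y = signal[window:gap_start]
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X, y)

    filled = signal.copy()
    for i in range(gap_start, gap_start + gap_len):
        context = filled[i - window:i].reshape(1, -1)
        filled[i] = rf.predict(context)[0]   # one-step-ahead prediction
    return filled

# Toy usage: a 40-sample gap in a sinusoid.
t = np.arange(1000)
x = np.sin(2 * np.pi * 0.01 * t)
damaged = x.copy()
damaged[600:640] = 0.0
restored = fill_gap_rf(damaged, 600, 40)
print("max error in gap:", np.max(np.abs(restored[600:640] - x[600:640])))
```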
Audio-Visual Speech Inpainting with Deep Learning
In this paper, we present a deep-learning-based framework for audio-visual
speech inpainting, i.e., the task of restoring the missing parts of an acoustic
speech signal from reliable audio context and uncorrupted visual information.
Recent work focuses solely on audio-only methods and generally aims at
inpainting music signals, which have a markedly different structure from speech.
Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to
investigate the contribution that vision can provide for gaps of different
duration. We also experiment with a multi-task learning approach where a phone
recognition task is learned together with speech inpainting. Results show that
the performance of audio-only speech inpainting approaches degrades rapidly
when gaps get large, while the proposed audio-visual approach is able to
plausibly restore missing information. In addition, we show that multi-task
learning is effective, although the largest contribution to performance comes
from vision.
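A minimal sketch of how such a multi-task objective could look (an assumed form, not the authors' exact loss): a spectrogram reconstruction term for inpainting plus a weighted frame-level cross-entropy term for phone recognition. The weight alpha and the tensor shapes are assumptions for illustration.

```python
# Sketch of a multi-task loss: inpainting reconstruction + phone recognition.
import torch
import torch.nn.functional as F

def multitask_loss(pred_spec, true_spec, phone_logits, phone_targets, alpha=0.1):
    """pred_spec, true_spec: (batch, time, freq) spectrograms;
    phone_logits: (batch, time, n_phones); phone_targets: (batch, time) ints."""
    inpaint = F.l1_loss(pred_spec, true_spec)             # reconstruction term
    phones = F.cross_entropy(phone_logits.flatten(0, 1),  # (batch*time, n_phones)
                             phone_targets.flatten())
    return inpaint + alpha * phones

# Toy usage with random tensors standing in for model outputs.
B, T, Fr, P = 2, 50, 64, 40
loss = multitask_loss(torch.randn(B, T, Fr), torch.randn(B, T, Fr),
                      torch.randn(B, T, P), torch.randint(0, P, (B, T)))
print(loss)
```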