22 research outputs found
A Time-Frequency Generative Adversarial based method for Audio Packet Loss Concealment
Packet loss is a major cause of voice quality degradation in VoIP
transmissions with serious impact on intelligibility and user experience. This
paper describes a system based on a generative adversarial approach, which aims
to repair the lost fragments during the transmission of audio streams. Inspired
by the powerful image-to-image translation capability of Generative Adversarial
Networks (GANs), we propose bin2bin, an improved pix2pix framework to achieve
the translation task from magnitude spectrograms of audio frames with lost
packets, to noncorrupted speech spectrograms. In order to better maintain the
structural information after spectrogram translation, this paper introduces the
combination of two STFT-based loss functions, mixed with the traditional GAN
objective. Furthermore, we employ a modified PatchGAN structure as
discriminator and we lower the concealment time by a proper initialization of
the phase reconstruction algorithm. Experimental results show that the proposed
method has obvious advantages when compared with the current state-of-the-art
methods, as it can better handle both high packet loss rates and large gaps.Comment: Accepted at EUSIPCO - 31st European Signal Processing Conference,
202
Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding
Speech codecs learn compact representations of speech signals to facilitate
data transmission. Many recent deep neural network (DNN) based end-to-end
speech codecs achieve low bitrates and high perceptual quality at the cost of
model complexity. We propose a cross-module residual learning (CMRL) pipeline
as a module carrier with each module reconstructing the residual from its
preceding modules. CMRL differs from other DNN-based speech codecs, in that
rather than modeling speech compression problem in a single large neural
network, it optimizes a series of less-complicated modules in a two-phase
training scheme. The proposed method shows better objective performance than
AMR-WB and the state-of-the-art DNN-based speech codec with a similar network
architecture. As an end-to-end model, it takes raw PCM signals as an input, but
is also compatible with linear predictive coding (LPC), showing better
subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved
by using only 0.9 million trainable parameters, a significantly less complex
architecture than the other DNN-based codecs in the literature.Comment: Accepted for publication in INTERSPEECH 201