Packet loss is a major cause of voice quality degradation in VoIP
transmissions, with a serious impact on intelligibility and user experience.
This paper describes a system based on a generative adversarial approach that
aims to repair fragments lost during the transmission of audio streams. Inspired
by the powerful image-to-image translation capability of Generative Adversarial
Networks (GANs), we propose bin2bin, an improved pix2pix framework that
translates magnitude spectrograms of audio frames with lost packets into
uncorrupted speech spectrograms. To better preserve structural information
after spectrogram translation, we combine two STFT-based loss functions with
the traditional GAN objective. Furthermore, we employ a modified PatchGAN
structure as the discriminator and reduce the concealment time by properly
initializing
the phase reconstruction algorithm. Experimental results show that the proposed
method outperforms current state-of-the-art
methods, as it can better handle both high packet loss rates and large gaps.

Comment: Accepted at EUSIPCO - 31st European Signal Processing Conference,
202
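
To illustrate how STFT-based terms can be mixed with an adversarial objective, the following is a minimal NumPy sketch. The abstract does not name the two STFT-based losses, so the choice of spectral convergence and log-magnitude L1 here is an assumption, as are the function names, the frame and hop sizes, and the weights `w_sc` and `w_log`. It is not the authors' implementation.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=128):
    """Magnitude STFT from Hann-windowed frames; returns (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_convergence(pred_mag, target_mag, eps=1e-8):
    """Assumed STFT loss 1: Frobenius-norm error relative to the target."""
    return np.linalg.norm(target_mag - pred_mag) / (np.linalg.norm(target_mag) + eps)

def log_magnitude_l1(pred_mag, target_mag, eps=1e-8):
    """Assumed STFT loss 2: mean absolute error between log magnitudes."""
    return np.mean(np.abs(np.log(target_mag + eps) - np.log(pred_mag + eps)))

def generator_loss(adv_loss, pred_mag, target_mag, w_sc=1.0, w_log=1.0):
    """Adversarial term plus the two weighted STFT terms (weights are hypothetical)."""
    return (adv_loss
            + w_sc * spectral_convergence(pred_mag, target_mag)
            + w_log * log_magnitude_l1(pred_mag, target_mag))

# Toy example: simulate a lost packet by zeroing a stretch of samples.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
lossy = clean.copy()
lossy[1000:1640] = 0.0  # simulated gap

clean_mag = stft_magnitude(clean)
lossy_mag = stft_magnitude(lossy)
```

With identical spectrograms both STFT terms vanish and `generator_loss` reduces to the adversarial term alone, while the zeroed gap yields a strictly positive penalty. In a real training loop, `adv_loss` would come from the discriminator and `pred_mag` from the generator output.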