Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder
In order to efficiently transmit and store speech signals, speech codecs
create a minimally redundant representation of the input signal, which is then
decoded at the receiver with the best possible perceptual quality. In this work
we demonstrate that a neural network architecture based on VQ-VAE with a
WaveNet decoder can be used to perform very low bit-rate speech coding with
high reconstruction quality. A prosody-transparent, speaker-independent
model trained on the LibriSpeech corpus, coding audio at 1.6 kbps, exhibits
perceptual quality roughly halfway between the MELP codec at 2.4 kbps and
the AMR-WB codec at 23.05 kbps. In addition, when training on high-quality
recorded speech with the test speaker included in the training set, a model
coding speech at 1.6 kbps produces output of similar perceptual quality to that
generated by AMR-WB at 23.05 kbps.
Comment: ICASSP 2019
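
To make the coding mechanism concrete, below is a minimal sketch of the
vector-quantization bottleneck such a codec transmits over the channel. The
codebook size, feature dimension, and code rate are illustrative assumptions
chosen so the arithmetic lands at 1.6 kbps, not the paper's configuration,
and the WaveNet decoder is omitted.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                       # z: (batch, time, dim)
        # Squared distance from each encoder frame to every codebook vector.
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(-1)                      # discrete codes to transmit
        q = self.codebook(idx)
        # Straight-through estimator: gradients flow as if q were z.
        return z + (q - z).detach(), idx

# Illustrative rate arithmetic: 256 codes = 8 bits per code; at 200 codes
# per second of audio, 8 * 200 = 1600 bps = 1.6 kbps.
vq = VectorQuantizer()
z = torch.randn(1, 200, 64)                     # one second of encoder output
q, idx = vq(z)
print(idx.shape)                                # torch.Size([1, 200])
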
MFCC-GAN Codec: A New AI-based Audio Coding
In this paper, we propose AI-based audio coding using MFCC features in an
adversarial setting. We combine a conventional encoder with an adversarially
trained decoder to better reconstruct the original waveform. Since GANs
perform implicit density estimation, such models are less prone to
overfitting. We compare our work with five well-known codecs, namely AAC,
AC3, Opus, Vorbis, and Speex, operating at bitrates from 2 kbps to 128 kbps.
MFCCGAN_36k achieves state-of-the-art SNR despite a lower bitrate than
AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k. MFCCGAN_13k
likewise achieves an SNR of 27, equal to that of AC3_128k and AAC_112k, at a
significantly lower bitrate (13 kbps). MFCCGAN_36k also achieves higher
NISQA-MOS scores than AAC_48k while having a 20% lower bitrate, and
MFCCGAN_13k obtains a NISQA-MOS of 3.9, much higher than AAC_24k, AAC_32k,
AC3_32k, and AAC_48k. As future work, we suggest adopting loss functions
that optimize intelligibility and perceptual metrics within the MFCCGAN
structure, improving quality and intelligibility simultaneously.
Comment: Accepted in ABU Technical Review journal, 2023
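
As a rough illustration of pairing a conventional analysis front end with an
adversarially trained decoder, the sketch below extracts MFCC features with
torchaudio and trains a toy generator/discriminator with a standard
non-saturating GAN loss. The layer sizes and loss choice are assumptions for
the sketch, not the paper's published architecture.

import torch
import torch.nn as nn
import torchaudio

# Conventional encoder side: MFCC analysis of one second of dummy audio.
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)
feats = mfcc(torch.randn(1, 16000))             # (1, 13, frames)

# Toy adversarial decoder: the generator upsamples MFCC frames back to
# samples; the discriminator scores waveforms as real or reconstructed.
gen = nn.Sequential(nn.Conv1d(13, 64, 3, padding=1), nn.ReLU(),
                    nn.ConvTranspose1d(64, 1, 160, stride=160))
disc = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4), nn.ReLU(),
                     nn.Conv1d(16, 1, 15, stride=4))

fake = gen(feats)                               # reconstructed waveform
real = torch.randn_like(fake)                   # stand-in for reference audio
bce = nn.BCEWithLogitsLoss()

# The generator is trained purely from the discriminator's score, i.e. an
# implicit density estimate of real speech rather than an explicit one.
s_real, s_fake = disc(real), disc(fake.detach())
d_loss = bce(s_real, torch.ones_like(s_real)) + \
         bce(s_fake, torch.zeros_like(s_fake))
s_gen = disc(fake)
g_loss = bce(s_gen, torch.ones_like(s_gen))
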
Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding
Speech codecs learn compact representations of speech signals to facilitate
data transmission. Many recent deep neural network (DNN) based end-to-end
speech codecs achieve low bitrates and high perceptual quality at the cost of
model complexity. We propose a cross-module residual learning (CMRL)
pipeline as a module carrier, with each module reconstructing the residual
from its preceding modules. CMRL differs from other DNN-based speech codecs
in that, rather than modeling the speech compression problem in a single
large neural network, it optimizes a series of less complicated modules in a
two-phase training scheme. The proposed method shows better objective
performance than AMR-WB and the state-of-the-art DNN-based speech codec with
a similar network architecture. As an end-to-end model, it takes raw PCM
signals as input, but it is also compatible with linear predictive coding
(LPC), showing better subjective quality at high bitrates than AMR-WB and
Opus. The gain is achieved using only 0.9 million trainable parameters, a
significantly less complex architecture than other DNN-based codecs in the
literature.
Comment: Accepted for publication in INTERSPEECH 2019
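
The residual cascade is straightforward to express in code: each module
autoencodes whatever error the sum of its predecessors has left behind. The
sketch below uses tiny stand-in convolutional modules; the actual CMRL
modules are full learned codecs with quantized bottlenecks, and all sizes
here are illustrative.

import torch
import torch.nn as nn

class TinyModule(nn.Module):
    """Stand-in for one codec module; real modules quantize a bottleneck."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, hidden, 9, padding=4), nn.Tanh(),
                                 nn.Conv1d(hidden, 1, 9, padding=4))

    def forward(self, x):
        return self.net(x)

class CMRL(nn.Module):
    """Each module reconstructs the residual left by its preceding modules."""
    def __init__(self, num_modules=2):
        super().__init__()
        self.stages = nn.ModuleList(TinyModule() for _ in range(num_modules))

    def forward(self, x):
        recon = torch.zeros_like(x)
        for stage in self.stages:
            recon = recon + stage(x - recon)    # encode the remaining residual
        return recon

x = torch.randn(1, 1, 512)                      # one raw PCM frame (toy length)
print(CMRL()(x).shape)                          # torch.Size([1, 1, 512])
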
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows us to synthesize speech in a controllable
manner. We analyze various state-of-the-art self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we can reach a rate of
365 bits per second while providing better speech quality than the baseline
methods. Audio samples can be found at the following link:
speechbot.github.io/resynthesis.
Comment: In Proceedings of Interspeech 2021
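
For intuition on how three disentangled discrete streams can stay in the
hundreds of bits per second, here is the bitrate accounting in code. The
codebook sizes and frame rates are assumptions for the sketch, not the
paper's exact configuration; with these numbers the total lands near the
reported 365 bps.

import math

# Three disentangled discrete streams: (codebook size, frames per second).
streams = {
    "content": (100, 50),     # phonetic/content units
    "f0":      (32, 6.25),    # coarse prosody track
    "speaker": (1, 0.0),      # one embedding per utterance, ~0 bps amortized
}

total_bps = sum(math.log2(k) * fps for k, fps in streams.values() if k > 1)
print(f"{total_bps:.0f} bits per second")       # ~363 bps at these settings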