Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder
In order to efficiently transmit and store speech signals, speech codecs
create a minimally redundant representation of the input signal which is then
decoded at the receiver with the best possible perceptual quality. In this work
we demonstrate that a neural network architecture based on VQ-VAE with a
WaveNet decoder can be used to perform very low bit-rate speech coding with
high reconstruction quality. A prosody-transparent and speaker-independent
model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits
perceptual quality roughly halfway between that of the MELP codec at 2.4 kbps
and the AMR-WB codec at 23.05 kbps. In addition, when training on high-quality
recorded speech with the test speaker included in the training set, a model
coding speech at 1.6 kbps produces output of similar perceptual quality to that
generated by AMR-WB at 23.05 kbps.
Comment: ICASSP 201
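The key to such low rates is the discrete bottleneck: each encoder frame is
transmitted as a single codebook index, so the bit rate is simply the frame
rate times log2 of the codebook size. Below is a minimal Python sketch of this
quantization and rate arithmetic; the codebook size, feature dimension, and
frame rate are illustrative assumptions, not the paper's reported
configuration.

    import numpy as np

    def vq_quantize(frames, codebook):
        """Map each frame (N, D) to the index of its nearest codebook row (K, D)."""
        dist = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
        return dist.argmin(axis=1)

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(256, 64))    # K=256 codes of dimension 64 (assumed)
    frames = rng.normal(size=(100, 64))      # toy encoder output frames
    indices = vq_quantize(frames, codebook)  # only these indices are transmitted

    frame_rate_hz = 200.0                    # hypothetical encoder frame rate
    bitrate = frame_rate_hz * np.log2(codebook.shape[0])
    print(f"{bitrate / 1000:.1f} kbps")      # 1.6 kbps for these toy numbers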
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate a disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows us to synthesize speech in a controllable
manner. We analyze various state-of-the-art, self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we can reach a rate of 365
bits per second while providing better speech quality than the baseline
methods. Audio samples can be found at the following link:
speechbot.github.io/resynthesis.
Comment: In Proceedings of Interspeech 202
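Back-of-the-envelope arithmetic shows how separate discrete streams add up to
a rate in the hundreds of bits per second: content and prosody are token
streams, while speaker identity is a one-off embedding with no steady-state
cost. The vocabulary sizes and frame rates below are illustrative assumptions,
not the paper's exact setup.

    import math

    content_rate_hz, content_vocab = 50, 100   # assumed SSL content-unit stream
    f0_rate_hz, f0_bins = 6.25, 32             # assumed coarse prosody (F0) stream

    bps = (content_rate_hz * math.log2(content_vocab)
           + f0_rate_hz * math.log2(f0_bins))  # speaker embedding: one-off cost
    print(f"~{bps:.0f} bits per second")       # ~363 bps, same order as 365 bps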
A Two-Stage Training Framework for Joint Speech Compression and Enhancement
This paper considers the joint compression and enhancement problem for speech
signals in the presence of noise. Recently, the SoundStream codec, which relies
on end-to-end joint training of an encoder-decoder pair and a residual vector
quantizer with a combination of adversarial and reconstruction losses, has
shown very promising performance, especially in subjective perceptual quality. In
this work, we provide a theoretical result showing that, to simultaneously
achieve low distortion and high perceptual quality in the presence of noise,
there exists an optimal two-stage optimization procedure for the joint
compression and enhancement problem. This procedure first optimizes an
encoder-decoder pair using only a distortion loss and then fixes the encoder to
optimize a perceptual decoder using a perception loss. Based on this result, we
construct a two-stage training framework for the joint compression and
enhancement of noisy speech signals. Unlike existing training methods, which
are heuristic, the proposed two-stage method has a theoretical foundation.
Finally, experimental
results for various noise and bit-rate conditions are provided. The results
demonstrate that a codec trained by the proposed framework can outperform
SoundStream and other representative codecs in terms of both objective and
subjective evaluation metrics. Code is available at
https://github.com/jscscloris/SEStream.
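A minimal PyTorch sketch of the two-stage recipe described above; the toy
linear modules, the MSE distortion loss, and the spectral stand-in for the
perception loss are all illustrative assumptions (the actual codec uses
adversarial and perceptual criteria at waveform scale).

    import torch
    import torch.nn as nn

    def perception_loss(x, y):
        # Stand-in perceptual criterion (spectral magnitude distance); the
        # actual codec would use adversarial / perceptual losses here.
        return (torch.fft.rfft(x).abs() - torch.fft.rfft(y).abs()).pow(2).mean()

    enc = nn.Linear(256, 32)     # toy encoder
    dec_d = nn.Linear(32, 256)   # stage-1 decoder, trained for distortion
    dec_p = nn.Linear(32, 256)   # stage-2 perceptual decoder
    noisy, clean = torch.randn(64, 256), torch.randn(64, 256)

    # Stage 1: jointly optimize encoder + decoder with the distortion loss only.
    opt1 = torch.optim.Adam(list(enc.parameters()) + list(dec_d.parameters()))
    for _ in range(100):
        loss = nn.functional.mse_loss(dec_d(enc(noisy)), clean)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the encoder and optimize a decoder for perceptual quality.
    for p in enc.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(dec_p.parameters())
    for _ in range(100):
        loss = perception_loss(dec_p(enc(noisy)), clean)
        opt2.zero_grad(); loss.backward(); opt2.step()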
Variational Speech Waveform Compression to Catalyze Semantic Communications
We propose a novel neural waveform compression method to catalyze emerging
speech semantic communications. By introducing nonlinear transform and
variational modeling, we effectively capture the dependencies within speech
frames and estimate the probabilistic distribution of speech features more
accurately, giving rise to better compression performance. In particular, the
speech signals are analyzed and synthesized by a pair of nonlinear transforms,
yielding latent features. An entropy model with a hyperprior is built to
capture the probabilistic distribution of the latent features, followed by
quantization and entropy coding. The proposed waveform codec can be flexibly
optimized toward arbitrary rates, and another appealing feature is that it can
be easily optimized for any differentiable loss function, including perceptual
loss used in semantic communications. To further improve the fidelity, we
incorporate residual coding to mitigate the degradation arising from
quantization distortion in the latent space. Results indicate that, at the
same performance, the proposed method saves up to 27% of the coding rate
compared with the widely used adaptive multi-rate wideband (AMR-WB) codec as
well as emerging neural waveform coding methods.
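The hyperprior's role is to predict, per latent element, the distribution used
for entropy coding; the rate term is then the negative log probability mass of
each quantized value under that distribution. A minimal sketch with an assumed
Gaussian entropy model follows (the paper's exact model and parameterization
may differ).

    import torch

    def rate_bits(y_hat, mu, sigma):
        """Expected code length of quantized latents: probability mass of each
        value under N(mu, sigma) on its unit-width quantization bin."""
        n = torch.distributions.Normal(mu, sigma)
        p = n.cdf(y_hat + 0.5) - n.cdf(y_hat - 0.5)
        return -torch.log2(p.clamp_min(1e-9)).sum()

    y_hat = torch.randn(1, 128).round()   # toy quantized latent
    mu = torch.zeros_like(y_hat)          # hyperprior-predicted mean (assumed)
    sigma = torch.ones_like(y_hat)        # hyperprior-predicted scale (assumed)
    # Training would minimize rate + lambda * distortion (or a perceptual loss).
    print(f"{rate_bits(y_hat, mu, sigma).item():.0f} bits for this latent block")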
MFCC-GAN Codec: A New AI-based Audio Coding
In this paper, we propose AI-based audio coding using MFCC features in an
adversarial setting. We combine a conventional encoder with an adversarially
trained decoder to better reconstruct the original waveform. Since GANs perform
implicit density estimation, such models are less prone to overfitting. We
compare our work with five well-known codecs, namely AAC, AC3, Opus, Vorbis,
and Speex, operating at bitrates from 2 kbps to 128 kbps.
MFCCGAN_36k achieved state-of-the-art results in terms of SNR despite a lower
bitrate than AC3_128k, AAC_112k, Vorbis_48k, Opus_48k, and Speex_48k.
MFCCGAN_13k also achieved a high SNR of 27, equal to that of AC3_128k and
AAC_112k, while having a significantly lower bitrate (13 kbps). MFCCGAN_36k
achieved higher NISQA-MOS results than AAC_48k while having a 20% lower
bitrate. Furthermore, MFCCGAN_13k obtained a NISQA-MOS of 3.9, much higher than
that of AAC_24k, AAC_32k, AC3_32k, and AAC_48k.
As future work, we suggest adopting loss functions that optimize
intelligibility and perceptual metrics within the MFCCGAN structure to improve
quality and intelligibility simultaneously.
Comment: Accepted in ABU Technical Review journal 2023/
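For reference, the SNR comparisons above follow the conventional definition; a
minimal implementation with toy signals is sketched below (assuming the usual
dB scale; the NISQA-MOS figures come from a learned quality predictor and are
not reproduced here).

    import numpy as np

    def snr_db(reference, estimate):
        """Signal-to-noise ratio in dB between a reference and a reconstruction."""
        noise = reference - estimate
        return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

    rng = np.random.default_rng(0)
    x = rng.normal(size=16000)                 # 1 s of toy "speech" at 16 kHz
    x_hat = x + 0.04 * rng.normal(size=16000)  # decoder output with small error
    print(f"SNR = {snr_db(x, x_hat):.1f} dB")  # about 28 dB for this toy case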