A Two-Stage Training Framework for Joint Speech Compression and Enhancement
This paper considers the joint compression and enhancement problem for speech
signals in the presence of noise. Recently, the SoundStream codec, which relies
on end-to-end joint training of an encoder-decoder pair and a residual vector
quantizer by a combination of adversarial and reconstruction losses, has shown
very promising performance, especially in subjective perception quality. In
this work, we provide a theoretical result showing that, to simultaneously
achieve low distortion and high perceptual quality in the presence of noise,
there exists an optimal two-stage optimization procedure for the joint
compression and enhancement problem. This procedure first optimizes an
encoder-decoder pair using only a distortion loss and then fixes the encoder to
optimize a perceptual decoder using a perception loss. Based on this result, we
construct a two-stage training framework for joint compression and enhancement
of noisy speech signals. Unlike existing training methods, which are heuristic, the proposed
two-stage training method has a theoretical foundation. Finally, experimental
results for various noise and bit-rate conditions are provided. The results
demonstrate that a codec trained by the proposed framework can outperform
SoundStream and other representative codecs in terms of both objective and
subjective evaluation metrics. Code is available at
https://github.com/jscscloris/SEStream
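A minimal sketch of the two-stage procedure described in this abstract, using toy one-dimensional convolutional modules and plain MSE in place of the actual distortion and perception losses; the module sizes, optimizer settings, and data are illustrative assumptions, not the SEStream configuration.

    import torch
    import torch.nn as nn

    enc = nn.Conv1d(1, 16, kernel_size=4, stride=2, padding=1)             # toy encoder
    dec_d = nn.ConvTranspose1d(16, 1, kernel_size=4, stride=2, padding=1)  # stage-1 (distortion) decoder
    dec_p = nn.ConvTranspose1d(16, 1, kernel_size=4, stride=2, padding=1)  # stage-2 (perceptual) decoder

    noisy = torch.randn(8, 1, 1024)   # noisy input speech (stand-in data)
    clean = torch.randn(8, 1, 1024)   # clean reference speech (stand-in data)

    # Stage 1: jointly optimize encoder and decoder with a distortion loss only.
    opt1 = torch.optim.Adam(list(enc.parameters()) + list(dec_d.parameters()), lr=1e-3)
    for _ in range(10):
        opt1.zero_grad()
        nn.functional.mse_loss(dec_d(enc(noisy)), clean).backward()
        opt1.step()

    # Stage 2: freeze the encoder and train a separate perceptual decoder.
    for p in enc.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(dec_p.parameters(), lr=1e-3)
    for _ in range(10):
        opt2.zero_grad()
        # A real system would use an adversarial / perceptual loss here;
        # MSE is kept only so the sketch runs end to end.
        nn.functional.mse_loss(dec_p(enc(noisy)), clean).backward()
        opt2.step()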
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows speech to be synthesized in a controllable
manner. We analyze various state-of-the-art, self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we reach a rate of 365
bits per second while providing better speech quality than the baseline
methods. Audio samples can be found under the following link:
speechbot.github.io/resynthesis
Comment: In Proceedings of Interspeech 2021
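As a rough illustration of how such discrete-unit streams can land at a few hundred bits per second, each stream contributes roughly frame_rate * log2(codebook_size) bits per second; the frame rates and codebook sizes below are hypothetical, not the paper's configuration.

    import math

    def stream_bps(frame_rate_hz, codebook_size):
        # Bits per second contributed by one discrete stream.
        return frame_rate_hz * math.log2(codebook_size)

    content_bps = stream_bps(50, 100)     # e.g. 50 Hz content units, 100-entry codebook (hypothetical)
    prosody_bps = stream_bps(6.25, 32)    # coarsely quantized F0 stream (hypothetical)
    total_bps = content_bps + prosody_bps # a fixed speaker code adds a negligible one-off cost
    print(round(total_bps))               # on the order of a few hundred bits per second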
Variational Speech Waveform Compression to Catalyze Semantic Communications
We propose a novel neural waveform compression method to catalyze emerging
speech semantic communications. By introducing nonlinear transform and
variational modeling, we effectively capture the dependencies within speech
frames and estimate the probabilistic distribution of the speech feature more
accurately, giving rise to better compression performance. In particular, the
speech signals are analyzed and synthesized by a pair of nonlinear transforms,
yielding latent features. An entropy model with hyperprior is built to capture
the probabilistic distribution of latent features, followed by quantization
and entropy coding. The proposed waveform codec can be optimized flexibly
towards an arbitrary rate; another appealing feature is that it can be
easily optimized for any differentiable loss function, including perceptual
loss used in semantic communications. To further improve the fidelity, we
incorporate residual coding to mitigate the degradation arising from
quantization distortion in the latent space. Results indicate that, at the
same performance, the proposed method saves up to 27% of the coding rate
compared with the widely used adaptive multi-rate wideband (AMR-WB) codec as
well as emerging neural waveform coding methods.
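A minimal sketch of the variational coding pipeline outlined above, assuming toy convolutional analysis/synthesis transforms, an additive-uniform-noise proxy for quantization, and a Gaussian entropy model whose scale is predicted by a hyperprior; the shapes, modules, and rate weight are illustrative, and the side-information rate of the hyper-latent is omitted for brevity.

    import math
    import torch
    import torch.nn as nn

    analysis  = nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2)             # toy analysis transform: x -> y
    synthesis = nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2)     # toy synthesis transform: y_hat -> x_hat
    hyper_enc = nn.Conv1d(32, 8, kernel_size=4, stride=2, padding=1)              # hyperprior encoder: y -> z
    hyper_dec = nn.ConvTranspose1d(8, 32, kernel_size=4, stride=2, padding=1)     # hyperprior decoder: z -> scale of p(y | z)

    x = torch.randn(4, 1, 1024)                     # a batch of speech frames (stand-in data)
    y = analysis(x)
    y_hat = y + (torch.rand_like(y) - 0.5)          # uniform-noise proxy for quantization during training
    scale = nn.functional.softplus(hyper_dec(hyper_enc(y))) + 1e-6

    # Rate term: negative log-likelihood of the noisy latents under the
    # conditional Gaussian entropy model, converted to bits.
    nll = 0.5 * (y_hat / scale) ** 2 + torch.log(scale) + 0.5 * math.log(2 * math.pi)
    rate_bits = nll.sum() / math.log(2)

    # Distortion term: plain MSE here; any differentiable (e.g. perceptual)
    # loss could be substituted, as the abstract notes.
    distortion = nn.functional.mse_loss(synthesis(y_hat), x)

    (distortion + 0.01 * rate_bits).backward()      # the weight trades rate against distortion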
AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
We propose a method named AudioFormer, which learns audio feature
representations through the acquisition of discrete acoustic codes and
subsequently fine-tunes them for audio classification tasks. Initially, we
introduce a novel perspective by considering the audio classification task as a
form of natural language understanding (NLU). Leveraging an existing neural
audio codec model, we generate discrete acoustic codes and utilize them to train
a masked language model (MLM), thereby obtaining audio feature representations.
Furthermore, we pioneer the integration of a Multi-Positive sample Contrastive
(MPC) learning approach. This method enables the learning of joint
representations among multiple discrete acoustic codes within the same audio
input. In our experiments, we treat discrete acoustic codes as textual data and
train a masked language model using a cloze-like methodology, ultimately
deriving high-quality audio representations. Notably, the MPC learning technique
effectively captures collaborative representations among distinct positive
samples. Our research outcomes demonstrate that AudioFormer attains
significantly improved performance compared to prevailing monomodal audio
classification models across multiple datasets, and even outperforms
audio-visual multimodal classification models on select datasets.
Specifically, our approach achieves remarkable results on datasets including
AudioSet (2M, 20K) and FSD50K, with performance scores of 53.9, 45.1, and
65.6, respectively. We have openly shared both the code and models:
https://github.com/LZH-0225/AudioFormer.git
Comment: 9 pages, 4 figures
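A minimal sketch of the cloze-style masked-language-model step on discrete acoustic codes, in the spirit of the description above; the vocabulary size, mask ratio, and tiny Transformer are illustrative assumptions, not AudioFormer's actual configuration, and the MPC contrastive objective is not shown.

    import torch
    import torch.nn as nn

    vocab, mask_id, seq_len = 1024, 1024, 128        # codec codebook size plus an extra [MASK] token id
    embed = nn.Embedding(vocab + 1, 256)
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    head = nn.Linear(256, vocab)

    codes = torch.randint(0, vocab, (8, seq_len))    # discrete acoustic codes for a batch of clips (stand-in data)
    mask = torch.rand(8, seq_len) < 0.15             # cloze-style random masking
    inputs = codes.masked_fill(mask, mask_id)

    logits = head(encoder(embed(inputs)))            # predict the original codes at every position
    loss = nn.functional.cross_entropy(logits[mask], codes[mask])
    loss.backward()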