M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec
We introduce M3-AUDIODEC, an innovative neural spatial audio codec designed
for efficient compression of multi-channel (binaural) speech in both single and
multi-speaker scenarios, while retaining the spatial location information of
each speaker. The model is versatile: it can be configured and trained for a
predetermined set of multi-channel, multi-speaker, and multi-spatial
overlapping speech conditions. Key contributions are as follows: 1) We extend
previous neural codecs from single-channel to multi-channel audio. 2) Our
model can compress and decode overlapping speech. 3) A novel architecture
compresses speech content and spatial cues separately, so that each speaker's
spatial context is preserved after decoding. 4) M3-AUDIODEC reduces the
bandwidth for compressing two-channel speech by 48% compared with compressing
each binaural channel individually. Operating at 12.6 kbps, it outperforms
Opus at 24 kbps and AUDIODEC at 24 kbps by 37% and 52%, respectively. In our
evaluation, we use speech enhancement and room acoustic metrics to assess the
accuracy of the clean speech and spatial cue estimates from M3-AUDIODEC.
Audio demonstrations and source code are available online at
https://github.com/anton-jeran/MULTI-AUDIODEC
Comment: More results and source code are available at
https://anton-jeran.github.io/MAD
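To make contribution 3 concrete, below is a minimal PyTorch sketch of the
idea of coding speech content and spatial cues in separate branches before a
joint binaural decoder. The class name TwoBranchSpatialCodec, the mono
mixdown as content carrier, and all layer sizes are our own illustrative
assumptions, not the authors' architecture; quantization is omitted.

    import torch
    import torch.nn as nn

    class TwoBranchSpatialCodec(nn.Module):
        def __init__(self, hidden=64):
            super().__init__()
            # Content branch: a mono mixdown is encoded into a compact code.
            self.content_enc = nn.Conv1d(1, hidden, kernel_size=7, stride=4, padding=3)
            # Spatial branch: a much smaller code carries inter-channel cues.
            self.spatial_enc = nn.Conv1d(2, hidden // 4, kernel_size=7, stride=4, padding=3)
            # The decoder consumes both codes and reconstructs two channels.
            self.dec = nn.ConvTranspose1d(hidden + hidden // 4, 2,
                                          kernel_size=8, stride=4, padding=2)

        def forward(self, x):                    # x: (batch, 2, time)
            mono = x.mean(dim=1, keepdim=True)   # content carrier
            z_content = self.content_enc(mono)   # quantization omitted for brevity
            z_spatial = self.spatial_enc(x)      # carries speaker location cues
            z = torch.cat([z_content, z_spatial], dim=1)
            return self.dec(z)                   # (batch, 2, time)

    x = torch.randn(1, 2, 16000)                 # 1 s of binaural audio at 16 kHz
    print(TwoBranchSpatialCodec()(x).shape)      # torch.Size([1, 2, 16000])

Giving the spatial branch a much smaller code than the content branch is one
way such an architecture can save bandwidth relative to coding each binaural
channel in full.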
Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding
Speech codecs learn compact representations of speech signals to facilitate
data transmission. Many recent deep neural network (DNN) based end-to-end
speech codecs achieve low bitrates and high perceptual quality at the cost of
model complexity. We propose a cross-module residual learning (CMRL) pipeline
as a module carrier with each module reconstructing the residual from its
preceding modules. CMRL differs from other DNN-based speech codecs in that,
rather than modeling the speech compression problem with a single large
neural network, it optimizes a series of less-complicated modules in a
two-phase training scheme. The proposed method shows better objective
performance than AMR-WB and the state-of-the-art DNN-based speech codec with
a similar network architecture. As an end-to-end model, it takes raw PCM
signals as input, yet it is also compatible with linear predictive coding
(LPC), showing better subjective quality at high bitrates than AMR-WB and
OPUS. The gain is achieved with only 0.9 million trainable parameters, a
significantly less complex
architecture than other DNN-based codecs in the literature.
Comment: Accepted for publication in INTERSPEECH 2019
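The cascade idea can be illustrated in a few lines: each lightweight module
codes the residual left by the modules before it, and the decoded signal is
the sum of all module outputs. This is a toy PyTorch sketch under our own
assumptions (the stand-in autoencoder, the layer sizes, and the name
CMRLCascade); the paper's modules additionally include quantization and are
optimized with the two-phase training scheme mentioned above.

    import torch
    import torch.nn as nn

    def make_codec_module(hidden=32):
        # Stand-in for one coding module: encode then decode; a quantizer
        # would sit between the two halves in a real codec.
        return nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, 1, kernel_size=10, stride=2, padding=4),
        )

    class CMRLCascade(nn.Module):
        def __init__(self, n_modules=3):
            super().__init__()
            self.stages = nn.ModuleList([make_codec_module() for _ in range(n_modules)])

        def forward(self, x):              # x: (batch, 1, time)
            residual = x
            recon = torch.zeros_like(x)
            for stage in self.stages:
                y = stage(residual)        # each module codes what is still missing
                recon = recon + y
                residual = residual - y    # pass the remaining error downstream
            return recon

    x = torch.randn(1, 1, 512)
    print(CMRLCascade()(x).shape)          # torch.Size([1, 1, 512])

Because each stage only has to model what its predecessors missed, every
module can stay small, which is where the low parameter count comes from.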
Neural Feature Predictor and Discriminative Residual Coding for Low-Bitrate Speech Coding
Low and ultra-low-bitrate neural speech coding achieves unprecedented coding
gain by generating speech signals from compact speech features. This paper
introduces additional coding efficiency in neural speech coding by reducing the
temporal redundancy existing in the frame-level feature sequence via a
recurrent neural predictor. The prediction yields a low-entropy residual
representation, which we code discriminatively based on its contribution to
the signal reconstruction. The harmonization of feature prediction and
discriminative coding results in a dynamic bit allocation algorithm that spends
more bits on unpredictable but rare events. As a result, we develop a scalable,
lightweight, low-latency, and low-bitrate neural speech coding system. We
demonstrate the advantage of the proposed methods using LPCNet as the neural
vocoder. While the proposed method guarantees causality in its prediction,
subjective tests and feature space analysis show that our model achieves
superior coding efficiency compared to LPCNet and Lyra V2 at very low
bitrates.
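The prediction-plus-residual idea can be sketched compactly: a causal GRU
predicts the next feature frame, only the prediction residual is quantized,
and frames with large residuals (the unpredictable but rare events) get a
finer quantizer, mimicking dynamic bit allocation. Everything here (the
class name PredictiveResidualCoder, the uniform quantizer, the threshold,
and all sizes) is an illustrative assumption, not the paper's implementation.

    import torch
    import torch.nn as nn

    class PredictiveResidualCoder(nn.Module):
        def __init__(self, feat_dim=20, hidden=64, threshold=1.0):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)  # causal predictor
            self.proj = nn.Linear(hidden, feat_dim)
            self.threshold = threshold

        def quantize(self, r, fine):
            # Uniform quantizer; a smaller step spends more bits on the frame.
            step = 0.05 if fine else 0.2
            return torch.round(r / step) * step

        def forward(self, feats):                  # feats: (batch, frames, feat_dim)
            h = None
            prev = torch.zeros_like(feats[:, :1])  # causal start: no look-ahead
            coded = []
            for t in range(feats.size(1)):
                out, h = self.rnn(prev, h)
                pred = self.proj(out)              # prediction of frame t
                r = feats[:, t:t + 1] - pred       # low-entropy residual
                fine = bool(r.abs().mean() > self.threshold)  # hard frame?
                frame = pred + self.quantize(r, fine)
                coded.append(frame)
                prev = frame                       # feed back the quantized frame
            return torch.cat(coded, dim=1)

    feats = torch.randn(1, 100, 20)                # e.g. 100 frames of vocoder features
    print(PredictiveResidualCoder()(feats).shape)  # torch.Size([1, 100, 20])

Feeding the quantized frame back into the predictor keeps the encoder and
decoder in the same state, a standard closed-loop design in predictive
coding.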