AMRConvNet: AMR-Coded Speech Enhancement Using Convolutional Neural Networks
Speech is converted to digital signals using speech coding for efficient
transmission. However, this often lowers the quality and bandwidth of speech.
This paper explores the application of convolutional neural networks for
Artificial Bandwidth Expansion (ABE) and speech enhancement on coded speech,
particularly Adaptive Multi-Rate (AMR) used in 2G cellular phone calls. In this
paper, we introduce AMRConvNet: a convolutional neural network that performs
ABE and speech enhancement on speech encoded with AMR. The model operates
directly in the time domain for both input and output speech but is
optimized with a combined time-domain reconstruction loss and
frequency-domain perceptual loss (a minimal sketch of such a loss follows
this abstract). AMRConvNet yielded an average improvement of 0.425 Mean
Opinion Score - Listening Quality Objective (MOS-LQO) points at the 4.75
kbps AMR bitrate, and 0.073 MOS-LQO points at the 12.2 kbps bitrate.
AMRConvNet was also robust across AMR bitrate inputs. Finally, an
ablation test showed that our combined time-domain and frequency-domain
loss leads to slightly higher MOS-LQO and faster training convergence
than using either loss alone.
Comment: IEEE SMC 202
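The combined objective described in this abstract is easy to illustrate. The sketch below is a minimal PyTorch version of a combined time-domain reconstruction loss plus a frequency-domain (log-STFT-magnitude) perceptual loss; the choice of L1 distances, the FFT size, hop length, and the weighting factor `alpha` are illustrative assumptions, not the paper's published settings.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, alpha=0.5, n_fft=512, hop=128):
    """pred, target: (batch, samples) time-domain waveforms.

    Returns a weighted sum of a time-domain reconstruction loss and a
    frequency-domain perceptual loss, in the spirit of the abstract.
    """
    # Time-domain reconstruction loss: L1 on raw waveform samples.
    time_loss = F.l1_loss(pred, target)

    # Frequency-domain perceptual loss: L1 between log-magnitude
    # spectrograms of the predicted and target waveforms.
    window = torch.hann_window(n_fft, device=pred.device)

    def log_mag(x):
        spec = torch.stft(x, n_fft, hop, window=window, return_complex=True)
        return torch.log1p(spec.abs())

    freq_loss = F.l1_loss(log_mag(pred), log_mag(target))

    # `alpha` trades off the two terms; 0.5 is an arbitrary assumption.
    return alpha * time_loss + (1.0 - alpha) * freq_loss
```

Weighting the two terms lets the model match waveforms sample-by-sample while still penalizing spectral artifacts that a pure time-domain loss tends to miss.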
Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
This paper presents a waveform modeling and generation method using
hierarchical recurrent neural networks (HRNN) for speech bandwidth extension
(BWE). Unlike conventional BWE methods, which predict spectral parameters
for reconstructing wideband speech waveforms, this BWE method models and
predicts waveform samples directly, without using vocoders. Inspired by
SampleRNN, an unconditional neural audio generator, the HRNN model
represents the distribution of each wideband or high-frequency waveform sample
conditioned on the input narrowband waveform samples using a neural network
composed of long short-term memory (LSTM) layers and feed-forward (FF) layers.
The LSTM layers form a hierarchical structure in which each layer
operates at a specific temporal resolution, efficiently capturing
long-span dependencies in the temporal sequences (a minimal sketch of
this hierarchy follows the abstract). Furthermore, additional conditions, such as the
bottleneck (BN) features derived from narrowband speech using a deep neural
network (DNN)-based state classifier, are employed as auxiliary input to
further improve the quality of generated wideband speech. The experimental
results of comparing several waveform modeling methods show that the HRNN-based
method can achieve better speech quality and run-time efficiency than the
dilated convolutional neural network (DCNN)-based method and the plain
sample-level recurrent neural network (SRNN)-based method. Our proposed method
also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in
terms of the subjective quality of the reconstructed wideband speech.
Comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing
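To make the hierarchical structure concrete, here is a minimal, hedged PyTorch sketch of the two-tier idea: a frame-rate LSTM summarizes blocks of narrowband samples, its states are upsampled back to the sample rate and attached to each narrowband sample as a condition, and a sample-rate LSTM with a feed-forward head emits wideband samples. The layer sizes, frame length, and regression output are illustrative assumptions; the paper instead models a distribution over each wideband sample and adds auxiliary BN-feature conditions.

```python
import torch
import torch.nn as nn

class HRNNSketch(nn.Module):
    def __init__(self, frame=16, hidden=128):
        super().__init__()
        self.frame = frame
        # Coarse tier: one LSTM step per frame of narrowband samples.
        self.frame_lstm = nn.LSTM(frame, hidden, batch_first=True)
        # Fine tier: one LSTM step per sample, conditioned on the coarse tier.
        self.sample_lstm = nn.LSTM(1 + hidden, hidden, batch_first=True)
        # Feed-forward head mapping hidden states to one wideband sample each.
        self.ff = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                nn.Linear(hidden, 1))

    def forward(self, nb):
        # nb: (batch, samples) narrowband waveform; samples must be a
        # multiple of `frame` in this simplified sketch.
        b, t = nb.shape
        frames = nb.view(b, t // self.frame, self.frame)
        cond, _ = self.frame_lstm(frames)            # (b, t/frame, hidden)
        # Upsample frame-rate conditions back to the sample rate.
        cond = cond.repeat_interleave(self.frame, dim=1)
        x = torch.cat([nb.unsqueeze(-1), cond], dim=-1)
        h, _ = self.sample_lstm(x)
        return self.ff(h).squeeze(-1)                # (batch, samples)

# Example: map a batch of narrowband waveforms to wideband estimates.
# wb = HRNNSketch()(torch.randn(2, 1600))
```

Running the coarse tier at frame rate is what keeps the model efficient: long-span structure is carried by a short sequence, while the sample-level tier only has to model local detail.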