Coded Speech Quality Measurement by a Non-Intrusive PESQ-DNN
Wideband codecs such as AMR-WB or EVS are widely used in (mobile) speech
communication. Evaluation of coded speech quality is often performed
subjectively by an absolute category rating (ACR) listening test. However, the
ACR test is impractical for online monitoring of speech communication networks.
Perceptual evaluation of speech quality (PESQ) is one of the widely used
metrics instrumentally predicting the results of an ACR test. However, the PESQ
algorithm requires an original reference signal, which is usually unavailable
in network monitoring, thus limiting its applicability. NISQA is a new
non-intrusive neural-network-based speech quality measure, focusing on
super-wideband speech signals. In this work, however, we aim at predicting the
well-known PESQ metric using a non-intrusive PESQ-DNN model. We illustrate the
potential of this model by predicting the PESQ scores of wideband-coded speech
obtained from AMR-WB or EVS codecs operating at different bitrates in noisy,
tandeming, and error-prone transmission conditions. We compare our methods with
the state-of-the-art network topologies of QualityNet, WAWEnet, and DNSMOS --
all applied to PESQ prediction -- by measuring the mean absolute error (MAE)
and the linear correlation coefficient (LCC). The proposed PESQ-DNN offers the
best total MAE and LCC of 0.11 and 0.92, respectively, in conditions without
frame loss, and still is best when including frame loss. Note that our model
could be similarly used to non-intrusively predict POLQA or other (intrusive)
metrics. Upon article acceptance, code will be provided at GitHub.
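The two evaluation measures named in the abstract above, mean absolute error (MAE) and linear correlation coefficient (LCC), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the score values are made up for illustration, and LCC is assumed here to be the Pearson correlation coefficient.

```python
import math


def mean_absolute_error(reference, predicted):
    """Average absolute difference between reference and predicted scores."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def linear_correlation(reference, predicted):
    """Pearson correlation coefficient (assumed meaning of LCC)."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)


# Hypothetical values: intrusive PESQ scores and non-intrusive predictions.
pesq_true = [1.5, 2.0, 3.0, 4.0]
pesq_pred = [1.6, 1.9, 3.2, 3.9]

mae = mean_absolute_error(pesq_true, pesq_pred)
lcc = linear_correlation(pesq_true, pesq_pred)
print(f"MAE = {mae:.3f}, LCC = {lcc:.3f}")
```

A lower MAE and an LCC closer to 1 both indicate that the non-intrusive predictions track the intrusive PESQ scores more closely.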
Efficient Acoustic Echo Suppression with Condition-Aware Training
The topic of deep acoustic echo control (DAEC) has seen many approaches with
various model topologies in recent years. Convolutional recurrent networks
(CRNs), consisting of a convolutional encoder and decoder encompassing a
recurrent bottleneck, are repeatedly employed due to their ability to preserve
near-end speech even in double-talk (DT) conditions. However, past architectures
are either computationally complex or trade smaller model size for a
decrease in performance. We propose an improved CRN topology which, compared to
other realizations of this class of architectures, not only saves parameters
and computational complexity, but also shows improved performance in DT,
outperforming both baseline architectures FCRN and CRUSE. Striving for a
condition-aware training, we also demonstrate the importance of a high
proportion of double-talk and the lack of benefit from near-end-only speech in DAEC
training data. Finally, we show how to control the trade-off between aggressive
echo suppression and near-end speech preservation by fine-tuning with
condition-aware component loss functions.
Comment: 5 pages, accepted to WASPAA 202
Employing Real Training Data for Deep Noise Suppression
Most deep noise suppression (DNS) models are trained with reference-based
losses requiring access to clean speech. However, sometimes an additive
microphone model is insufficient for real-world applications. Accordingly, ways
to use real training data in supervised learning for DNS models promise to
reduce a potential training/inference mismatch. Employing real data for DNS
training requires either generative approaches or a reference-free loss without
access to the corresponding clean speech. In this work, we propose to employ an
end-to-end non-intrusive deep neural network (DNN), named PESQ-DNN, to estimate
perceptual evaluation of speech quality (PESQ) scores of enhanced real data. It
provides a reference-free perceptual loss for employing real data during DNS
training, maximizing the PESQ scores. Furthermore, we use an epoch-wise
alternating training protocol, updating the DNS model on real data, followed by
PESQ-DNN updating on synthetic data. The DNS model trained with the PESQ-DNN
employing real data outperforms all reference methods employing only synthetic
training data. On synthetic test data, our proposed method outperforms the
Interspeech 2021 DNS Challenge baseline by a significant 0.32 PESQ points. Both
on synthetic and real test data, the proposed method beats the baseline by 0.05
DNSMOS points - although PESQ-DNN optimizes for a different perceptual metric
- …
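The epoch-wise alternating protocol described in the abstract above can be sketched as a plain scheduling loop. This is only an illustration under stated assumptions: the two update callables are hypothetical placeholders for whatever training steps are actually used, and the loss shown (negating the predicted PESQ score so that minimizing it maximizes PESQ) is an assumed form, not one confirmed by the source.

```python
def reference_free_loss(predicted_pesq):
    """Assumed loss form: minimizing the negated score maximizes predicted PESQ."""
    return -predicted_pesq


def alternating_schedule(num_epochs, update_dns_on_real, update_pesq_dnn_on_synthetic):
    """Run one DNS update phase on real data, then one PESQ-DNN update phase
    on synthetic data, in every epoch. Both callables are hypothetical."""
    history = []
    for epoch in range(num_epochs):
        # Phase 1: update the DNS model on real data using the reference-free loss.
        update_dns_on_real(epoch)
        history.append((epoch, "dns_on_real"))
        # Phase 2: update the PESQ-DNN on synthetic data, where clean references exist.
        update_pesq_dnn_on_synthetic(epoch)
        history.append((epoch, "pesq_dnn_on_synthetic"))
    return history


# Usage with stand-in update functions that only record the call order.
calls = []
history = alternating_schedule(
    2,
    lambda epoch: calls.append(("dns", epoch)),
    lambda epoch: calls.append(("pesq_dnn", epoch)),
)
print(history)
```

In a real setup, each callable would iterate over its own data loader and optimizer; the sketch shows only the ordering of the two phases within an epoch.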