Attention-based Speech Enhancement Using Human Quality Perception Modelling
Perceptually-inspired objective functions such as the perceptual evaluation
of speech quality (PESQ), signal-to-distortion ratio (SDR), and short-time
objective intelligibility (STOI) have recently been used to optimize the
performance of deep-learning-based speech enhancement algorithms. These
objective functions, however, do not always strongly correlate with a
listener's assessment of perceptual quality, so optimizing with these measures
often results in poorer performance in real-world scenarios. In this work, we
propose an attention-based enhancement approach that uses learned speech
embedding vectors from a mean opinion score (MOS) prediction model and a speech
enhancement module to jointly enhance noisy speech. The MOS prediction model
estimates the perceptual MOS of speech quality, as assessed by human listeners,
directly from the audio signal. The enhancement module also employs a quantized
language model that enforces spectral constraints for better speech realism and
performance. We train the model on real-world noisy speech captured in
everyday environments and test it on unseen corpora. The
results show that our proposed approach significantly outperforms other
approaches that are optimized with objective measures, and that the predicted
quality scores strongly correlate with human judgments.
Comment: 11 pages, 4 figures, 3 tables, submitted to journal TASLP 202
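
The abstract does not specify how the MOS-model embeddings are combined with the enhancement module, so the following is only a minimal sketch of one plausible reading: scaled dot-product cross-attention in which enhancement features act as queries and MOS embeddings act as keys and values. All names, dimensions, and the random projection matrices (`cross_attention`, `enh_feats`, `mos_emb`, `w_q`, `w_k`, `w_v`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(enh_feats, mos_emb, w_q, w_k, w_v):
    """Scaled dot-product cross-attention (assumed fusion mechanism).

    enh_feats: (T, d_enh) frame-level enhancement features, used as queries.
    mos_emb:   (S, d_mos) MOS-model embeddings, used as keys and values.
    Returns the attended context (T, d) and the attention weights (T, S).
    """
    q = enh_feats @ w_q                      # (T, d)
    k = mos_emb @ w_k                        # (S, d)
    v = mos_emb @ w_v                        # (S, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T, S)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    context = weights @ v                    # (T, d)
    return context, weights


# Hypothetical sizes; random values stand in for learned parameters.
rng = np.random.default_rng(0)
T, S, d_enh, d_mos, d = 50, 50, 64, 32, 16
enh_feats = rng.standard_normal((T, d_enh))
mos_emb = rng.standard_normal((S, d_mos))
w_q = rng.standard_normal((d_enh, d)) / np.sqrt(d_enh)
w_k = rng.standard_normal((d_mos, d)) / np.sqrt(d_mos)
w_v = rng.standard_normal((d_mos, d)) / np.sqrt(d_mos)

context, weights = cross_attention(enh_feats, mos_emb, w_q, w_k, w_v)

# One simple way to fuse: concatenate the attended quality context
# with the original enhancement features, frame by frame.
fused = np.concatenate([enh_feats, context], axis=-1)  # (T, d_enh + d)
```

In a trained system the projections would be learned jointly with the enhancement objective; concatenation is just one fusion choice, and a residual sum or gating would serve the same illustrative purpose.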