7,103 research outputs found
New single-ended objective measure for non-intrusive speech quality evaluation
This article proposes a new output-based method for non-intrusive assessment of the speech quality of voice communication systems and evaluates its performance. The method requires access to the processed (degraded) speech only, and is based on measuring perception-motivated objective auditory distances between the voiced parts of the output speech and appropriately matching references extracted from a pre-formulated codebook. The codebook is formed by optimally clustering a large number of parametric speech vectors extracted from a database of clean speech records. The auditory distances are then mapped onto objective listening-quality Mean Opinion Scores. An efficient data-mining tool known as the self-organizing map (SOM) performs the required clustering and the mapping/reference-matching processes. To obtain a perception-based, speaker-independent parametric representation of the speech, three domain transformation techniques have been investigated. The first is based on a perceptual linear prediction (PLP) model, the second utilises a Bark spectrum (BS) analysis and the third utilises mel-frequency cepstrum coefficients (MFCC). Reported evaluation results show that the proposed method provides high correlation with subjective listening quality scores, yielding accuracy similar to that of ITU-T P.563 while maintaining a relatively low computational complexity. Results also demonstrate that the method outperforms PESQ in a number of distortion conditions, such as those of speech degraded by channel impairments.
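The codebook-matching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook is assumed to be given (the abstract builds it with a SOM), plain Euclidean distance stands in for the perception-motivated auditory distance, and the linear `distance_to_mos` mapping with its `slope` and `intercept` values is purely hypothetical.

```python
import math


def nearest_codeword_distance(frame, codebook):
    """Distance from one feature vector to its closest codebook entry.

    Euclidean distance is an assumption made for illustration only.
    """
    return min(math.dist(frame, codeword) for codeword in codebook)


def mean_codebook_distance(frames, codebook):
    """Average the nearest-codeword distance over all frames of an utterance."""
    distances = [nearest_codeword_distance(f, codebook) for f in frames]
    return sum(distances) / len(distances)


def distance_to_mos(distance, slope=-1.0, intercept=4.5):
    """Hypothetical linear mapping from distance to a score clipped to [1, 5]."""
    return max(1.0, min(5.0, intercept + slope * distance))


# Toy data: two "clean" codewords and two frames of degraded speech.
codebook = [(0.0, 0.0), (3.0, 4.0)]
frames = [(0.0, 0.0), (3.0, 0.0)]
score = distance_to_mos(mean_codebook_distance(frames, codebook))
```

In a real system the frames would be PLP, Bark-spectrum or MFCC vectors taken from voiced segments, and the mapping to a quality score would be learned from subjective ratings rather than fixed by hand.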
Non-intrusive speech quality assessment using context-aware neural networks
To meet the human perceived quality of experience (QoE) while communicating over various Voice over Internet Protocol (VoIP) applications (for example Google Meet, Microsoft Skype, Apple FaceTime, etc.), a precise speech quality assessment metric is needed. The metric should be able to detect and segregate the different types of noise degradation present in the surroundings before measuring and monitoring the quality of speech in real time. Our research is motivated by the lack of clear evidence of a speech quality metric that first distinguishes between different types of noise degradation before providing a speech quality prediction. To that end, this paper presents a novel non-intrusive speech quality assessment metric using context-aware neural networks, in which the noise class (context) of the degraded or noisy speech signal is first identified using a classifier, and then deep neural network (DNN)-based speech quality metrics (SQMs) are trained and optimized for each noise class to obtain noise class-specific (context-specific) optimized speech quality predictions (MOS scores). The noisy speech signals, that is, clean speech signals degraded by different types of background noise, are taken from the NOIZEUS speech corpus. Results demonstrate that, even with the small number of speech samples available from the NOIZEUS speech corpus, the proposed metric outperforms, in different contexts, a metric in which the contexts are not classified before speech quality prediction.
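The two-stage structure described in this abstract (classify the noise context, then apply a predictor trained for that context) can be sketched as a simple dispatch. Everything named here is a hypothetical stand-in: `classify_noise` and the per-class predictors are toy functions, not the classifier or DNNs from the paper, and the class names and scores are invented for illustration.

```python
def predict_mos(features, classify_noise, predictors):
    """Route a feature vector to the predictor trained for its noise class.

    `classify_noise` maps features to a class label; `predictors` maps each
    label to a callable returning a MOS-like score.
    """
    noise_class = classify_noise(features)
    return noise_class, predictors[noise_class](features)


# Toy stand-ins (assumptions): a threshold "classifier" and constant predictors.
def toy_classifier(features):
    return "babble" if features[0] > 0.5 else "car"


toy_predictors = {
    "babble": lambda features: 2.0,
    "car": lambda features: 3.5,
}

label, mos = predict_mos([0.9, 0.1], toy_classifier, toy_predictors)
```

The design point is that each predictor only ever sees data from one noise context, so it can specialise; the price is that a misclassified context sends the signal to the wrong predictor.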
On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification
Speech intelligibility can be affected by multiple factors, such as noisy
environments, channel distortions or physiological issues. In this work, we
deal with the problem of automatic prediction of the speech intelligibility
level in this latter case. Starting from our previous work, a non-intrusive
system based on LSTM networks with attention mechanism designed for this task,
we present two main contributions. In the first one, we propose the use of
per-frame modulation spectrograms as input features, instead of compact
representations derived from them that discard important temporal information.
In the second one, two different strategies for the combination of per-frame
acoustic log-mel and modulation spectrograms into the LSTM framework are
explored: late fusion at the decision level and Weighted-Pooling (WP) fusion
at the utterance level. The proposed models are evaluated with the
UA-Speech database that contains dysarthric speech with different degrees of
severity. On the one hand, results show that attentional LSTM networks are able
to adequately model the modulation spectrogram sequences, producing classification
rates similar to those obtained with log-mel spectrograms. On the other hand,
both combination strategies, late and WP fusion, outperform the single-feature
systems, suggesting that per-frame log-mel and modulation spectrograms carry
complementary information for the task of speech intelligibility prediction,
that can be effectively exploited by the LSTM-based architectures. The
system with the WP fusion strategy and Attention-Pooling achieves the
best results.
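The two combination strategies named in this abstract can be contrasted with a small sketch. It is an illustration under stated assumptions, not the paper's architecture: `late_fusion` averages the class posteriors of two independent systems, `attention_pool` collapses per-frame vectors into one utterance vector using softmax weights, and `utterance_level_fusion` assumes the pooled vectors of the two feature streams are simply concatenated before classification. The attention scores are supplied by hand here; in the paper they would come from a trained network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, scores):
    """Weighted sum of per-frame vectors, with weights softmax(scores)."""
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def late_fusion(posteriors_a, posteriors_b, alpha=0.5):
    """Decision-level fusion: convex combination of two systems' posteriors."""
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(posteriors_a, posteriors_b)]


def utterance_level_fusion(pooled_a, pooled_b):
    """Assumed utterance-level fusion: concatenate the two pooled vectors."""
    return list(pooled_a) + list(pooled_b)


# Toy per-frame hidden vectors for the log-mel and modulation streams.
logmel_frames = [[1.0, 0.0], [0.0, 1.0]]
modulation_frames = [[2.0, 2.0], [4.0, 0.0]]
fused = utterance_level_fusion(
    attention_pool(logmel_frames, [0.0, 0.0]),
    attention_pool(modulation_frames, [0.0, 0.0]),
)
```

With equal scores the attention weights are uniform, so pooling reduces to a plain mean over frames; unequal scores let some frames dominate the utterance representation.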
- …