364 research outputs found
Coded Speech Quality Measurement by a Non-Intrusive PESQ-DNN
Wideband codecs such as AMR-WB or EVS are widely used in (mobile) speech
communication. Evaluation of coded speech quality is often performed
subjectively by an absolute category rating (ACR) listening test. However, the
ACR test is impractical for online monitoring of speech communication networks.
Perceptual evaluation of speech quality (PESQ) is one of the widely used
metrics instrumentally predicting the results of an ACR test. However, the PESQ
algorithm requires an original reference signal, which is usually unavailable
in network monitoring, thus limiting its applicability. NISQA is a new
non-intrusive neural-network-based speech quality measure, focusing on
super-wideband speech signals. In this work, however, we aim at predicting the
well-known PESQ metric using a non-intrusive PESQ-DNN model. We illustrate the
potential of this model by predicting the PESQ scores of wideband-coded speech
obtained from AMR-WB or EVS codecs operating at different bitrates in noisy,
tandeming, and error-prone transmission conditions. We compare our methods with
the state-of-the-art network topologies of QualityNet, WAWEnet, and DNSMOS --
all applied to PESQ prediction -- by measuring the mean absolute error (MAE)
and the linear correlation coefficient (LCC). The proposed PESQ-DNN offers the
best total MAE and LCC of 0.11 and 0.92, respectively, in conditions without
frame loss, and still is best when including frame loss. Note that our model
could be similarly used to non-intrusively predict POLQA or other (intrusive)
metrics. Upon article acceptance, code will be provided at GitHub.
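The two evaluation measures named in this abstract, MAE and LCC, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the five score pairs are made up for the example, and LCC is taken to be the Pearson correlation coefficient.

```python
import math


def mean_absolute_error(target, predicted):
    """Average absolute difference between target and predicted scores."""
    if len(target) != len(predicted) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


def linear_correlation(target, predicted):
    """Pearson linear correlation coefficient between two score lists."""
    if len(target) != len(predicted) or len(target) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(target)
    mean_t = sum(target) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(target, predicted))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical values: intrusive PESQ scores (targets) and the scores
# a non-intrusive model might predict for the same utterances.
pesq_reference = [1.5, 2.0, 2.5, 3.0, 3.5]
pesq_predicted = [1.6, 1.9, 2.6, 3.1, 3.4]

mae = mean_absolute_error(pesq_reference, pesq_predicted)
lcc = linear_correlation(pesq_reference, pesq_predicted)
print(f"MAE = {mae:.3f}, LCC = {lcc:.3f}")
```

A lower MAE and an LCC closer to 1 both indicate that the non-intrusive predictions track the intrusive PESQ scores more closely.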
Quality of media traffic over Lossy internet protocol networks: Measurement and improvement.
Voice over Internet Protocol (VoIP) is an active area of research in the world of
communication. The high revenue made by the telecommunication companies is a
motivation to develop solutions that transmit voice over other media rather than
the traditional, circuit switching network.
However, while IP networks can carry data traffic very well due to their best-effort
nature, they are not designed to carry real-time applications such as voice.
As such, several degradations can affect the speech signal before it reaches its
destination. Therefore, it is important for legal, commercial, and technical reasons
to measure the quality of VoIP applications accurately and non-intrusively.
Several methods have been proposed to measure speech quality: some are
subjective, some are intrusive, and others are non-intrusive.
One of the non-intrusive methods for measuring the speech quality is the E-model
standardised by the International Telecommunication Union-Telecommunication Standardisation
Sector (ITU-T).
Although the E-model is a non-intrusive method for measuring the speech quality,
it depends on time-consuming, expensive, and hard-to-conduct subjective
tests to calibrate its parameters; consequently, it is applicable to a limited number
of conditions and speech coders. Also, it is less accurate than the intrusive methods
such as Perceptual Evaluation of Speech Quality (PESQ) because it does not consider
the contents of the received signal.
In this thesis an approach to extend the E-model based on PESQ is proposed.
Using this method the E-model can be extended to new network conditions and
applied to new speech coders without the need for the subjective tests. The modified
E-model calibrated using PESQ is compared with the E-model calibrated using
subjective tests to prove its effectiveness.
During the above extension the relation between quality estimation using the
E-model and PESQ is investigated and a correction formula is proposed to correct
the deviation in speech quality estimation.
Another extension to the E-model to improve its accuracy in comparison with
the PESQ looks into the content of the degraded signal and classifies packet loss
into either Voiced or Unvoiced based on the received surrounding packets. The accuracy
of the proposed method is evaluated by comparing the estimation of the new
method that takes packet class into consideration with the measurement provided
by PESQ as a more accurate, intrusive method for measuring the speech quality.
The above two extensions for quality estimation of the E-model are combined
to offer a method for estimating the quality of VoIP applications accurately and non-intrusively,
without the need for the time-consuming, expensive, and hard-to-conduct
subjective tests.
Finally, the applicability of the E-model or the modified E-model in measuring
the quality of services in Service Oriented Computing (SOC) is illustrated.
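For context, the E-model discussed in this abstract produces a transmission rating factor R, which ITU-T G.107 maps to an estimated MOS. The sketch below shows only that standard mapping; it is not the thesis's PESQ-calibrated extension, and the function name `r_to_mos` is chosen here for illustration.

```python
def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (ITU-T G.107)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    # Cubic mapping defined for 0 < R < 100.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# 93.2 is the R value obtained with the G.107 default parameter settings.
for r in (50.0, 70.0, 80.0, 93.2):
    print(f"R = {r:5.1f} -> MOS = {r_to_mos(r):.2f}")
```

Because the mapping depends only on R, any recalibration of the E-model's impairment terms (for example against PESQ rather than subjective tests) changes the inputs to R but leaves this final conversion step unchanged.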
Speech quality prediction for voice over Internet protocol networks
This is a digitised version of a thesis that was deposited in the University Library.
IP networks are on a steep slope of innovation that will make them the long-term carrier
of all types of traffic, including voice. However, such networks are not designed to support
real-time voice communication because their variable characteristics (e.g. due to delay, delay
variation and packet loss) lead to a deterioration in voice quality. A major challenge in such networks
is how to measure or predict voice quality accurately and efficiently for QoS monitoring
and/or control purposes to ensure that technical and commercial requirements are met.
Voice quality can be measured using either subjective or objective methods. Subjective
measurement (e.g. MOS) is the benchmark for objective methods, but it is slow, time-consuming
and expensive. Objective measurement can be intrusive or non-intrusive. Intrusive methods
(e.g. ITU PESQ) are more accurate, but normally are unsuitable for monitoring live traffic
because they need reference data and must utilise the network. This makes non-intrusive
methods (e.g. ITU E-model) more attractive for monitoring voice quality from IP network impairments.
However, current non-intrusive methods rely on subjective tests to derive model
parameters and as a result are limited and do not meet new and emerging applications.
The main goal of the project is to develop novel and efficient models for non-intrusive
speech quality prediction to overcome the disadvantages of current subjective-based methods
and to demonstrate their usefulness in new and emerging VoIP applications. The main contributions
of the thesis are fourfold:
(1) a detailed understanding of the relationships between voice quality, IP network impairments
(e.g. packet loss, jitter and delay) and relevant parameters associated with speech (e.g.
codec type, gender and language) is provided. An understanding of the perceptual effects of
these key parameters on voice quality is important as it provides a basis for the development
of non-intrusive voice quality prediction models. A fundamental investigation of the impact of
the parameters on perceived voice quality was carried out using the latest ITU algorithm for
perceptual evaluation of speech quality, PESQ, and by exploiting the ITU E-model to obtain an
objective measure of voice quality.
(2) a new methodology to predict voice quality non-intrusively was developed. The method
exploits the intrusive algorithm, PESQ, and a combined PESQ/E-model structure to provide a
perceptually accurate prediction of both listening and conversational voice quality non-intrusively.
This avoids time-consuming subjective tests and so removes one of the major obstacles in the
development of models for voice quality prediction. The method is generic and as such has
wide applicability in multimedia applications. Efficient regression-based models and robust
artificial neural network-based learning models were developed for predicting voice quality
non-intrusively for VoIP applications.
(3) three applications of the new models were investigated: voice quality monitoring/prediction
for real Internet VoIP traces, perceived quality driven playout buffer optimization and
perceived quality driven QoS control. The neural network and regression models were both
used to predict voice quality for real Internet VoIP traces based on international links. A new
adaptive playout buffer algorithm and a perceptual optimization playout buffer algorithm are presented.
A QoS control scheme that combines the strengths of rate-adaptive and priority marking control
schemes to provide a superior QoS control in terms of measured perceived voice quality is
also provided.
(4) a new methodology for Internet-based subjective speech quality measurement which
allows rapid assessment of voice quality for VoIP applications is proposed and assessed using
both objective and traditional MOS test methods.
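To make the link between packet loss and E-model quality in contribution (1) concrete, the sketch below uses the effective equipment impairment formula from ITU-T G.107. It is not one of the thesis's own regression or neural network models. The codec parameters (Ie = 0, Bpl = 25.1) are illustrative assumptions that should be checked against ITU-T G.113, and R is simplified to the G.107 default of 93.2 minus Ie-eff, with all other impairments left at their defaults.

```python
def effective_equipment_impairment(ie, bpl, ppl, burst_r=1.0):
    """Ie-eff = Ie + (95 - Ie) * Ppl / (Ppl / BurstR + Bpl).

    ie: codec equipment impairment, bpl: packet-loss robustness factor,
    ppl: packet loss in percent, burst_r: burst ratio (1.0 = random loss).
    """
    if not 0.0 <= ppl <= 100.0:
        raise ValueError("ppl must be a percentage between 0 and 100")
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)


def r_factor_with_loss(ie, bpl, ppl, burst_r=1.0, r_default=93.2):
    """Simplified R: the default rating minus the loss-dependent impairment."""
    return r_default - effective_equipment_impairment(ie, bpl, ppl, burst_r)


# Assumed codec parameters for illustration only.
CODEC_IE, CODEC_BPL = 0.0, 25.1

for loss_percent in (0.0, 1.0, 3.0, 5.0, 10.0):
    r = r_factor_with_loss(CODEC_IE, CODEC_BPL, loss_percent)
    print(f"loss = {loss_percent:4.1f}% -> R = {r:.1f}")
```

The per-codec constants Ie and Bpl are the parameters that conventionally come from subjective tests, which is the dependency the thesis aims to remove.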
DNN-Based Source Enhancement to Increase Objective Sound Quality Assessment Score
We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been used widely for sound quality evaluation, constructing DNNs to increase OSQA scores would be better than using the minimum-MSE to create high-quality output signals. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be calculated by simply applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme on the basis of black-box optimization, which is used for training a computer that plays a game. For a black-box-optimization scheme, we adopt the policy gradient method for calculating the gradient on the basis of a sampling algorithm. To simulate output signals using the sampling algorithm, DNNs are used to estimate the probability-density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals, and the DNNs are trained to increase the probability of generating the simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores significantly increased by applying the proposed method, even though the MSE was not minimized.
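The policy gradient idea in this abstract, sampling outputs, scoring them with a black box, and raising the probability of high-scoring samples, can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions and not the paper's method: the "policy" is a one-dimensional Gaussian with a learnable mean instead of a DNN, and `black_box_score` is a made-up stand-in for an OSQA score such as PESQ.

```python
import random


def black_box_score(x):
    """Hypothetical stand-in for a quality score; peaks at x = 3.

    It is only ever evaluated, never differentiated.
    """
    return -abs(x - 3.0)


def train_policy_mean(steps=300, batch=64, sigma=0.5, lr=0.05, seed=0):
    """Learn the mean of a Gaussian policy with a score-function gradient."""
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(steps):
        # Sample candidate outputs from the current policy.
        samples = [rng.gauss(mu, sigma) for _ in range(batch)]
        scores = [black_box_score(x) for x in samples]
        # Subtracting the batch-mean score reduces gradient variance.
        baseline = sum(scores) / batch
        # d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2
        grad = sum(
            (s - baseline) * (x - mu) / sigma ** 2
            for x, s in zip(samples, scores)
        ) / batch
        mu += lr * grad  # gradient ascent on the expected score
    return mu


learned_mu = train_policy_mean()
print(f"learned mean: {learned_mu:.2f}")
```

The gradient estimate uses only sampled scores and the log-probability of each sample under the policy, which is why the scoring function itself does not need to be differentiable.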
Employing Real Training Data for Deep Noise Suppression
Most deep noise suppression (DNS) models are trained with reference-based
losses requiring access to clean speech. However, sometimes an additive
microphone model is insufficient for real-world applications. Accordingly, ways
to use real training data in supervised learning for DNS models promise to
reduce a potential training/inference mismatch. Employing real data for DNS
training requires either generative approaches or a reference-free loss without
access to the corresponding clean speech. In this work, we propose to employ an
end-to-end non-intrusive deep neural network (DNN), named PESQ-DNN, to estimate
perceptual evaluation of speech quality (PESQ) scores of enhanced real data. It
provides a reference-free perceptual loss for employing real data during DNS
training, maximizing the PESQ scores. Furthermore, we use an epoch-wise
alternating training protocol, updating the DNS model on real data, followed by
PESQ-DNN updating on synthetic data. The DNS model trained with the PESQ-DNN
employing real data outperforms all reference methods employing only synthetic
training data. On synthetic test data, our proposed method exceeds the
Interspeech 2021 DNS Challenge baseline by a significant 0.32 PESQ points. Both
on synthetic and real test data, the proposed method beats the baseline by 0.05
DNSMOS points - although PESQ-DNN optimizes for a different perceptual metric
- …