27 research outputs found

    Perceptual Echo Control and Delay Estimation

    Get PDF

    An Algorithm to Evaluate the Echo Signal and the Voice Quality in VoIP Networks

    Get PDF
    Voice over the Internet Protocol (VoIP) has been increasingly popular, but reliability and voice quality remain important factors that limit the widespread adoption of VoIP systems. Providing good voice quality is of major importance for the transition from the PSTN to VoIP networks. There are several non-real-time algorithms that estimate the voice quality such as the PESQ and the E-model. In this thesis we propose a real-time fuzzy algorithm to estimate the echo quality component of the voice quality in VoIP networks. Differently from the existing algorithms, the proposed algorithm does not need a reference signal and has low computational complexity. For these reasons, the proposed algorithm can be embedded in every VoIP system of a network to monitor live calls, giving an estimate of the instantaneous voice quality to the network provider

    Analysis of voice quality problems of Voice Over Internet Protocol (VoIP)

    Get PDF
    After its introduction in mid 90s, Voice Over Internet Protocol (VoIP) or IP telephony has drawn much attention. The prospect of cost savings on long distance and international toll calls, the global presence of Internet Protocol (IP), and the trend to converge data networks with voice networks have made VoIP one of the fastest growing telecom sectors. Additionally, the emergence of 3rd Generation (3G) cellular technology which offers high bandwidth will result in the convergence of the Internet and the cellular networks which will further stimulate the growth of VoIP. However, VoIP faces many problems mainly because of the nature of IP networks which were built to transport non-real-time data unlike voice. This thesis analyzes factors affecting the voice quality of VoIP. These factors are delay, jitter, packet loss, link errors, echo and Voice Activity Detection (VAD). Further, implementation suggestions to lessen the effects of these factors are presented and finally, these suggestions are analyzed.http://archive.org/details/analysisofvoiceq109456261First Lieutenant, Turkish ArmyApproved for public release; distribution is unlimited

    Objective and Subjective Evaluation of Wideband Speech Quality

    Get PDF
    Traditional landline and cellular communications use a bandwidth of 300 - 3400 Hz for transmitting speech. This narrow bandwidth impacts quality, intelligibility and naturalness of transmitted speech. There is an impending change within the telecommunication industry towards using wider bandwidth speech, but the enlarged bandwidth also introduces a few challenges in speech processing. Echo and noise are two challenging issues in wideband telephony, due to increased perceptual sensitivity by users. Subjective and/or objective measurements of speech quality are important in benchmarking speech processing algorithms and evaluating the effect of parameters like noise, echo, and delay in wideband telephony. Subjective measures include ratings of speech quality by listeners, whereas objective measures compute a metric based on the reference and degraded speech samples. While subjective quality ratings are the gold - standard\u27\u27, they are also time- and resource- consuming. An objective metric that correlates highly with subjective data is attractive, as it can act as a substitute for subjective quality scores in gauging the performance of different algorithms and devices. This thesis reports results from a series of experiments on subjective and objective speech quality evaluation for wideband telephony applications. First, a custom wideband noise reduction database was created that contained speech samples corrupted by different background noises at different signal to noise ratios (SNRs) and processed by six different noise reduction algorithms. Comprehensive subjective evaluation of this database revealed an interaction between the algorithm performance, noise type and SNR. Several auditory-based objective metrics such as the Loudness Pattern Distortion (LPD) measure based on the Moore - Glasberg auditory model were evaluated in predicting the subjective scores. In addition, the performance of Bayesian Multivariate Regression Splines(BMLS) was also evaluated in terms of mapping the scores calculated by the objective metrics to the true quality scores. The combination of LPD and BMLS resulted in high correlation with the subjective scores and was used as a substitution for fine - tuning the noise reduction algorithms. Second, the effect of echo and delay on the wideband speech was evaluated in both listening and conversational context, through both subjective and objective measures. A database containing speech samples corrupted by echo with different delay and frequency response characteristics was created, and was later used to collect subjective quality ratings. The LPD - BMLS objective metric was then validated using the subjective scores. Third, to evaluate the effect of echo and delay in conversational context, a realtime simulator was developed. Pairs of subjects conversed over the simulated system and rated the quality of their conversations which were degraded by different amount of echo and delay. The quality scores were analysed and LPD+BMLS combination was found to be effective in predicting subjective impressions of quality for condition-averaged data

    Quality of media traffic over Lossy internet protocol networks: Measurement and improvement.

    Get PDF
    Voice over Internet Protocol (VoIP) is an active area of research in the world of communication. The high revenue made by the telecommunication companies is a motivation to develop solutions that transmit voice over other media rather than the traditional, circuit switching network. However, while IP networks can carry data traffic very well due to their besteffort nature, they are not designed to carry real-time applications such as voice. As such several degradations can happen to the speech signal before it reaches its destination. Therefore, it is important for legal, commercial, and technical reasons to measure the quality of VoIP applications accurately and non-intrusively. Several methods were proposed to measure the speech quality: some of these methods are subjective, others are intrusive-based while others are non-intrusive. One of the non-intrusive methods for measuring the speech quality is the E-model standardised by the International Telecommunication Union-Telecommunication Standardisation Sector (ITU-T). Although the E-model is a non-intrusive method for measuring the speech quality, but it depends on the time-consuming, expensive and hard to conduct subjective tests to calibrate its parameters, consequently it is applicable to a limited number of conditions and speech coders. Also, it is less accurate than the intrusive methods such as Perceptual Evaluation of Speech Quality (PESQ) because it does not consider the contents of the received signal. In this thesis an approach to extend the E-model based on PESQ is proposed. Using this method the E-model can be extended to new network conditions and applied to new speech coders without the need for the subjective tests. The modified E-model calibrated using PESQ is compared with the E-model calibrated using i ii subjective tests to prove its effectiveness. During the above extension the relation between quality estimation using the E-model and PESQ is investigated and a correction formula is proposed to correct the deviation in speech quality estimation. Another extension to the E-model to improve its accuracy in comparison with the PESQ looks into the content of the degraded signal and classifies packet loss into either Voiced or Unvoiced based on the received surrounding packets. The accuracy of the proposed method is evaluated by comparing the estimation of the new method that takes packet class into consideration with the measurement provided by PESQ as a more accurate, intrusive method for measuring the speech quality. The above two extensions for quality estimation of the E-model are combined to offer a method for estimating the quality of VoIP applications accurately, nonintrusively without the need for the time-consuming, expensive, and hard to conduct subjective tests. Finally, the applicability of the E-model or the modified E-model in measuring the quality of services in Service Oriented Computing (SOC) is illustrated

    A MODEL FOR PREDICTING THE PERFORMANCE OF IP VIDEOCONFERENCING

    Get PDF
    With the incorporation of free desktop videoconferencing (DVC) software on the majority of the world's PCs, over the recent years, there has, inevitably, been considerable interest in using DVC over the Internet. The growing popularity of DVC increases the need for multimedia quality assessment. However, the task of predicting the perceived multimedia quality over the Internet Protocol (IP) networks is complicated by the fact that the audio and video streams are susceptible to unique impairments due to the unpredictable nature of IP networks, different types of task scenarios, different levels of complexity, and other related factors. To date, a standard consensus to define the IP media Quality of Service (QoS) has yet to be implemented. The thesis addresses this problem by investigating a new approach to assess the quality of audio, video, and audiovisual overall as perceived in low cost DVC systems. The main aim of the thesis is to investigate current methods used to assess the perceived IP media quality, and then propose a model which will predict the quality of audiovisual experience from prevailing network parameters. This thesis investigates the effects of various traffic conditions, such as, packet loss, jitter, and delay and other factors that may influence end user acceptance, when low cost DVC is used over the Internet. It also investigates the interaction effects between the audio and video media, and the issues involving the lip sychronisation error. The thesis provides the empirical evidence that the subjective mean opinion score (MOS) of the perceived multimedia quality is unaffected by lip synchronisation error in low cost DVC systems. The data-gathering approach that is advocated in this thesis involves both field and laboratory trials to enable the comparisons of results between classroom-based experiments and real-world environments to be made, and to provide actual real-world confirmation of the bench tests. The subjective test method was employed since it has been proven to be more robust and suitable for the research studies, as compared to objective testing techniques. The MOS results, and the number of observations obtained, have enabled a set of criteria to be established that can be used to determine the acceptable QoS for given network conditions and task scenarios. Based upon these comprehensive findings, the final contribution of the thesis is the proposal of a new adaptive architecture method that is intended to enable the performance of IP based DVC of a particular session to be predicted for a given network condition

    Internet Telephony : optimizing protocols, packet recovery, and packet size

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (leaves 104-106).by Grant Ho.M.Eng

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    Get PDF
    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the users signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the in- coming far-end user’s speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address Double-Talk Detection (DTD) for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on doubletalk. Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false doubletalk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non- minimum phase Room Impulse Response (RIR). We describe the process by which percep- tually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    Get PDF
    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the users signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the in- coming far-end user’s speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address Double-Talk Detection (DTD) for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on doubletalk. Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false doubletalk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non- minimum phase Room Impulse Response (RIR). We describe the process by which percep- tually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique
    corecore