12 research outputs found

    A Speech Quality Classifier based on Tree-CNN Algorithm that Considers Network Degradations

    Many factors can affect the users’ quality of experience (QoE) in speech communication services. The impairment factors arise from physical phenomena that occur in the transmission channel of wireless and wired networks. Monitoring users’ QoE is important for service providers. In this context, a non-intrusive speech quality classifier based on the Tree Convolutional Neural Network (Tree-CNN) is proposed. The Tree-CNN is an adaptive network structure composed of hierarchical CNN models, and its main advantage is a shorter training time, which is highly relevant for speech quality assessment methods. In the training phase of the proposed classifier model, speech signals impaired by wired and wireless network degradation are used as input. In addition, the network scenario implements different modulation schemes and channel degradation intensities, such as packet loss rate, signal-to-noise ratio, and maximum Doppler shift frequency. Experimental results demonstrate that the proposed model achieves a significant reduction in training time, reaching a 25% reduction relative to another implementation based on DRBM. The accuracy reached by the Tree-CNN model is almost 95% for each quality class. Performance assessment results show that the proposed classifier based on the Tree-CNN outperforms both the current standardized algorithm described in ITU-T Rec. P.563 and the speech quality assessment method called ViSQOL.

    Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions

    In recent years, there has been great progress in automatic speech recognition. The challenge now is to recognize not only the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and the personality of the speaker. This research work aims to develop a methodology for automatic emotion recognition from speech signals in non-controlled noise conditions. To that end, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for this purpose.

    Coded Speech Quality Measurement by a Non-Intrusive PESQ-DNN

    Wideband codecs such as AMR-WB or EVS are widely used in (mobile) speech communication. Evaluation of coded speech quality is often performed subjectively by an absolute category rating (ACR) listening test. However, the ACR test is impractical for online monitoring of speech communication networks. Perceptual evaluation of speech quality (PESQ) is one of the widely used metrics instrumentally predicting the results of an ACR test. However, the PESQ algorithm requires an original reference signal, which is usually unavailable in network monitoring, thus limiting its applicability. NISQA is a new non-intrusive neural-network-based speech quality measure, focusing on super-wideband speech signals. In this work, however, we aim at predicting the well-known PESQ metric using a non-intrusive PESQ-DNN model. We illustrate the potential of this model by predicting the PESQ scores of wideband-coded speech obtained from AMR-WB or EVS codecs operating at different bitrates in noisy, tandeming, and error-prone transmission conditions. We compare our methods with the state-of-the-art network topologies of QualityNet, WAWEnet, and DNSMOS (all applied to PESQ prediction) by measuring the mean absolute error (MAE) and the linear correlation coefficient (LCC). The proposed PESQ-DNN offers the best total MAE and LCC of 0.11 and 0.92, respectively, in conditions without frame loss, and remains the best when frame loss is included. Note that our model could similarly be used to non-intrusively predict POLQA or other (intrusive) metrics. Upon article acceptance, code will be provided at GitHub.

    Parallel task in Subjective Audio Quality and Speech Intelligibility Assessments

    This thesis deals with the subjective testing of both speech quality and speech intelligibility. It investigates the existing methods, identifies their basic principles, and compares their advantages and disadvantages. The work also compares different tests in terms of various parameters and provides a modern solution for existing subjective testing methods. The first part of the research deals with the repeatability of subjective speech quality tests conducted in ideal laboratory conditions. Such repeatability tasks are performed using Pearson correlations, pairwise comparison, and other mathematical analyses, and are meant to prove the correctness of the procedures used in the subjective tests. For that reason, four subjective speech quality tests were conducted in three different laboratories. The obtained results confirmed that the tests were highly repeatable and that the test requirements were strictly followed. A further study was carried out to verify the significance of speech quality and speech intelligibility tests in communication systems. To this end, more than 16 million live call records over VoIP telecommunications networks were analyzed. The results confirmed the primary assumption that a better user experience leads to longer call durations. Alongside the main results, other valuable conclusions were drawn. The next step of the thesis was to investigate the parallel task technique, the existing approaches, and their advantages and disadvantages. It turned out that the majority of parallel tasks used in tests were either physically or mentally oriented. As the subjects in most cases are not equally physically or mentally capable, their performances during the tasks are not equal either, so the results cannot be compared correctly. In this thesis, a novel approach is proposed in which the conditions for all subjects are equal. The approach presents a variety of tasks that combine mental and physical load (laser-shooting simulator, car driving simulator, object sorting, and others). These methods were then used in several subjective speech quality and speech intelligibility tests. The results indicate that tests with parallel tasks yield more realistic values than those conducted in laboratory conditions. Based on the research, experience, and achieved results, a new standard was submitted to the European Telecommunications Standards Institute with an overview, examples, and recommendations for conducting subjective speech quality and speech intelligibility tests. The standard was accepted and published under the number ETSI TR 103 503.

    Robust Speaker Verification

    The goal of this paper is to analyze the impact of codec-degraded speech on a state-of-the-art speaker recognition system. Two feature extraction techniques are analyzed: Mel Frequency Cepstral Coefficients (MFCC) and a state-of-the-art system that combines Bottleneck features with MFCC. The speaker recognition system is based on i-vectors and Probabilistic Linear Discriminant Analysis (PLDA). We compared scenarios in which PLDA is trained only on clean data, then a system to which we added noisy and reverberant data, and finally codec-degraded speech. We evaluated the systems in matched conditions (data from the same codec are seen in PLDA training) and in mismatched conditions (PLDA training does not see any data from the tested codec or codec family). We also experimented with a recently introduced technique for channel adaptation, Within-class Covariance Correction (WCC). We can see a clear benefit from adding transcoded data to PLDA or WCC (with approximately the same gain) for both tested conditions (matched and mismatched).

    The Effect Of Acoustic Variability On Automatic Speaker Recognition Systems

    This thesis examines the influence of acoustic variability on automatic speaker recognition systems (ASRs) with three aims: (i) to measure ASR performance under 5 commonly encountered acoustic conditions; (ii) to contribute towards ASR system development with the provision of new research data; (iii) to assess ASR suitability for forensic speaker comparison (FSC) application and investigative/pre-forensic use. The thesis begins with a literature review and explanation of relevant technical terms. Five categories of research experiments then examine ASR performance, reflective of conditions influencing speech quantity (inhibitors) and speech quality (contaminants), acknowledging that quality often influences quantity. Experiments pertain to: net speech duration, signal to noise ratio (SNR), reverberation, frequency bandwidth and transcoding (codecs). The ASR system is placed under scrutiny with examination of settings and optimum conditions (e.g. matched/unmatched test audio and speaker models). Output is examined in relation to baseline performance, and metrics assist in informing whether ASRs should be applied to suboptimal audio recordings. Results indicate that modern ASRs are relatively resilient to low and moderate levels of the acoustic contaminants and inhibitors examined, whilst remaining sensitive to higher levels. The thesis provides discussion on issues such as the complexity and fragility of the speech signal path, speaker variability, difficulty in measuring conditions and mitigation (thresholds and settings). The application of ASRs to casework is discussed with recommendations, acknowledging the different modes of operation (e.g. investigative usage) and current UK limitations regarding presenting ASR output as evidence in criminal trials. In summary, and in the context of acoustic variability, the thesis recommends that ASRs could be applied to pre-forensic cases, accepting that extraneous issues endure which require governance, such as validation of method (ASR standardisation) and population data selection. However, ASRs remain unsuitable for broad forensic application, with many acoustic conditions causing irrecoverable speech data loss that contributes to high error rates.

    Recent Advances in Signal Processing

    The signal processing task is a very critical issue in the majority of new technological inventions and challenges in a variety of applications in both science and engineering fields. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward both students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories are ordered to address image processing, speech processing, communication systems, time-series analysis, and educational packages respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity

    Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum

    Models and analysis of vocal emissions for biomedical applications

    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies

    L’individualità del parlante nelle scienze fonetiche: applicazioni tecnologiche e forensi
