5 research outputs found

    Voice Analysis for Stress Detection and Application in Virtual Reality to Improve Public Speaking in Real-time: A Review

    Stress during public speaking is common and adversely affects performance and self-confidence. Extensive research has been carried out to develop models that recognize emotional states, but minimal research has addressed detecting stress during public speaking in real time using voice analysis. In this context, the current review shows that the application of such algorithms has not been properly explored, and it helps identify the main obstacles to creating a suitable testing environment given current complexities and limitations. In this paper, we present our main idea and propose a computational stress detection model that could be integrated into a Virtual Reality (VR) application to create an intelligent virtual audience for improving public speaking skills. When integrated with VR, the model will detect excessive stress in real time by analysing voice features correlated with physiological parameters indicative of stress, helping users gradually control excessive stress and improve public speaking performance. Comment: 41 pages, 7 figures, 4 tables.
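    The review above does not include code; purely as an illustration, the sketch below shows the kind of frame-level voice-feature extraction (fundamental frequency and short-time energy, two cues commonly correlated with vocal stress) that a real-time pipeline of this sort might start from. It assumes the librosa library; the feature set and the function name are hypothetical, not taken from the paper.

        # Illustrative sketch (not the paper's model): frame-level voice features
        # often associated with vocal stress, computed with librosa.
        import numpy as np
        import librosa

        def voice_stress_features(wav_path, sr=16000):
            y, sr = librosa.load(wav_path, sr=sr)
            # Fundamental frequency (F0) per frame via probabilistic YIN
            f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                              fmax=librosa.note_to_hz("C6"), sr=sr)
            rms = librosa.feature.rms(y=y)[0]          # short-time energy per frame
            return {
                "f0_mean": float(np.nanmean(f0)),      # pitch tends to rise under stress
                "f0_std": float(np.nanstd(f0)),
                "rms_mean": float(rms.mean()),
                "voiced_ratio": float(np.mean(voiced_flag)),
            }

    A downstream detector could then compare such statistics against a speaker's relaxed baseline rather than fixed thresholds.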

    Multinomial logistic regression probability ratio-based feature vectors for Malay vowel recognition

    Vowel recognition is a part of automatic speech recognition (ASR) systems that classifies speech signals into groups of vowels. The performance of Malay vowel recognition (MVR), like that of any multiclass classification problem, depends largely on the feature vectors (FVs). FVs such as Mel-frequency cepstral coefficients (MFCC) have produced high error rates due to poor phoneme information. Classifier-transformed probabilistic features have proved a better alternative for conveying phoneme information; however, their high dimensionality introduces additional complexity that degrades ASR performance. This study aims to improve MVR performance by proposing an algorithm that transforms MFCC FVs into a new set of features using multinomial logistic regression (MLR), reducing the dimensionality of the probabilistic features. The study was carried out in four phases: pre-processing and feature extraction, generation of the best regression coefficients, feature transformation, and performance evaluation. The speech corpus consists of 1953 samples of the five Malay vowels /a/, /e/, /i/, /o/ and /u/, recorded from students of two public universities in Malaysia. Two algorithms were developed, DBRC and FELT. The DBRC algorithm determines the best set of regression coefficients (RCs) from the extracted 39-MFCC FVs through a resampling and data-swapping approach. The FELT algorithm transforms the 39-MFCC FVs into FELT FVs using a logistic transformation method. Vowel recognition rates of FELT and 39-MFCC FVs were compared using four classification techniques: artificial neural network, MLR, linear discriminant analysis, and k-nearest neighbour. Classification results showed that FELT FVs surpassed 39-MFCC FVs in MVR, with improvements of 1.48%-11.70% depending on the classifier. Furthermore, FELT significantly improved the recognition accuracy of the vowels /o/ and /u/, by 5.13% and 8.04% respectively. This study contributes two algorithms: one for determining the best set of RCs and one for generating FELT FVs from MFCC. FELT FVs eliminate the need for dimensionality reduction while maintaining comparable performance, and they improve MVR for all five vowels, especially /o/ and /u/. The improved MVR performance will spur the development of Malay speech-based systems, especially for the Malaysian community.
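    As a purely illustrative companion to the abstract above, the sketch below shows the general idea of classifier-transformed features: fitting a multinomial logistic regression on 39-dimensional MFCC vectors and using its class probabilities as a compact 5-dimensional feature vector. The paper's DBRC resampling and FELT logistic transformation are not reproduced; the data here are random stand-ins and scikit-learn is assumed.

        # Illustrative sketch: MFCC -> class-probability features via multinomial
        # logistic regression (not the paper's DBRC/FELT algorithms).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1953, 39))       # stand-in for 39-MFCC feature vectors
        y = rng.integers(0, 5, size=1953)     # stand-in labels for /a e i o u/

        mlr = LogisticRegression(max_iter=1000).fit(X, y)   # multinomial for 5 classes
        X_prob = mlr.predict_proba(X)         # 5-dim probabilistic features
        print(X.shape, "->", X_prob.shape)    # (1953, 39) -> (1953, 5)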

    Intelligent system for classifying types of audio content

    This work studies the classification of audio content types and develops an intelligent system for classifying long audio data. It reviews types of radio content, the audio feature extraction process, the characteristics of existing approaches to audio content classification, pre-processing of the audio signal, and methods for improving the extracted features. The object of the research is the classification of audio content types. All steps were carried out to take the neural networks from data processing to classification of the final features. Test results were obtained for two different labelled datasets created with a Python feature-extraction library; they were classified with modified parameters for working with long audio content using the 8M interface, and the results, their accuracy, and the changes needed to improve the intelligent system were analysed. The explanatory note is 87 pages long and contains 35 figures, 23 tables, and 6 appendices.
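    For illustration only, here is a minimal front end of the kind such a system needs for long audio: split the recording into fixed-length chunks and extract log-mel features per chunk before passing them to a neural classifier. The chunk length and mel settings below are assumptions, not the thesis configuration; librosa is assumed as the Python feature-extraction library.

        # Illustrative sketch: chunk long audio and compute log-mel features per chunk.
        import numpy as np
        import librosa

        def long_audio_to_features(path, sr=16000, chunk_s=10.0, n_mels=64):
            y, sr = librosa.load(path, sr=sr)
            chunk = int(chunk_s * sr)
            feats = []
            for start in range(0, max(len(y) - chunk + 1, 1), chunk):
                seg = y[start:start + chunk]
                mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=n_mels)
                feats.append(librosa.power_to_db(mel))   # (n_mels, frames) per chunk
            return np.stack(feats)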

    Automatic identification of Brazilian regional accents based on statistical modeling and machine learning techniques

    Advisors: Lee Luan Ling, Tiago Fernandes Tavares. Master's dissertation, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. The speech signal has linguistic characteristics strongly determined by geographical (region of origin), social, and ethnic aspects, such as dialects and accents. These characteristics are directly related to a language because they comprise intrinsic phonetic and phonological structures that differentiate it from others. Several studies in the speech signal processing literature aim to model regional speech variations for recognition systems, under the hypothesis that classifying linguistic variations improves recognition accuracy and yields linguistic models better suited to real applications such as forensic analysis and speech-to-text conversion. The performance of recognition systems is usually measured in a closed-set evaluation scenario, in which the training and testing data come from the same database; experiments reported in the literature consider this easiest case. A more realistic assessment uses a cross-dataset scenario, in which the training and testing data belong to two different, independent databases with no control over capture and recording conditions.
In this work, we apply speech pattern recognition techniques to identify regional variations of Brazilian Portuguese speech. The goal is to automatically identify Brazilian regional accents using GMM-UBM, iVectors, and GMM-SVM models. Besides evaluating the systems in a closed-set scenario, as in other works in the literature, we also analyse accuracy in cross-dataset scenarios. The experiments use three different Brazilian Portuguese databases and, as one of the main contributions of this work, we compiled a new speech database (Braccent) that captures part of the linguistic diversity of Brazilian Portuguese. Master's degree in Electrical Engineering, Telecommunications and Telematics area; funded by CAPES.
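    As an illustration of the modelling family named in the abstract, the sketch below trains one Gaussian mixture model per accent and classifies an utterance by its average frame log-likelihood. The dissertation's full GMM-UBM pipeline additionally MAP-adapts the class models from a universal background model, and the iVector/GMM-SVM systems are not shown; the data shapes and component count are assumptions.

        # Illustrative sketch: simplified per-accent GMM classifier (not full GMM-UBM).
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_accent_gmms(mfccs_by_accent, n_components=64):
            """mfccs_by_accent: dict accent -> array of shape (n_frames, n_mfcc)."""
            return {acc: GaussianMixture(n_components, covariance_type="diag",
                                         max_iter=100).fit(X)
                    for acc, X in mfccs_by_accent.items()}

        def classify_utterance(gmms, mfcc_frames):
            # Average frame log-likelihood under each accent model; pick the best.
            scores = {acc: gmm.score(mfcc_frames) for acc, gmm in gmms.items()}
            return max(scores, key=scores.get)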

    Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations

    The goal of this study is to investigate advanced signal processing approaches [single frequency filtering (SFF) and zero-time windowing (ZTW)] with modern deep neural networks (DNNs) [convolutional neural networks (CNNs), temporal convolutional networks (TCNs), time-delay neural networks (TDNNs), and emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN)] for classification of major dialects of English. Previous studies indicated that the SFF and ZTW methods provide higher spectro-temporal resolution. To capture the intrinsic variations in articulation among dialects, four feature representations [spectrogram (SPEC), cepstral coefficients, mel filter-bank energies, and mel-frequency cepstral coefficients (MFCCs)] are derived from the SFF and ZTW methods. Experiments with and without data augmentation using CNN classifiers revealed that the proposed features performed better than baseline short-time Fourier transform (STFT)-based features on the UT-Podcast database [Hansen, J. H., and Liu, G. (2016). "Unsupervised accent classification for deep data fusion of accent and language information," Speech Commun. 78, 19-33]. Even without data augmentation, all the proposed features showed an approximate relative improvement of 15%-20% over the best baseline (SPEC-STFT) feature. TCN, TDNN, and ECAPA-TDNN classifiers, which capture a wider temporal context, further improved the performance for many of the proposed and baseline features. Among all the baseline and proposed features, the best performance is achieved with single frequency filtered cepstral coefficients for TCN (81.30%), TDNN (81.53%), and ECAPA-TDNN (85.48%). An investigation of data-driven filters instead of a fixed mel scale improved the performance by 2.8% and 1.4% (relative) for SPEC-STFT and SPEC-SFF respectively, and left SPEC-ZTW nearly unchanged. To assist related work, the code is available [Kethireddy, R., and Kadiri, S. R. (2022). "Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations," https://github.com/r39ashmi/e2e_dialect (last viewed 21 December 2021)]. Peer reviewed.
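    The authors' released code is at the GitHub link above; independently of it, the sketch below illustrates the single frequency filtering idea as usually described in the SFF literature: shift each analysis frequency to half the sampling rate, apply a single-pole filter with its root near the unit circle, and take the magnitude as the envelope. The pole radius and frequency grid here are assumptions.

        # Illustrative sketch of SFF amplitude envelopes (not the authors' code).
        import numpy as np
        from scipy.signal import lfilter

        def sff_envelopes(x, fs, freqs_hz, r=0.995):
            """Return an array (len(freqs_hz), len(x)) of SFF amplitude envelopes."""
            n = np.arange(len(x))
            envs = []
            for f in freqs_hz:
                omega = 2 * np.pi * f / fs
                shifted = x * np.exp(1j * (np.pi - omega) * n)   # move f to fs/2
                y = lfilter([1.0], [1.0, r], shifted)            # single pole near z = -1
                envs.append(np.abs(y))
            return np.vstack(envs)

        # e.g. env = sff_envelopes(signal, 16000, np.linspace(100, 4000, 80))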