
    Automatic Speech Emotion Recognition Using Machine Learning

    This chapter presents a comparative study of speech emotion recognition (SER) systems. Theoretical definitions, the categorization of affective states, and the modalities of emotion expression are presented. For this study, an SER system based on different classifiers and different feature-extraction methods was developed. Mel-frequency cepstral coefficients (MFCC) and modulation spectral (MS) features are extracted from the speech signals and used to train the classifiers. Feature selection (FS) was applied to find the most relevant feature subset. Several machine learning paradigms were used for the emotion classification task: a recurrent neural network (RNN) classifier is used first to classify seven emotions, and its performance is then compared to multivariate linear regression (MLR) and support vector machine (SVM) techniques, which are widely used for emotion recognition in spoken audio. The Berlin and Spanish databases are used as the experimental data sets. The study shows that, for the Berlin database, all classifiers achieve an accuracy of 83% when speaker normalization (SN) and feature selection are applied to the features. For the Spanish database, the best accuracy (94%) is achieved by the RNN classifier without SN and with FS.
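
    A minimal sketch of the MFCC-plus-SVM portion of such a pipeline, assuming librosa and scikit-learn as stand-ins for the study's toolchain; the modulation spectral features, speaker normalization and the RNN/MLR classifiers are not reproduced, and the corpus-loading step is hypothetical.

        import numpy as np
        import librosa
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        def mfcc_features(path, n_mfcc=13):
            # One fixed-length vector per utterance: mean and std of frame MFCCs.
            y, sr = librosa.load(path, sr=16000)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        # Hypothetical corpus loading: wav_paths lists utterance files, labels
        # gives the matching emotion per file (e.g. parsed from EmoDB names).
        # X = np.vstack([mfcc_features(p) for p in wav_paths])
        # print(cross_val_score(clf, X, np.array(labels), cv=5).mean())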

    Comprehensive Study of Automatic Speech Emotion Recognition Systems

    Speech emotion recognition (SER) is the technology that recognizes psychological characteristics and feelings from speech signals. SER is challenging because arousal and valence levels vary considerably across languages. Various technical developments in artificial intelligence and signal processing methods have made it possible to interpret emotions. SER plays a vital role in remote communication. This paper offers a recent survey of SER using machine learning (ML)- and deep learning (DL)-based techniques. It focuses on the various feature representations and classification techniques used for SER, and further describes the databases and evaluation metrics used for speech emotion recognition.
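
    As an illustration of the evaluation side such surveys cover, a small sketch of two metrics commonly reported for SER, weighted accuracy (WA) and unweighted average recall (UAR, which compensates for class imbalance), using scikit-learn; the toy labels are invented for the demo.

        from sklearn.metrics import accuracy_score, recall_score

        y_true = ["angry", "sad", "sad", "neutral", "happy", "sad"]
        y_pred = ["angry", "sad", "neutral", "neutral", "sad", "sad"]

        wa = accuracy_score(y_true, y_pred)                  # fraction correct
        uar = recall_score(y_true, y_pred, average="macro")  # mean per-class recall
        print(f"WA={wa:.2f}  UAR={uar:.2f}")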

    Optimization of automatic speech emotion recognition systems

    The basis for successfully integrating emotional intelligence into sophisticated artificial intelligence systems is the reliable recognition of emotional states, with the paralinguistic content of speech standing out as a particularly significant carrier of information about the speaker's emotional state. In this work, a comparative analysis of the speech-signal features and classification methods most often used for automatic recognition of speakers' emotional states is performed, after which possibilities for improving the performance of automatic speech emotion recognition systems are considered. Discrete hidden Markov models were improved by using the QQ plot to determine the codevectors for vector quantization, and further model improvements were also considered. The possibilities for a more faithful representation of the speech signal were examined, with the analysis extended to a large number of features from different groups. Large feature sets impose the need for dimensionality reduction, for which an alternative method based on the Fibonacci sequence of numbers was analyzed alongside the known methods. Finally, the possibilities for integrating the advantages of different approaches into a single automatic speech emotion recognition system are considered: a parallel multi-classifier structure is proposed, with a combination rule that uses not only the classification results of the individual ensemble classifiers but also information about the classifiers' characteristics. A proposal is also given for automatically forming an ensemble of classifiers of arbitrary size using dimensionality reduction based on the Fibonacci sequence of numbers.
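
    A loose sketch of the vector-quantization front end that a discrete-HMM recogniser of this kind relies on, with scikit-learn's k-means standing in for the QQ-curve codevector seeding proposed in the work; the feature frames are synthetic, and the per-emotion HMM training is only indicated in comments.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        frames = rng.normal(size=(5000, 13))   # stand-in for MFCC frame vectors

        # Codebook of 64 codevectors; the thesis seeds these via a QQ curve,
        # ordinary k-means++ initialisation stands in here.
        codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(frames)
        symbols = codebook.predict(frames)     # discrete observation sequence
        print(symbols[:20])
        # One discrete HMM per emotion would be trained on such symbol
        # sequences; an utterance is assigned to the highest-likelihood model.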

    Automatic Speech Emotion Recognition - Feature Space Dimensionality and Classification Challenges

    In the last decade, research in Speech Emotion Recognition (SER) has become a major endeavour in Human Computer Interaction (HCI) and speech processing. Accurate SER is essential for many applications, like assessing customer satisfaction with quality of services or detecting/assessing the emotional state of children in care. The large number of studies published on SER reflects the demand for its use. The main concern of this thesis is the investigation of SER from pattern recognition and machine learning points of view. In particular, we aim to identify appropriate mathematical models of SER and examine the process of designing automatic emotion recognition schemes. There are major challenges to automatic SER, including ambiguity about the list/definition of emotions, the lack of agreement on a manageable set of uncorrelated speech-based emotion-relevant features, and the difficulty of collecting emotion-related datasets under natural circumstances. We initiate our work by identifying appropriate sets of emotion-related features/attributes extractable from speech signals, as considered from psychological and computational points of view. We investigate the use of pattern-recognition approaches to remove redundancies and achieve a compact digital representation of the extracted data with minimal loss of information. The thesis includes the design of new SER schemes, complements to existing ones, and large sets of experiments to empirically test their performance on different databases, identifying the advantages and shortcomings of using speech alone for emotion recognition.

    Existing SER studies seem to deal with the ambiguity/disagreement over a "limited" number of emotion-related features by expanding the list from the same speech signal sources/sites and applying various feature selection procedures as a means of reducing redundancies, while attempts are made to discover features more relevant to emotion in speech. One of our investigations focuses on proposing a new set of features for SER, extracted from Linear Predictive (LP)-residual speech. We demonstrate the usefulness of this relatively small set of features by testing the performance of an SER scheme that fuses it with the existing set of thousands of features, using the common machine learning schemes of Support Vector Machine (SVM) and Artificial Neural Network (ANN).

    The challenge of the growing dimensionality of the SER feature space and its impact on model complexity is another major focus of our research project. By studying the pros and cons of the commonly used feature selection approaches, we argue in favour of meta-feature selection and develop various methods in this direction, not only to reduce dimension but also to adapt and de-correlate emotional feature spaces for improved SER recognition accuracy. We use Principal Component Analysis (PCA) and propose Data Independent PCA (DIPCA), trained on independent emotional and non-emotional datasets. The DIPCA projections, especially when extracted from speech data coloured with different emotions or from neutral speech data, had capability comparable to PCA in terms of SER performance. Another approach adopted in this thesis for dimension reduction is Random Projection (RP) matrices, which are independent of the training data.
    We show that some versions of RP with an SVM classifier can offer an adaptation space for speaker-independent SER that avoids over-fitting and hence improves recognition accuracy. Using PCA trained on one set of data while testing on emotional data features has significant implications for machine learning in general. The thesis' other major contribution focuses on the classification aspects of SER. We investigate the drawbacks of the well-known SVM classifier when applied to data preprocessed by PCA and RP, and demonstrate the advantages of using the Linear Discriminant Classifier (LDC) instead, especially for PCA de-correlated metafeatures. We initiate a variety of LDC-based ensemble classifications, testing the performance of a scheme that uses a new form of bagging over different subsets of metafeatures extracted by PCA, with encouraging results. The experiments were conducted on two benchmark datasets (Emo-Berlin and FAU-Aibo) and an in-house dataset in the Kurdish language. The recognition accuracies achieved are significantly higher than state-of-the-art results on all datasets. The results, however, reveal a difficult challenge in the form of a persisting wide gap in accuracy across the different datasets, which cannot be explained entirely by differences between the natures of the datasets. We conducted various pilot studies, based on visualizations of the confusion matrices for the "difficult" databases, to build multi-level SER schemes. These studies provide initial evidence of the presence of more than one "emotion" in the same portion of speech. A possible solution may be to present recognition accuracy in a score-based measurement like a spider chart. Such an approach may also reveal the presence of the Doddington zoo phenomenon in SER.
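
    A rough sketch of the data-independent projection idea described above, assuming scikit-learn: PCA is fitted on an independent feature set and applied, unchanged, to the emotional data before an LDC. All arrays are synthetic stand-ins; the thesis's actual feature sets, corpora and DIPCA details are not reproduced.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(1)
        X_independent = rng.normal(size=(500, 300))   # e.g. neutral-speech features
        X_emotional = rng.normal(size=(200, 300))     # features to classify
        y = rng.integers(0, 7, size=200)              # seven emotion labels

        pca = PCA(n_components=40).fit(X_independent)  # fitted off the task data
        Z = pca.transform(X_emotional)                 # de-correlated metafeatures
        print(cross_val_score(LinearDiscriminantAnalysis(), Z, y, cv=5).mean())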

    A survey on the semi supervised learning paradigm in the context of speech emotion recognition

    The area of Automatic Speech Emotion Recognition has been a hot topic for researchers for quite some time now. Recent technological breakthroughs in the field of Machine Learning open doors for multiple approaches of many kinds. However, some concerns have persisted throughout the years, among which we highlight the design and collection of data. Proper annotation of data can be quite expensive and sometimes not even viable, as specialists are often needed for a task as complex as emotion recognition. The evolution of the semi-supervised learning paradigm tries to reduce the high dependency on labelled data, potentially facilitating the design of a proper pipeline of tasks, single- or multi-modal, towards the final objective of recognizing the human emotional state. In this paper, the current single-modal (audio) semi-supervised learning state of the art is reviewed as a possible solution to the bottlenecks mentioned, and as a way of helping and guiding future researchers in the planning phase of such a task, where many positive aspects of each piece of work can be drawn upon and combined. This work has been supported by FCT - Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/202
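
    For concreteness, a minimal self-training sketch of the semi-supervised paradigm the survey covers, using scikit-learn's SelfTrainingClassifier; the features are synthetic stand-ins for whatever audio representation a real SER front end would supply.

        import numpy as np
        from sklearn.semi_supervised import SelfTrainingClassifier
        from sklearn.svm import SVC

        rng = np.random.default_rng(2)
        X = rng.normal(size=(300, 20))
        y = rng.integers(0, 4, size=300)      # four emotion classes
        y_partial = y.copy()
        y_partial[50:] = -1                   # -1 marks unlabelled samples

        # The base SVM is refit as confidently pseudo-labelled samples are added.
        model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
        model.fit(X, y_partial)
        print((model.predict(X) == y).mean())  # accuracy against true labels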

    Development of a Real-time Embedded System for Speech Emotion Recognition

    Speech emotion recognition is one of the latest challenges in speech processing and Human Computer Interaction (HCI), addressing operational needs in real-world applications. Besides human facial expressions, speech has proven to be one of the most promising modalities for automatic human emotion recognition. Speech is a spontaneous medium for perceiving emotions that provides in-depth information about the different cognitive states of a human being. In this context, we introduce a novel approach using a combination of prosody features (pitch, energy, zero-crossing rate), quality features (formant frequencies, spectral features, etc.), derived features (Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPCC)) and a dynamic feature (Mel-Energy spectrum dynamic Coefficients (MEDC)) for robust automatic recognition of a speaker's emotional state. A multilevel SVM classifier is used to identify seven discrete emotional states, namely angry, disgust, fear, happy, neutral, sad and surprise, in five native Assamese languages. The overall experimental results using MATLAB simulation show that the combined-feature approach achieved an average accuracy of 82.26% for speaker-independent cases. A real-time implementation of the algorithm was prepared on an ARM Cortex-M3 board.
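
    A rough sketch of assembling such a combined feature vector, using librosa as a stand-in (the paper's toolchain is MATLAB): pitch, energy and zero-crossing rate statistics concatenated with MFCC means. LPCC, MEDC and the multilevel SVM stage are omitted, and the test signal is a synthetic tone.

        import numpy as np
        import librosa

        def combined_features(y, sr):
            f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)       # pitch track
            energy = librosa.feature.rms(y=y)[0]                # frame energy
            zcr = librosa.feature.zero_crossing_rate(y)[0]      # zero crossings
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
            stats = lambda v: np.array([np.mean(v), np.std(v)])
            return np.concatenate([stats(f0), stats(energy), stats(zcr),
                                   mfcc.mean(axis=1)])

        sr = 16000
        t = np.linspace(0, 1, sr, endpoint=False)
        y = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)  # toy signal
        print(combined_features(y, sr).shape)   # 2 + 2 + 2 + 13 = 19 values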