    Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

    Emotion can be inferred from both tonal and verbal information, and both kinds of features can be extracted from speech. While most previous studies addressed categorical emotion recognition from a single modality, this research presents dimensional emotion recognition that combines acoustic and text features. Thirty-one acoustic features are extracted from speech, while word vectors are used as text features. The initial single-modality results serve as a cue for combining both feature sets to improve recognition. The combined results show that fusing acoustic and text features reduces the error of dimensional emotion score prediction by about 5% relative to the acoustic-only system and 1% relative to the text-only system. The smallest error is achieved by modeling the text features with Long Short-Term Memory (LSTM) networks, the acoustic features with bidirectional LSTM networks, and concatenating both branches with dense networks.
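
    The sketch below shows one plausible shape of such a late-fusion network in Keras: an LSTM branch over word vectors, a bidirectional LSTM branch over acoustic frames, and dense layers over the concatenated outputs regressing the emotion dimensions. Sequence lengths, layer sizes, and the loss are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal Keras sketch of acoustic/text late fusion for dimensional SER.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_WORDS, EMB_DIM = 50, 300        # assumed text sequence length / word-vector size
MAX_FRAMES, N_ACOUSTIC = 500, 31    # assumed frame count; 31 acoustic features per frame

text_in = layers.Input(shape=(MAX_WORDS, EMB_DIM), name="word_vectors")
text_branch = layers.LSTM(64)(text_in)                       # text system: LSTM

acoustic_in = layers.Input(shape=(MAX_FRAMES, N_ACOUSTIC), name="acoustic_features")
acoustic_branch = layers.Bidirectional(layers.LSTM(64))(acoustic_in)  # acoustic system: BiLSTM

fused = layers.concatenate([text_branch, acoustic_branch])   # concatenate both systems
fused = layers.Dense(64, activation="relu")(fused)
dimensions = layers.Dense(3, name="emotion_dimensions")(fused)  # valence, arousal, dominance

model = Model(inputs=[text_in, acoustic_in], outputs=dimensions)
model.compile(optimizer="adam", loss="mse")
model.summary()
```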

    An Improved Speech Emotion Classification Approach Based on Optimal Voiced Unit

    Emotional speech recognition (ESR) plays a significant role in human-computer interaction. The ESR methodology involves segmenting the audio to select the units of analysis, extracting features relevant to emotion, and finally performing classification. Previous research took a single utterance as the unit of analysis, on the premise that the emotional state remains constant throughout the utterance, even though the emotional state can change over time, even within a single utterance. As a result, using the utterance as a single unit is ineffective for this purpose. The goal of this study is to discover a new voiced unit that can be used to improve ESR accuracy. Several voiced units based on voiced segments were investigated, and each candidate unit is evaluated with an ESR system based on a support vector machine classifier to determine the best one. The proposed method was validated using three datasets: EMO-DB, EMOVO, and SAVEE. Experimental results revealed that a voiced unit consisting of five voiced segments has the highest recognition rate. The emotional state of the overall utterance is decided by a majority vote over the emotional states of its parts. The proposed method outperforms the traditional method in terms of classification results, improving the recognition rates on EMO-DB, EMOVO, and SAVEE by 12%, 27%, and 23%, respectively.
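
    A minimal sketch of the per-unit classification with majority voting described above. The segmentation into voiced units and the acoustic feature extraction are abstracted away (extract_features is a placeholder), and the SVM settings follow ordinary defaults rather than the paper's exact configuration.

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC

def extract_features(voiced_unit):
    """Placeholder: return a fixed-length feature vector for one voiced unit."""
    return np.asarray(voiced_unit, dtype=float)

def train_unit_classifier(unit_features, unit_labels):
    """Train an SVM on features of individual voiced units."""
    clf = SVC(kernel="rbf")          # assumed kernel
    clf.fit(unit_features, unit_labels)
    return clf

def predict_utterance(clf, voiced_units):
    """Classify each voiced unit, then decide the utterance label by majority vote."""
    unit_preds = clf.predict(np.vstack([extract_features(u) for u in voiced_units]))
    return Counter(unit_preds).most_common(1)[0][0]
```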

    COVID-19 Detection on Chest x-ray Images by Combining Histogram-oriented Gradient and Convolutional Neural Network Features

    The COVID-19 coronavirus epidemic has spread rapidly worldwide and poses a severe health risk to infected people. The World Health Organization (WHO) has declared the coronavirus a global threat. Early detection of COVID-19, particularly in cases with no apparent symptoms, may reduce the patient mortality rate, and detecting COVID-19 with machine learning techniques will help healthcare systems around the world treat patients more rapidly. The disease is diagnosed using chest x-ray images; therefore, this study proposes a machine vision method for detecting COVID-19 in chest x-ray images. Histogram of oriented gradients (HOG) and convolutional neural network (CNN) features extracted from the x-ray images were fused and classified using a support vector machine (SVM) and softmax. The proposed feature fusion technique (99.36 percent accuracy) outperformed the individual feature extraction methods, HOG (87.34 percent) and CNN (93.64 percent).
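
    A minimal sketch of the feature-fusion idea: HOG descriptors and CNN features are concatenated per image and fed to an SVM. The pretrained backbone (MobileNetV2 here), image size, and HOG parameters are illustrative assumptions standing in for the study's exact CNN and settings.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input

# Generic pretrained CNN used as a feature extractor (an assumption, not the paper's network).
cnn = MobileNetV2(include_top=False, pooling="avg", input_shape=(224, 224, 3))

def fused_features(image_rgb_224):
    """Concatenate HOG and CNN features for one 224x224 RGB image with values in [0, 1]."""
    gray = image_rgb_224.mean(axis=2)
    hog_vec = hog(gray, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    cnn_vec = cnn.predict(preprocess_input(image_rgb_224[None] * 255.0), verbose=0)[0]
    return np.concatenate([hog_vec, cnn_vec])

# Usage sketch (train_images / train_labels are assumed to exist):
# svm = SVC(kernel="linear").fit(np.vstack([fused_features(x) for x in train_images]), train_labels)
```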

    Feature Selection Method for Real-time Speech Emotion Recognition

    Feature selection is a very important step for improving the accuracy of speech emotion recognition in many applications, such as speech-to-speech translation systems. Thousands of features can be extracted from the speech signal; however, it is unclear which features are most related to the speaker's emotional state, and most of the features relevant to emotional states have not yet been identified. The purpose of this paper is to propose a feature selection method that can find the features most related to the emotional state, whether the relationship is linear or non-linear. Most previous studies used either the correlation between acoustic features and emotions for feature selection or principal component analysis (PCA) for feature reduction. These traditional methods do not reflect all types of relations between acoustic features and the emotional state; they can only find features with a linear relationship. However, the relationship between any two variables can be linear, nonlinear, or fuzzy, so the feature selection method should take these kinds of relationships into account. Therefore, a feature selection method based on a fuzzy inference system (FIS) is proposed, which can find all features exhibiting any of the relationships mentioned above. A second FIS is then used to estimate the emotion dimensions valence and activation, and a third FIS maps the estimated valence and activation values to an emotional category. The experimental results reveal that the proposed feature selection method outperforms the traditional methods.
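
    The paper's relevance measure is built on a fuzzy inference system; as a simplified stand-in that also captures nonlinear dependence, the sketch below scores each acoustic feature against a continuous emotion dimension with mutual information and keeps the top-k features. The function names and k are illustrative, not the paper's method.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_features(acoustic_matrix, dimension_values, k=10):
    """acoustic_matrix: (n_samples, n_features); dimension_values: e.g. valence scores per sample."""
    scores = mutual_info_regression(acoustic_matrix, dimension_values)  # nonlinear dependence score
    top = np.argsort(scores)[::-1][:k]          # indices of the k most related features
    return top, scores[top]
```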

    Optimizing Fuzzy Inference Systems for Improving Speech Emotion Recognition

    A Fuzzy Inference System (FIS) is used for pattern recognition and classification in many fields, such as emotion recognition. However, the performance of a FIS depends strongly on the cluster radius, which plays a very important role in its recognition accuracy. Many researchers initialize this parameter randomly, which does not guarantee the best performance of their systems. The purpose of this paper is to optimize the FIS parameters in order to construct a highly efficient system for speech emotion recognition. Therefore, a novel optimization algorithm based on particle swarm optimization is proposed for finding the best parameters of the FIS classifier. To evaluate the proposed system, it was tested on two emotional speech databases, Fujitsu and Berlin. The simulation results show that the optimized system achieves high recognition accuracy for both languages: 97% for the Japanese database and 80% for the German database.
    Book Title: Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 201
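
    A minimal particle swarm sketch for tuning a single FIS hyperparameter such as the cluster radius. fis_accuracy is a placeholder for "train the FIS with this radius and return validation accuracy"; the swarm size, inertia, and acceleration coefficients are ordinary PSO defaults, not the paper's exact settings.

```python
import numpy as np

def fis_accuracy(radius):
    """Placeholder objective: higher is better. Replace with FIS training + validation accuracy."""
    return -(radius - 0.5) ** 2

def pso_optimize(objective, bounds=(0.05, 1.0), n_particles=20, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, n_particles)          # candidate radii
    vel = np.zeros(n_particles)
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest = pbest[np.argmax(pbest_val)]
    for _ in range(n_iters):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmax(pbest_val)]
    return gbest

best_radius = pso_optimize(fis_accuracy)
```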

    Improving speech emotion dimensions estimation using a three-layer model of human perception

    Most previous studies using the dimensional approach mainly focused on the direct relationship between acoustic features and emotion dimensions (valence, activation, and dominance). However, the acoustic features that correlate with the valence dimension are few and weak; as a result, the valence dimension has been particularly difficult to predict. The purpose of this research is to construct a speech emotion recognition system that can precisely estimate the values of emotion dimensions, especially valence. This paper proposes a three-layer model to improve the estimation of emotion dimension values from acoustic features. The proposed model consists of three layers: emotion dimensions in the top layer, semantic primitives in the middle layer, and acoustic features in the bottom layer. First, a top-down acoustic feature selection method based on this model was used to select the most relevant acoustic features for each emotion dimension. Then, a bottom-up method was used to estimate the emotion dimension values from the acoustic features: a fuzzy inference system (FIS) first estimates the degree of each semantic primitive from the acoustic features, and another FIS then estimates the emotion dimension values from the estimated degrees of the semantic primitives. The experimental results reveal that the emotion recognition system built on the proposed three-layer model outperforms the conventional system.
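
    A structural sketch of the bottom-up estimation path. The paper uses two fuzzy inference systems; here each stage is a generic regressor standing in for a FIS, so only the cascade (acoustic features → semantic primitives → emotion dimensions) is illustrated, not a specific FIS implementation.

```python
from sklearn.neural_network import MLPRegressor

class ThreeLayerEstimator:
    def __init__(self):
        # Stand-ins for the two FIS stages of the three-layer model.
        self.stage1 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)  # acoustic -> primitives
        self.stage2 = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)  # primitives -> dimensions

    def fit(self, acoustic, primitives, dimensions):
        self.stage1.fit(acoustic, primitives)                       # e.g. bright, dark, strong, calm
        self.stage2.fit(self.stage1.predict(acoustic), dimensions)  # valence, activation, dominance
        return self

    def predict(self, acoustic):
        return self.stage2.predict(self.stage1.predict(acoustic))
```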

    Toward relaying emotional state for speech-to-speech translator: Estimation of emotional state for synthesizing speech with emotion

    Most previous studies on Speech-to-Speech Translation (S2ST) focused on processing linguistic content by directly translating the spoken utterance from the source language to the target language, without taking into account paralinguistic and non-linguistic information such as the emotional state conveyed by the speaker. However, for clear communication, it is important to capture and transmit the emotional state from the source language to the target language. In order to synthesize the target speech with the emotional state conveyed at the source, a speech emotion recognition system is required to detect the emotional state in the source language. The S2ST system should allow the source and target languages to be used interchangeably, i.e. it should be able to detect the emotional state of the source regardless of the language used. This paper proposes a Bilingual Speech Emotion Recognition (BSER) system for detecting the emotional state of the source language in an S2ST system. In natural speech, humans can detect emotional states regardless of the language used; this study therefore demonstrates the feasibility of constructing a global BSER system that can recognize universal emotions. The paper introduces a three-layer model: emotion dimensions in the top layer, semantic primitives in the middle layer, and acoustic features in the bottom layer. The experimental results reveal that the proposed system precisely estimates the emotion dimensions cross-lingually, working with Japanese and German. The most important outcome is that, using the proposed normalization method for acoustic features, emotion recognition was found to be language independent. Therefore, this system can be extended to estimate the emotional state conveyed in the source language of an S2ST system for several language pairs.
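
    A minimal sketch of per-corpus acoustic feature normalization (z-scoring each feature within its own corpus), one common way to remove language- or corpus-specific offsets before cross-lingual training. The paper's exact normalization method may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def normalize_per_corpus(features, corpus_ids):
    """features: (n_samples, n_features); corpus_ids: corpus/language label per sample."""
    out = np.empty_like(features, dtype=float)
    for cid in np.unique(corpus_ids):
        mask = corpus_ids == cid
        mu = features[mask].mean(axis=0)
        sd = features[mask].std(axis=0) + 1e-8     # avoid division by zero
        out[mask] = (features[mask] - mu) / sd     # z-score within this corpus only
    return out
```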

    Cross-lingual Speech Emotion Recognition System Based on a Three-Layer Model for Human Perception

    The purpose of this study is to investigate whether the emotion dimensions valence, activation, and dominance can be estimated cross-lingually. Most previous studies on automatic speech emotion recognition detected the emotional state within a single language. However, to develop a generalized emotion recognition system, the performance of such systems must be analyzed both mono-lingually and cross-lingually. The ultimate goal of this study is to build a bilingual emotion recognition system that can estimate emotion dimensions for one language using a system trained on another language. In this study, we first propose a novel acoustic feature selection method based on a human perception model. The proposed model consists of three layers: emotion dimensions in the top layer, semantic primitives in the middle layer, and acoustic features in the bottom layer. The experimental results reveal that the proposed method is effective for selecting acoustic features that represent emotion dimensions, working with two different databases, one in Japanese and the other in German. Finally, the acoustic features common to the two databases are used as the input to the cross-lingual emotion recognition system. The proposed cross-lingual system based on the three-layer model performs just as well as the two separate mono-lingual systems for estimating emotion dimension values.
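
    A minimal sketch of the cross-lingual evaluation protocol: restrict both corpora to the acoustic features found relevant in common, train a dimension estimator on one language, and measure the error on the other. The estimator and error metric are illustrative placeholders, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error

def cross_lingual_eval(X_train, y_train, X_test, y_test, common_idx):
    """Train on one language's data, evaluate on the other, using only the common features."""
    model = MultiOutputRegressor(SVR())                    # valence, activation, dominance
    model.fit(X_train[:, common_idx], y_train)
    pred = model.predict(X_test[:, common_idx])
    return mean_absolute_error(y_test, pred)

# Usage sketch (the arrays are assumed to exist):
# error_ja_to_de = cross_lingual_eval(X_japanese, y_japanese, X_german, y_german, common_idx)
```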

    Speech Emotion Recognition System Based on a Dimensional Approach Using a Three-Layered Model

    This paper proposes a three-layer model for estimating the emotions expressed in a speech signal based on a dimensional approach. Most previous studies using the dimensional approach mainly focused on the direct relationship between acoustic features and emotion dimensions (valence, activation, and dominance). However, the acoustic features that correlate with the valence dimension are few and weak, and the valence dimension has been particularly difficult to predict. The ultimate goal of this study is to improve the dimensional approach so that the valence dimension can be predicted precisely. The proposed model consists of three layers: acoustic features, semantic primitives, and emotion dimensions. We aimed to construct a three-layer model that imitates the process by which humans perceive and recognize emotions. In this study, we first investigated the correlations between the elements of the two-layer model and the elements of the three-layer model. In addition, we compared the two models by applying a fuzzy inference system (FIS) to estimate the emotion dimensions. In our model, a FIS is used to estimate the semantic primitives from the acoustic features, and then another FIS estimates the emotion dimensions from the estimated semantic primitives. The experimental results show that the proposed three-layer model outperforms the traditional two-layer model.

    Extractive Arabic Text Summarization Using Modified PageRank Algorithm

    This paper proposes an approach for Arabic text summarization. Text summarization is an application of natural language processing used to reduce the amount of original text and retrieve only the important information from it. The Arabic language has a complex morphological structure, which makes it very difficult to extract the nouns used as features for the summarization process; therefore, the Al-Khalil morphological analyzer is used to solve the noun extraction problem. The proposed approach is a graph-based system that represents the document as a graph in which the vertices are the sentences. A modified PageRank algorithm is applied with an initial score for each node equal to the number of nouns in the corresponding sentence: more nouns in a sentence mean more information, so the noun count is used as the initial rank of the sentence. Edges between sentences are weighted by the cosine similarity between them, so that the final summary contains sentences that carry more information and are well connected with each other. The summarization process consists of three major stages: pre-processing; feature extraction and graph construction; and applying the modified PageRank algorithm and extracting the summary. The modified PageRank algorithm was run with different numbers of iterations to find the number that returns the best summary, and the extracted summary depends on the compression ratio, taking into account the removal of redundancy based on the overlap between sentences. To evaluate the performance of this approach, the EASC corpus is used as a standard benchmark. The LexRank and TextRank algorithms were evaluated under the same conditions, and the proposed approach provides better results than other Arabic text summarization techniques. The proposed approach performs efficiently with 10,000 iterations.
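
    A minimal sketch of the graph construction and the modified PageRank step: sentences become nodes, noun counts become the initial scores, and cosine similarity between sentence vectors weights the edges. The damping factor, the TF-IDF vectorizer, and the selection of the top sentences are ordinary defaults standing in for the paper's exact settings (the noun counts would come from the Al-Khalil analyzer).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def modified_pagerank_summary(sentences, noun_counts, damping=0.85, n_iters=10000, top_n=3):
    """Rank sentences by PageRank over a cosine-similarity graph, seeded with noun counts."""
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    np.fill_diagonal(sim, 0.0)
    col_sums = sim.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    transition = sim / col_sums                        # column-normalized similarity graph
    scores = np.asarray(noun_counts, dtype=float)
    scores = scores / scores.sum()                     # noun count as the initial rank
    base = scores.copy()                               # restart distribution biased toward noun-rich sentences
    for _ in range(n_iters):
        scores = (1 - damping) * base + damping * transition @ scores
    ranked = np.argsort(scores)[::-1][:top_n]
    return [sentences[i] for i in sorted(ranked)]      # keep original sentence order in the summary
```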