    A Review on Emotion Recognition Algorithms using Speech Analysis

    In recent years, there has been growing interest in speech emotion recognition (SER), i.e. recognizing emotion by analyzing input speech. SER can be considered a pattern recognition task comprising feature extraction, a classifier, and a speech emotion database. The objective of this paper is to provide a comprehensive review of the available literature on SER. Several audio features are available, including linear predictive coding coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and Teager energy based features. For the classifier, many algorithms are available, including the hidden Markov model (HMM), Gaussian mixture model (GMM), vector quantization (VQ), artificial neural networks (ANN), and deep neural networks (DNN). In this paper, we also review various speech emotion databases. Finally, recent related work on SER using DNNs is discussed.
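
    As a concrete illustration of the MFCC features that recur throughout these works, below is a minimal extraction sketch using the librosa library; the file path and sampling rate are placeholder assumptions rather than details from the paper.

        # Minimal sketch: extracting 13 MFCCs from a speech clip (assumed setup).
        import librosa

        y, sr = librosa.load("speech.wav", sr=16000)        # placeholder file, 16 kHz mono
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)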

    The disruptometer: an artificial intelligence algorithm for market insights

    Social media data mining is rapidly developing into a mainstream tool for marketing insights in today's world, due to the abundance of data and the often freely accessible information. In this paper, we propose a framework for market research purposes called the Disruptometer. The algorithm uses keywords to provide different types of market insights from data crawling. The preliminary algorithm mines information from Twitter and outputs 2 parameters, Product-to-Market Fit and Disruption Quotient, which are obtained from a brand's customer value proposition, problem space, and incumbent space. The algorithm has been tested with a venture capitalist portfolio company and a market research firm, showing highly correlated results. Out of 4 brand use cases, 3 obtained results identical to the analysts' studies.

    Speech emotion recognition using deep feedforward neural network

    Speech emotion recognition (SER) is currently a research hotspot due to its challenging nature but bountiful future prospects. The objective of this research is to utilize deep neural networks (DNNs) to recognize human speech emotion. First, the chosen speech features, Mel-frequency cepstral coefficients (MFCC), were extracted from raw audio data. Second, the extracted speech features were fed into the DNN to train the network. The trained network was then tested on a set of labelled emotional speech audio and the recognition rate was evaluated. Based on the accuracy rate, the number of MFCCs, neurons, and layers was adjusted for optimization. Moreover, a custom-made database is introduced and validated using the optimized network. The optimum configuration for SER is 13 MFCCs, 12 neurons, and 2 layers for 3 emotions, and 25 MFCCs, 21 neurons, and 4 layers for 4 emotions, achieving a total recognition rate of 96.3% for 3 emotions and 97.1% for 4 emotions.
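
    The reported optimum (13 MFCCs, 12 neurons, 2 layers, 3 emotions) maps naturally onto a small feedforward network. The sketch below is one possible reading of that configuration in Keras; the activations, optimizer, and loss are assumptions, and "12 neurons per hidden layer" is an interpretation of the reported figure.

        # Hypothetical feedforward SER classifier for the 3-emotion configuration.
        import tensorflow as tf

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(12, activation="relu", input_shape=(13,)),  # 13 MFCC inputs
            tf.keras.layers.Dense(12, activation="relu"),                     # second hidden layer
            tf.keras.layers.Dense(3, activation="softmax"),                   # 3 emotion classes
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])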

    On the optimum speech segment length for depression detection

    Depression is a worldwide problem which, according to the World Health Organization, is the largest contributor to global disability. According to a study, around 18336 Malaysians are suffering from depression. Therefore, an automated system that can detect depression from human speech is needed. The main objective of this paper is to investigate the optimum speech segment length that provides fast and accurate depression detection. An artificial neural network was used as the classifier to detect depression using a speech feature, i.e. the averaged Mel-frequency cepstral coefficients (MFCC). The Distress Analysis Interview Corpus Wizard of Oz (DAIC-WOZ) was used to train and test the system, which was measured in terms of accuracy and processing time while varying the number of neurons used. The obtained results were further optimized by investigating the ideal segment length for depression detection. Results showed that our proposed system can recognize depression in voiced speech at 3 levels of depression with an accuracy rate of up to 98.3% when given previous samples of the same speaker for training. Furthermore, the optimum speech segment length was found to be 7 seconds when tested over lengths between 1 and 20 seconds.
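
    The averaged-MFCC feature over a fixed segment length is straightforward to reproduce. The sketch below is a minimal, assumed implementation using librosa with the paper's optimum 7-second segment; the file name, sampling rate, and 13-coefficient choice are placeholders.

        # Averaged MFCCs over one 7-second segment (assumed parameters).
        import librosa

        y, sr = librosa.load("interview.wav", sr=16000, duration=7)  # placeholder file
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # (13, n_frames)
        feature = mfcc.mean(axis=1)                                  # time-averaged -> (13,)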

    On the effect of feature compression on speech emotion recognition across multiple languages

    The ability of computers to recognize emotions from speech is commonly termed speech emotion recognition (SER). While many studies have been performed in recent years, a gold standard has yet to be achieved due to the many parameters to consider. In this study, we investigate the effect of speech feature compression of the Mel-frequency cepstral coefficients (MFCC) across four languages: English, French, German, and Italian. The classification was performed using a deep feedforward network. The proposed methodology has been shown to yield significant results when tested using a network trained on the same language, and up to an accuracy rate of 80.8% when trained using all four languages.
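
    The abstract does not spell out the compression scheme, but one plausible form of MFCC compression is pooling a variable number of frames into a fixed number of time bins, so that utterances of different lengths, and languages, share one input size. The helper below is purely illustrative of that idea, not the paper's method.

        # Hypothetical MFCC "compression" by mean-pooling frames into fixed bins.
        import numpy as np

        def compress_mfcc(mfcc: np.ndarray, n_bins: int = 10) -> np.ndarray:
            """(n_mfcc, n_frames) -> (n_mfcc, n_bins); assumes n_frames >= n_bins."""
            bins = np.array_split(mfcc, n_bins, axis=1)  # near-equal frame groups
            return np.stack([b.mean(axis=1) for b in bins], axis=1)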

    A critical insight into multi-languages speech emotion databases

    With the increased interest in human-computer and human-human interaction, systems that deduce and identify the emotional aspects of a speech signal have emerged as a hot research topic. Recent research is directed towards the development of automated and intelligent analysis of human utterances. Although numerous studies have addressed the design of systems, algorithms, and classifiers in this field, the area is still far from standardized. There still exists a considerable amount of uncertainty with regard to aspects such as the most influential features, the better performing algorithms, the number of emotion classes, etc. Among the influencing factors, the differences between speech databases, such as the data collection method, are accepted as significant by the research community. A speech emotion database is essentially a repository of varied human speech samples collected and sampled using a specified method. This paper reviews 34 speech emotion databases for their characteristics and specifications. Furthermore, critical insights into the imitational aspects of these databases have also been highlighted.

    Comparative analysis of gender identification using speech analysis and higher order statistics

    Gender identification via speech processing is one of the hot research topics in the security research community. Many cyber systems are being developed to recognize the type of human speech. These systems mainly comprise a feature stage which extracts and selects the features of human speech. Feature extraction and feature selection are the most noteworthy phases of speech recognition, involving numerous strategies. The purpose of this paper is to investigate the potential effectiveness of spectral analysis and higher-order statistics computed over the speech segments of different genders. Spectral analysis is done via spectral descriptors consisting of varied parameters which are widely used in machine learning applications. The speech of the different genders is distinguished by means of these parameters, i.e. higher-order statistics such as the spectral centroid, spectral entropy, spectral kurtosis, spectral slope, and spectral flatness. The results obtained show successful discrimination of male and female speech based on the peakiness of the speech, voiced and unvoiced segments, and higher and lower formants.
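
    The named descriptors are standard and can be computed from a magnitude spectrogram. The sketch below shows assumed librosa/scipy/numpy versions of them, not the authors' exact pipeline; the file path is a placeholder.

        # Spectral descriptors per frame from a magnitude spectrogram (illustrative).
        import numpy as np
        import librosa
        from scipy.stats import kurtosis

        y, sr = librosa.load("speaker.wav", sr=16000)         # placeholder file
        S = np.abs(librosa.stft(y))                           # (n_bins, n_frames)
        freqs = librosa.fft_frequencies(sr=sr)

        centroid = librosa.feature.spectral_centroid(S=S, sr=sr)
        flatness = librosa.feature.spectral_flatness(S=S)
        P = S / S.sum(axis=0, keepdims=True)                  # normalize each frame
        spec_entropy = -(P * np.log2(P + 1e-12)).sum(axis=0)  # spectral entropy
        spec_kurtosis = kurtosis(S, axis=0)                   # peakiness per frame
        spec_slope = np.polyfit(freqs, S, 1)[0]               # linear slope per frame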

    On the use of voice activity detection in speech emotion recognition

    Emotion recognition through speech has many potential applications; however, the challenge lies in achieving high emotion recognition accuracy while using limited resources or in the presence of interference such as noise. In this paper we have explored the possibility of improving speech emotion recognition by utilizing the voice activity detection (VAD) concept. The emotional voice data from the Berlin Emotion Database (EMO-DB) and a custom-made database, the LQ Audio Dataset, are first preprocessed by VAD before feature extraction. The features are then passed to a deep neural network for classification. In this paper, we have chosen MFCC to be the sole determinant feature. From the results obtained with and without VAD, we have found that VAD improved the recognition rate for 5 emotions (happy, angry, sad, fear, and neutral) by 3.7% when recognizing clean signals, while using VAD when training the network with both clean and noisy signals improved our previous results by 50%.
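
    As one way to realize the VAD preprocessing step described above, the sketch below drops low-energy frames before feature extraction. The energy threshold and frame sizes are assumptions; the paper does not specify which VAD algorithm was used.

        # Simple energy-based VAD: keep only frames above an energy threshold.
        import numpy as np
        import librosa

        y, sr = librosa.load("emodb_sample.wav", sr=16000)                # placeholder file
        frames = librosa.util.frame(y, frame_length=400, hop_length=160)  # 25 ms / 10 ms
        energy = (frames ** 2).mean(axis=0)

        threshold = 0.1 * energy.max()           # assumed: keep frames above 10% of peak
        voiced = frames[:, energy > threshold]   # silent frames removed before MFCC
        y_voiced = voiced.T.reshape(-1)          # rough re-concatenation (overlap ignored)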