
    Learning spectro-temporal features with 3D CNNs for speech emotion recognition

    In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning compared to other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions. (Comment: ACII 2017, San Antonio)
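
    The kernel-shape finding above lends itself to a compact illustration. Below is a minimal PyTorch sketch of a 3D CNN over stacked spectrogram segments; the input shape, channel counts, and kernel sizes are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of a 3D CNN over stacked spectrogram segments (PyTorch).
# Input shape, kernel sizes, and layer count are illustrative assumptions.
import torch
import torch.nn as nn

class SpectroTemporal3DCNN(nn.Module):
    def __init__(self, n_emotions: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            # Kernels kept shallow along the temporal axis and deeper along
            # the spectral axis, per the paper's stated finding.
            nn.Conv3d(1, 16, kernel_size=(3, 7, 3), padding=(1, 3, 1)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=(3, 7, 3), padding=(1, 3, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> fixed-size embedding
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time_steps, mel_bins, frames_per_step)
        z = self.features(x).flatten(1)
        return self.classifier(z)

model = SpectroTemporal3DCNN()
dummy = torch.randn(2, 1, 10, 64, 8)  # 10 segments of a 64-band spectrogram
print(model(dummy).shape)             # torch.Size([2, 4])
```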

    Comprehensive Study of Automatic Speech Emotion Recognition Systems

    Speech emotion recognition (SER) is the technology that recognises psychological characteristics and feelings from speech signals through various techniques and methodologies. SER is challenging because of the considerable variation in arousal and valence levels across different languages. Technical developments in artificial intelligence and signal processing have encouraged and enabled the interpretation of emotions. SER plays a vital role in remote communication. This paper offers a recent survey of SER using machine learning (ML) and deep learning (DL)-based techniques. It focuses on the various feature representation and classification techniques used for SER, and further describes the databases and evaluation metrics used for speech emotion recognition.

    Speech emotion recognition with artificial intelligence for contact tracing in the COVID-19 pandemic

    If understanding sentiments is already a difficult task in human-human communication, it becomes extremely challenging in human-computer interaction, as for instance in chatbot conversations. In this work, a machine learning neural-network-based Speech Emotion Recognition system is presented to perform emotion detection in a chatbot virtual assistant whose task was to perform contact tracing during the COVID-19 pandemic. The system was tested on a novel dataset of audio samples provided by the company Blu Pantheon, which developed virtual agents capable of autonomously performing contact tracing for individuals who tested positive for COVID-19. The dataset provided was unlabelled for the emotions associated with the conversations. Therefore, the work was structured using a transfer learning strategy: first, the model was trained on the labelled, publicly available Italian-language EMOVO Corpus. The accuracy achieved in the testing phase reached 92%. To the best of the authors' knowledge, this work represents the first example of chatbot speech emotion recognition for contact tracing, shedding light on the importance of such techniques in virtual assistants and chatbot conversational contexts for assessing human psychological status. The code of this work was publicly released at: https://github.com/fp1acm8/SE
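
    The two-stage strategy the abstract describes (supervised training on EMOVO, then inference on the unlabelled chatbot audio) can be sketched as follows. The feature choice (mean MFCCs), the classifier, and the path lists are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of the two-stage strategy: train on the labelled EMOVO
# corpus, then run the model on the unlabelled chatbot recordings.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_embedding(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)  # one fixed-length vector per utterance

# Stage 1: supervised training on the labelled corpus.
# emovo_paths / emovo_labels are hypothetical placeholders.
X_train = np.stack([mfcc_embedding(p) for p in emovo_paths])
clf = SVC().fit(X_train, emovo_labels)

# Stage 2: predict emotions for the unlabelled contact-tracing audio.
# chatbot_paths is likewise a hypothetical placeholder.
X_new = np.stack([mfcc_embedding(p) for p in chatbot_paths])
pred = clf.predict(X_new)
```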

    Automatic Speech Recognition System to Analyze Autism Spectrum Disorder in Young Children

    It is possible to learn things about a person just by listening to their voice. When trying to construct an abstract concept of a speaker, it is essential to extract significant features from audio signals that are modulation-insensitive. This research assessed how individuals with autism spectrum disorder (ASD) recognise and recall voice identity. Both the ASD group and the control group performed equally well in a task in which they were asked to choose the name of a newly learned speaker based on his or her voice. However, the ASD group outperformed the control group in a subsequent familiarity test in which they were asked to differentiate between previously trained voices and untrained voices. Persons with ASD classified voices numerically according to exact acoustic characteristics, whereas non-autistic individuals classified voices qualitatively depending on the acoustic patterns associated with the speakers' physical and psychological traits. Child vocalisations show potential as an objective marker of developmental problems such as autism. In typical detection systems, hand-crafted acoustic features are fed into a discriminative classifier, but accuracy and resilience are limited by the amount of training data. This research addresses using CNN-learned feature representations to classify the speech of children with developmental problems. On the Child Pathological and Emotional Speech database, we compare several acoustic feature sets. CNN-based approaches perform comparably to conventional paradigms in terms of unweighted average recall.
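
    Unweighted average recall, the metric cited above, is the mean of per-class recalls, so minority classes weigh as much as majority ones. A minimal illustration with made-up labels:

```python
# UAR = mean of per-class recalls; every class counts equally
# regardless of how many samples it has.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1]  # imbalanced two-class example
y_pred = [0, 0, 0, 0, 1, 0]
uar = recall_score(y_true, y_pred, average="macro")
print(uar)  # (4/4 + 1/2) / 2 = 0.75
```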

    Multimodal Sentiment Sensing and Emotion Recognition Based on Cognitive Computing Using Hidden Markov Model with Extreme Learning Machine

    In today's competitive business environment, the exponential increase of multimodal content results in a massive amount of shapeless data. Unstructured big data has no specific format or organisation and can take any form, including text, audio, photos, and video. According to the literature, many assumptions and algorithms are generally required to recognise different emotions, and most work on emotion recognition focuses on a single modality, such as voice, facial expression, or bio-signals. This paper proposes a novel technique for multimodal sentiment sensing with emotion recognition using artificial intelligence. Audio and visual data were collected from social media reviews and classified using a hidden Markov model-based extreme learning machine (HMM_ExLM), which is used to train the features; simultaneously, the speech emotional traits are suitably maximised. For expression photographs, a region-splitting strategy is employed, and different weights are assigned to each region to extract information. Speech and facial expression data are then merged using decision-level fusion, and the speech properties of each expression in the facial regions are used for categorisation. Experimental findings show that combining speech and expression features boosts performance greatly compared to using either alone. A parametric comparison was made in terms of accuracy, recall, precision, and optimisation level.
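
    The decision-level fusion step can be sketched in a few lines: each modality's classifier contributes class posteriors, and a weighted average picks the final emotion. The weights and posterior values below are illustrative assumptions, not values from the paper.

```python
# Sketch of decision-level fusion: each modality votes with its class
# posteriors; a weighted average decides the final label.
import numpy as np

def fuse_decisions(p_speech: np.ndarray, p_face: np.ndarray,
                   w_speech: float = 0.6, w_face: float = 0.4) -> int:
    """Combine per-class posteriors from two modalities and pick a class."""
    fused = w_speech * p_speech + w_face * p_face
    return int(np.argmax(fused))

# Hypothetical posteriors over four emotions from each modality.
p_speech = np.array([0.10, 0.60, 0.20, 0.10])
p_face   = np.array([0.05, 0.30, 0.55, 0.10])
print(fuse_decisions(p_speech, p_face))  # 1: speech dominates with w=0.6
```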

    Integrated Approach for Emotion Detection via Speech and Text Analysis

    This paper aims to provide a comprehensive solution for effective reviews using deep learning models. Customers often have difficulty finding accurate reviews of the things they are interested in. The proposed framework implements a review mechanism to address this problem, giving customers relevant reviews based on video reviews supplied in the product description. The goal of this system is to turn video reviews into a rating, so that viewers may get a summary of a review simply by glancing at the rating, without having to watch the full video. To achieve this, the model uses deep learning neural networks for both text and audio processing. The well-known RAVDESS dataset serves as the basis for the audio model's training and offers a wide range of emotional expressions. The suggested system uses two methods to assess reviews: text-based natural language processing and audio frequency spectrograms. By combining these two techniques, it can provide consumers with accurate and trustworthy ratings while guaranteeing that the review procedure is not impeded. The aim is achieved with high accuracy, ensuring that users can make informed decisions when purchasing products based on the provided reviews. With the aid of this review system, customers will be able to quickly find crucial details about a product they are interested in, increasing their satisfaction and loyalty.
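
    The two-path design (audio spectrograms plus text NLP) reduces, at inference time, to fusing two modality scores into one rating. A hedged sketch under assumed names: hypothetical_audio_model is a stand-in for a trained network, and the mapping from [0, 1] to a 1-5 rating is invented for illustration.

```python
# Hedged sketch of the two-path idea: an audio path over a mel spectrogram
# and a text path over the transcript, averaged into one rating.
import numpy as np
import librosa

def audio_sentiment(path: str) -> float:
    """Placeholder audio path: a real system feeds the spectrogram to a CNN."""
    y, sr = librosa.load(path, sr=22050)
    spec = librosa.feature.melspectrogram(y=y, sr=sr)
    # hypothetical_audio_model is a stand-in for a trained network.
    return hypothetical_audio_model(spec)  # sentiment score in [0, 1]

def to_rating(audio_score: float, text_score: float) -> int:
    """Average the two modality scores and map [0, 1] onto a 1-5 rating."""
    fused = (audio_score + text_score) / 2
    return 1 + round(fused * 4)
```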

    The Wits intelligent teaching system (WITS): a smart lecture theatre to assess audience engagement

    A Thesis submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy, 2017.

    The utility of lectures is directly related to the engagement of the students therein. To ensure the value of lectures, one needs to be certain that they are engaging to students. In small classes, experienced lecturers develop an intuition of how engaged the class is as a whole and can react appropriately to remedy the situation through various strategies such as breaks or changes in style, pace, and content. As both the number of students and the size of the venue grow, this type of contingent teaching becomes increasingly difficult and less precise. Furthermore, relying on intuition alone gives no way to recall and analyse previous classes or to objectively investigate trends over time. To address these problems, this thesis presents the Wits Intelligent Teaching System (WITS) to highlight disengaged students during class.

    A web-based mobile application called Engage was developed to try to elicit anonymous engagement information directly from students. The majority of students were unwilling or unable to self-report their engagement levels during class. This stems from a number of cultural and practical issues related to social display rules, unreliable internet connections, data costs, and distractions. This result highlights the need for a non-intrusive system that does not require the active participation of students. A non-intrusive approach based on computer vision and machine learning is therefore proposed. To support its development, a labelled video dataset of students was built by recording a number of first-year lectures. Students were labelled across a number of affects (including boredom, frustration, confusion, and fatigue), but poor inter-rater reliability meant that these labels could not be used as ground truth. Based on manual coding methods identified in the literature, a number of actions, gestures, and postures were identified as proxies of behavioural engagement. These proxies are used in an observational checklist to mark students as engaged or not.

    A Support Vector Machine (SVM) was trained on Histograms of Oriented Gradients (HOG) to classify the students based on the identified behaviours. The results suggest a high temporal correlation between a single subject's video frames, which leads to extremely high accuracies on seen subjects. However, this approach generalised poorly to unseen subjects, and more careful feature engineering is required. The use of Convolutional Neural Networks (CNNs) improved the classification accuracy substantially, both over a single subject and when generalising to unseen subjects. While more computationally expensive than the SVM, the CNN approach lends itself to parallelism using Graphics Processing Units (GPUs). With GPU hardware acceleration, the system is able to run in near real-time, and with further optimisations a real-time classifier is feasible.

    The classifier provides engagement values, which can be displayed to the lecturer live during class. This information is displayed as an Interest Map, which highlights spatial areas of disengagement. The lecturer can then make informed decisions about how to progress with the class, what teaching styles to employ, and which students to focus on. An Interest Map was presented to lecturers and professors at the University of the Witwatersrand, yielding 131 responses. The vast majority of respondents indicated that they would like to receive live engagement feedback during class, that they found the Interest Map an intuitive visualisation tool, and that they would be interested in using such technology.

    Contributions of this thesis include the development of a labelled video dataset; a web-based system allowing students to self-report engagement; cross-platform, open-source software for spatial, action, and affect labelling; the application of HOG-based Support Vector Machines and deep Convolutional Neural Networks to classify these data; an Interest Map to intuitively display engagement information to presenters; and finally an analysis of the acceptance of such a system by educators.
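
    The HOG + SVM baseline the thesis describes can be sketched briefly: HOG descriptors of per-student image crops feed a linear SVM that marks each crop engaged or disengaged. Crop size, HOG parameters, and the stand-in data below are illustrative assumptions.

```python
# Minimal sketch of the HOG + SVM baseline: HOG descriptors from grayscale
# per-student crops feed a linear SVM for engaged/disengaged classification.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(gray_crop: np.ndarray) -> np.ndarray:
    # gray_crop: a grayscale image of one student, e.g. 64x64 pixels
    return hog(gray_crop, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Hypothetical training data standing in for real crops and checklist labels.
crops = np.random.rand(20, 64, 64)
labels = np.random.randint(0, 2, size=20)
X = np.stack([hog_descriptor(c) for c in crops])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:3]))  # engagement prediction per crop
```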