
    Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

    Automatic emotion recognition from speech, an important and challenging task in the field of affective computing, relies heavily on the effectiveness of the speech features used for classification. Previous approaches to emotion recognition have mostly focused on extracting carefully hand-crafted features, and how to model spatio-temporal dynamics for speech emotion recognition effectively is still under active investigation. In this paper, we propose a method to tackle the problem of emotion-relevant feature extraction from speech by combining attention-based bidirectional long short-term memory recurrent neural networks with fully convolutional networks, automatically learning spatio-temporal representations of speech signals. The learned high-level features are then fed into a deep neural network (DNN) to predict the final emotion. Experimental results on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpora show that our method provides more accurate predictions than existing emotion recognition algorithms.
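
The pipeline the abstract outlines can be pictured concretely. Below is a minimal PyTorch sketch, assuming log-Mel spectrogram input: a fully convolutional front-end for spatial (time-frequency) patterns, an attention-based bidirectional LSTM for temporal dynamics, and a DNN head that predicts the emotion. All layer sizes and the pooling and attention choices are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the described architecture: FCN front-end,
# attention-based BLSTM, DNN classifier head. Sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnBLSTMFCN(nn.Module):
    def __init__(self, n_mels=64, n_emotions=4, hidden=128):
        super().__init__()
        # Fully convolutional front-end: local time-frequency patterns.
        self.fcn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                    # pool frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat_dim = 64 * (n_mels // 4)                # channels x pooled mel bins
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)         # scalar score per frame
        self.head = nn.Sequential(                   # DNN classifier head
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, n_emotions),
        )

    def forward(self, spec):                         # spec: (B, 1, n_mels, T)
        x = self.fcn(spec)                           # (B, 64, n_mels//4, T)
        x = x.flatten(1, 2).transpose(1, 2)          # (B, T, feat_dim)
        h, _ = self.blstm(x)                         # (B, T, 2*hidden)
        w = F.softmax(self.attn(h), dim=1)           # attention over frames
        ctx = (w * h).sum(dim=1)                     # weighted temporal pooling
        return self.head(ctx)                        # emotion logits

logits = AttnBLSTMFCN()(torch.randn(2, 1, 64, 100))
print(logits.shape)  # torch.Size([2, 4])
```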

    Learning Attention Mechanisms and Context: An Investigation into Vision and Emotion

    Attention mechanisms for context modelling are becoming ubiquitous in neural architectures in machine learning. An attention mechanism is a technique that filters out information irrelevant to a given task and focuses on learning task-dependent fixation points or regions. Attention mechanisms thus pose two questions about a given task: 'what' to learn and 'where/how' to learn it for task-specific context modelling. Here, context means the conditional variables instrumental in deciding the categorical distribution for the given data. Why is learning task-specific context necessary? To answer these questions, this thesis explores context modelling with attention in the vision and emotion domains, using attention mechanisms with different hierarchical structures.

    The three main goals of this thesis are building superior classifiers using attention-based deep neural networks (DNNs), investigating the role of context modelling in the given tasks, and developing a framework for interpreting hierarchies and attention in deep attention networks. In the vision domain, gesture and posture recognition tasks in diverse environments are chosen; in emotion, visual and speech emotion recognition tasks are chosen. These tasks are selected for their sequential properties, which suit spatiotemporal context modelling. A key challenge from a machine learning standpoint is to extract patterns that correlate maximally with the information encoded in a signal while remaining as insensitive as possible to the other kinds of information the signal carries. One way to overcome this problem is to learn task-dependent representations. To achieve that, novel spatiotemporal context modelling networks and mixture of multi-view attention (MOMA) networks are proposed, built from bidirectional long short-term memory (BLSTM), convolutional neural network (CNN), Capsule, and attention networks. A framework is also proposed to interpret the internal attention states with respect to the given task.

    The proposed classifiers are compared with state-of-the-art DNNs on the assigned tasks and achieve superior results. Context in speech emotion recognition is explored in depth with the attention interpretation framework, which shows that the proposed model can assign word importance based on acoustic context. Furthermore, the internal states of the attention correlate with human perception of acoustic cues for speech emotion recognition. Overall, the results demonstrate superior classifiers and context learning models with interpretable frameworks. These findings are important for speech emotion recognition systems: the thesis not only produces better models but also explores their interpretability and analyses their internal states. Aligning phones and words with the attention vectors shows that vowel sounds are more important than consonants for defining emotional acoustic cues, and visualising attention weights over words demonstrates how the models use word importance to predict emotions.

    In a broader perspective, the findings on gesture, posture and emotion recognition may be helpful in tasks such as human-robot interaction (HRI) and conversational artificial agents (such as Siri or Alexa). Communication is grounded in symbolic and sub-symbolic cues of intent, whether visual, audio or haptic, and understanding intent depends heavily on reasoning about the situational context. Emotion, i.e. speech and visual emotion, provides context to a situation and is a deciding factor in response generation. Emotional intelligence and information from vision, audio and other modalities are essential for making human-human and human-robot communication more natural and feedback-driven.
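
As a concrete illustration of the attention-interpretation idea described above, the sketch below averages per-frame attention weights within each word's aligned span to obtain a word-importance score of the kind the thesis visualises. The words, frame alignments, and weights are made up for demonstration; nothing here reflects the thesis's actual implementation.

```python
# Word importance from per-frame attention weights: average the
# weights inside each word's aligned frame span, then rank the words.
import torch

def word_importance(attn, spans):
    """Average per-frame attention weights within each word's span."""
    return [attn[s:e].mean().item() for s, e in spans]

attn = torch.softmax(torch.randn(12), dim=0)   # per-frame weights, sum to 1
words = ["I", "really", "hate", "this"]        # hypothetical utterance
spans = [(0, 2), (2, 6), (6, 10), (10, 12)]    # frame alignment per word
for word, score in sorted(zip(words, word_importance(attn, spans)),
                          key=lambda p: -p[1]):
    print(f"{word:>8s}  {score:.3f}")
```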