28,097 research outputs found

    Multimodal Emotion Recognition System Using Machine Learning Classifier

    Multi-modal emotion recognition refers to the process of identifying human emotions using information from multiple sources, such as facial expressions, voice intonation, and EEG signals. Emotion recognition is poised to play a pivotal role in healthcare, education, and customer service, and as the technology progresses it is imperative to address the associated privacy concerns responsibly. Challenges in multi-modal emotion recognition include aligning data from different modalities in time and dealing with noisy or incomplete information. In this paper, we address these challenges by employing a support vector machine (SVM) as our machine learning classifier, using the IEMOCAP dataset for speech and video and the DEAP dataset for EEG signals. The SVM achieves 76.22% accuracy on IEMOCAP and 68.89% accuracy on DEAP.
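
    As a rough illustration of the classification stage described in this abstract, the sketch below trains a scikit-learn SVM on placeholder feature vectors. The feature dimension, number of emotion classes, and RBF-kernel settings are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: SVM emotion classifier on pre-extracted features.
# Synthetic data stands in for IEMOCAP/DEAP features; shapes and labels are illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))    # placeholder feature vectors (e.g., fused audio/video or EEG features)
y = rng.integers(0, 4, size=1000)   # placeholder labels for 4 hypothetical emotion classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features, then fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```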

    Bridging Modalities: Knowledge Distillation and Masked Training for Translating Multi-Modal Emotion Recognition to Uni-Modal, Speech-Only Emotion Recognition

    This paper presents an approach to the challenge of translating multi-modal emotion recognition models into a more practical and resource-efficient uni-modal counterpart, specifically speech-only emotion recognition. Recognizing emotions from speech signals is a critical task with applications in human-computer interaction, affective computing, and mental health assessment. However, existing state-of-the-art models often rely on multi-modal inputs, incorporating information from sources such as facial expressions and gestures, which may not be readily available or feasible in real-world scenarios. To tackle this issue, we propose a novel framework that leverages knowledge distillation and masked training techniques.
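
    To make the distillation idea concrete, here is a minimal sketch of a logit-level distillation loss from a frozen multi-modal teacher to a speech-only student. The temperature, loss weighting, and random tensors standing in for model outputs are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of logit-level knowledge distillation from a frozen multi-modal
# teacher to a speech-only student. T and alpha are assumed hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: student mimics the teacher's softened class distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the emotion labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for model outputs.
batch, n_classes = 8, 4
student_logits = torch.randn(batch, n_classes, requires_grad=True)
teacher_logits = torch.randn(batch, n_classes)   # teacher outputs are treated as fixed targets
labels = torch.randint(0, n_classes, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```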

    Efficient emotion recognition using hyperdimensional computing with combinatorial channel encoding and cellular automata

    In this paper, a hardware-optimized approach to emotion recognition based on the efficient brain-inspired hyperdimensional computing (HDC) paradigm is proposed. Emotion recognition provides valuable information for human-computer interaction; however, the large number of input channels (>200) and modalities (>3) involved makes it significantly expensive from a memory perspective. To address this, methods for memory reduction and optimization are proposed, including a novel approach that takes advantage of the combinatorial nature of the encoding process and an elementary cellular automaton. HDC with early sensor fusion is implemented alongside the proposed techniques, achieving two-class multi-modal classification accuracies of >76% for valence and >73% for arousal on the multi-modal AMIGOS and DEAP datasets, almost always exceeding the state of the art. The required vector storage is seamlessly reduced by 98% and the frequency of vector requests by at least 1/5. The results demonstrate the potential of efficient hyperdimensional computing for low-power, multi-channel emotion recognition tasks.
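
    The following sketch illustrates the general flavour of hyperdimensional encoding with early sensor fusion (random bipolar item memories, bind-and-bundle encoding, prototype comparison). The dimensionality, channel count, and quantisation levels are arbitrary assumptions, and the paper's combinatorial and cellular-automaton memory optimisations are not reproduced here.

```python
# Minimal sketch of HDC encoding with early sensor fusion: each channel gets a random
# bipolar hypervector, a quantised sample value is bound to it, and all channel
# contributions are bundled into one vector. All sizes are illustrative assumptions.
import numpy as np

D, N_CHANNELS, N_LEVELS = 10000, 32, 16
rng = np.random.default_rng(0)
channel_hv = rng.choice([-1, 1], size=(N_CHANNELS, D))   # item memory: one hypervector per channel
level_hv = rng.choice([-1, 1], size=(N_LEVELS, D))       # item memory: one hypervector per value level

def encode(sample):
    """sample: array of N_CHANNELS values in [0, 1] -> single bipolar hypervector."""
    levels = np.clip((sample * N_LEVELS).astype(int), 0, N_LEVELS - 1)
    bound = channel_hv * level_hv[levels]                 # bind each channel with its quantised value
    return np.sign(bound.sum(axis=0))                     # bundle across channels (majority sign)

# Toy classification step: compare an encoded query against a bundled class prototype.
prototype = np.sign(sum(encode(rng.random(N_CHANNELS)) for _ in range(20)))
query = encode(rng.random(N_CHANNELS))
print("similarity to prototype:", query @ prototype / D)  # normalised dot product
```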

    HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

    Emotion recognition in conversations is challenging due to the multi-modal nature of emotion expression. We propose a hierarchical cross-attention model (HCAM) for multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a Bidirectional Encoder Representations from Transformers (BERT) model. The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention that converts each utterance in a given conversation into a fixed-dimensional embedding. To incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer that weighs the utterance-level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, the text layers, and the multi-modal co-attention layers are trained hierarchically for the emotion classification task. We perform experiments on three established datasets, IEMOCAP, MELD, and CMU-MOSI, where we show that the proposed model improves significantly over other benchmarks and achieves state-of-the-art results on all of them. Comment: 11 pages, 6 figures.
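
    A minimal PyTorch sketch of the co-attention fusion idea follows: utterance-level audio and text embeddings (which would come from wav2vec- and BERT-style encoders, omitted here) cross-attend to each other before classification. The class name, dimensions, and head counts are assumptions, not the published HCAM architecture.

```python
# Simplified co-attention fusion of utterance-level audio and text embeddings.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_classes=4):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, audio_emb, text_emb):
        # audio_emb, text_emb: (batch, n_utterances, dim) sequences of utterance embeddings
        a_ctx, _ = self.audio_to_text(audio_emb, text_emb, text_emb)   # audio queries attend to text
        t_ctx, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)  # text queries attend to audio
        fused = torch.cat([a_ctx, t_ctx], dim=-1)                      # per-utterance fused representation
        return self.classifier(fused)                                  # per-utterance emotion logits

model = CoAttentionFusion()
logits = model(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(logits.shape)  # torch.Size([2, 10, 4])
```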

    TACOformer: Token-channel compounded Cross Attention for Multimodal Emotion Recognition

    Recently, emotion recognition based on physiological signals has emerged as a field of intensive research. The use of multi-modal, multi-channel physiological signals has significantly improved the performance of emotion recognition systems due to their complementarity. However, effectively integrating emotion-related semantic information from different modalities and capturing inter-modal dependencies remains challenging. Many existing multimodal fusion methods ignore either token-to-token or channel-to-channel correlations of multichannel signals from different modalities, which limits the classification capability of the models. In this paper, we propose a comprehensive perspective on multimodal fusion that integrates channel-level and token-level cross-modal interactions. Specifically, we introduce a unified cross-attention module called Token-chAnnel COmpound (TACO) Cross Attention to perform multimodal fusion, simultaneously modeling channel-level and token-level dependencies between modalities. Additionally, we propose a 2D position encoding method to preserve information about the spatial distribution of EEG signal channels, and we use two transformer encoders ahead of the fusion module to capture long-term temporal dependencies from the EEG signal and the peripheral physiological signals, respectively. Subject-independent experiments on the DEAP and Dreamer emotion datasets demonstrate that the proposed model achieves state-of-the-art performance. Comment: Accepted by the IJCAI 2023 AI4TS workshop.
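
    As a simplified illustration of compounding token-level and channel-level interactions, the sketch below applies cross attention along the token axis and, after transposing, along the feature axis, which stands in for the channel axis here. It is not the paper's TACO module; the class name, dimensions, and head counts are assumptions.

```python
# Simplified "compound" cross attention: one attention pass over tokens, one over the
# transposed axis standing in for channels, with the two views summed.
import torch
import torch.nn as nn

class TokenChannelCrossAttention(nn.Module):
    def __init__(self, seq_len=128, dim=64, n_heads=4):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.chan_attn = nn.MultiheadAttention(seq_len, n_heads, batch_first=True)

    def forward(self, eeg, periph):
        # eeg, periph: (batch, seq_len, dim) token sequences from two modality encoders
        tok, _ = self.token_attn(eeg, periph, periph)          # token-to-token dependencies
        chan, _ = self.chan_attn(eeg.transpose(1, 2),          # "channel"-to-"channel" dependencies
                                 periph.transpose(1, 2),
                                 periph.transpose(1, 2))
        return tok + chan.transpose(1, 2)                      # compound the two interaction views

fusion = TokenChannelCrossAttention()
out = fusion(torch.randn(2, 128, 64), torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 128, 64])
```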

    Emotion in faces and voices: recognition and production by young and older adults

    Older adults are less accurate than younger adults at emotion recognition. Given that deficits in emotion recognition have been associated with interpersonal conflict and, in turn, a reduced quality of life, understanding how older adults process emotion is vital. Thus, the research strategy of the current thesis was to investigate specific questions concerning how older adults process emotion information, e.g., how well older adults process a variety of emotional expression types and whether having multiple sources of expressive information improves emotion recognition; how older adults extract information from emotional expressions; where problems might occur during the emotion recognition process; and whether poor emotion recognition is related to poor emotion production. My approach differed from the majority of other studies in that I tested multi-modal, dynamic spoken expressions in addition to the unimodal and static emotional expressions that are typically used for assessing emotion recognition. The first experiment assessed emotion recognition for auditory-visual (AV), visual-only (VO), and auditory-only (AO) speech stimuli (Chapter 2). The second series of experiments focused on VO expressions of emotion and investigated how older adults extract information from such expressions. The final two experiments investigated potential problems that older adults may encounter during the emotion recognition process. Taken together, the experiments in this thesis provide insight into the differences in the way older and younger adults process emotional expressions. Such insights can be used to develop reliable and ecologically valid tools to assess the emotion recognition ability of older adults.

    Robust Latent Representations via Cross-Modal Translation and Alignment

    Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available for testing. This is a limitation when the signals from some modalities are unavailable or severely degraded by noise. To address this limitation, we aim to improve the testing performance of uni-modal systems by using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of the weaker modalities. The translation from the weaker to the stronger modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representations in a shared latent space. We validate the proposed approach on the AVEC 2016 dataset for continuous emotion recognition and show that it achieves state-of-the-art (uni-modal) performance for the weaker modalities.
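
    A minimal sketch of the training-time idea follows, assuming simple linear encoders: the weaker modality is encoded, translated toward the stronger modality's features, and aligned with the stronger modality's latents via a correlation-style (cosine) loss. Architectures, dimensions, and loss weighting are illustrative assumptions, not the paper's exact framework.

```python
# Training-time sketch: cross-modal translation plus latent alignment for a weaker modality.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeakBranch(nn.Module):
    def __init__(self, weak_dim=40, strong_dim=128, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(weak_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.translator = nn.Linear(latent_dim, strong_dim)   # predict the stronger modality's features
        self.regressor = nn.Linear(latent_dim, 2)             # e.g. continuous valence/arousal outputs

    def forward(self, weak_x):
        z = self.encoder(weak_x)
        return z, self.translator(z), self.regressor(z)

def alignment_loss(z_weak, z_strong):
    # Correlation-style alignment: push paired latents toward high cosine similarity.
    return 1 - F.cosine_similarity(z_weak, z_strong, dim=-1).mean()

# Toy training step; the stronger modality is only needed at training time.
weak_x, strong_feat = torch.randn(8, 40), torch.randn(8, 128)
strong_latent, target = torch.randn(8, 64), torch.randn(8, 2)
z, strong_pred, y_hat = WeakBranch()(weak_x)
loss = F.mse_loss(y_hat, target) + F.mse_loss(strong_pred, strong_feat) + alignment_loss(z, strong_latent)
loss.backward()
```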

    Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition

    Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multi-modal fusion, the combination of metrics from multiple data sources, has been shown to be a sound method for improving recognition performance. However, while promising results have been reported by recent multi-modal-based models, they generally fail to leverage sophisticated fusion strategies that would model sufficient cross-modal interactions when producing the fusion representation; instead, current methods rely on lengthy and inconsistent data preprocessing and feature crafting. To address this limitation, we propose an end-to-end multi-modal transformer framework for multi-modal human state recognition called Husformer. Specifically, we propose to use cross-modal transformers, which inspire one modality to reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Using these two attention mechanisms enables effective and adaptive adjustment to noise and interruptions in multi-modal signals during the fusion process and in relation to high-level features. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad) demonstrate that, for human state recognition, our Husformer outperforms both state-of-the-art multi-modal baselines and the use of a single modality by a large margin, especially when dealing with raw multi-modal signals. We also conduct an ablation study to show the benefit of each component of Husformer.
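
    The sketch below illustrates the two-stage pattern described above, assuming three generic input streams: each modality cross-attends to the others, and a self-attention transformer then refines the concatenated fusion sequence. The module names, layer sizes, and pooling are assumptions, not the published Husformer design.

```python
# Two-stage fusion sketch: per-modality cross attention followed by a self-attention encoder.
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    def __init__(self, dim=64, n_heads=4, n_classes=2):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, n_classes)

    def forward(self, streams):
        # streams: list of (batch, seq_len, dim) tensors, one per modality
        reinforced = []
        for i, q in enumerate(streams):
            others = torch.cat([s for j, s in enumerate(streams) if j != i], dim=1)
            out, _ = self.cross(q, others, others)   # modality i attends to the other modalities
            reinforced.append(out)
        fused = torch.cat(reinforced, dim=1)         # fusion sequence over all modalities
        fused = self.self_enc(fused)                 # self-attention refines contextual information
        return self.head(fused.mean(dim=1))          # pooled representation -> state logits

model = TwoStageFusion()
logits = model([torch.randn(2, 16, 64), torch.randn(2, 16, 64), torch.randn(2, 16, 64)])
print(logits.shape)  # torch.Size([2, 2])
```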