
    Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations

    Emotion recognition in conversations is essential for advanced human-machine interaction. However, building robust and accurate emotion recognition systems for real-life use is challenging, mainly because of the scarcity of emotion datasets collected in the wild and the difficulty of taking the dialogue context into account. The CEMO dataset, composed of conversations between agents and patients during emergency calls to a French call center, fills this gap. The nature of these interactions highlights the role of the conversation's emotional flow in predicting patient emotions, as context can often make the difference in understanding actual feelings. This paper presents a multi-scale conversational context learning approach for speech emotion recognition that builds on this hypothesis. We investigated the approach on both speech transcriptions and acoustic segments. Experimentally, our method uses information preceding or following the targeted segment. In the text domain, we tested context windows over a wide range of tokens (from 10 to 100) and at the speech-turn level, considering inputs from both the same and the opposing speaker. According to our tests, context derived from previous tokens influences accurate prediction more than the following tokens. Furthermore, taking the same speaker's last speech turn in the conversation appears useful. In the acoustic domain, we conducted an in-depth analysis of the impact of the surrounding emotions on the prediction. While multi-scale conversational context learning with Transformers can improve performance in the textual modality for emergency call recordings, incorporating acoustic context is more challenging.
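    The textual context-window idea described above can be sketched as follows: gather up to a token budget of preceding (or following) context, optionally restricted to the target speaker's own turns, before handing the combined input to a Transformer classifier. All names and parameters here are illustrative assumptions, not the paper's implementation.

    ```python
    # Hypothetical sketch of assembling a multi-scale textual context window.
    # `turns` is the conversation as (speaker, tokens) pairs in order.

    def build_context_input(turns, target_idx, max_tokens=50,
                            direction="previous", same_speaker_only=False):
        """Return (context_tokens, target_tokens) for one labelled segment.

        direction: "previous" uses earlier turns, "next" uses later ones.
        same_speaker_only: restrict context to the target speaker's turns.
        """
        target_speaker, target_tokens = turns[target_idx]
        if direction == "previous":
            candidates = reversed(turns[:target_idx])
        else:
            candidates = iter(turns[target_idx + 1:])

        context = []
        for speaker, tokens in candidates:
            if same_speaker_only and speaker != target_speaker:
                continue
            remaining = max_tokens - len(context)
            if remaining <= 0:
                break
            if direction == "previous":
                # Keep the tokens closest to the target segment.
                context = tokens[-remaining:] + context
            else:
                context = context + tokens[:remaining]
        return context, target_tokens
    ```

    With a budget of 3 tokens and a two-turn dialogue, the context keeps the last three tokens of the preceding turn, mirroring the paper's 10-to-100-token sweep at larger budgets.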

    End-to-End Speech Emotion Recognition: Challenges of Real-Life Emergency Call Centers Data Recordings

    Recognizing a speaker's emotion from their speech can be a key element in emergency call centers. End-to-end deep learning systems for speech emotion recognition now achieve results equivalent to or better than conventional machine learning approaches. In this paper, to validate the performance of our neural network architecture for emotion recognition from speech, we first trained and tested it on the widely used corpus accessible to the community, IEMOCAP. We then applied the same architecture to the real-life corpus, CEMO, composed of 440 dialogs (2h16m) from 485 speakers. The most frequent emotions expressed by callers in these real-life emergency dialogues are fear, anger, and positive emotions such as relief. In the IEMOCAP general-topic conversations, the most frequent emotions are sadness, anger, and happiness. Using the same end-to-end deep learning architecture, an unweighted average recall (UA) of 63% is obtained on IEMOCAP and a UA of 45.6% on CEMO, each with 4 classes. Using only 2 classes (Anger, Neutral), the results for CEMO are 76.9% UA compared with 81.1% UA for IEMOCAP. We expect that these encouraging results on CEMO can be improved by combining the audio channel with the linguistic channel. Real-life emotions are clearly more complex than acted ones, mainly due to the large diversity of speakers' emotional expressions. Index Terms: emotion detection, end-to-end deep learning architecture, call center, real-life database, complex emotions.
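    The UA metric reported above weights every emotion class equally, which matters on imbalanced real-life data such as CEMO where some classes dominate. A minimal sketch of the computation (the function name is illustrative):

    ```python
    # Unweighted average recall: the mean of per-class recalls, so a rare
    # class counts as much as a frequent one.
    from collections import defaultdict

    def unweighted_average_recall(y_true, y_pred):
        """Mean of per-class recalls over the classes present in y_true."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for t, p in zip(y_true, y_pred):
            total[t] += 1
            if t == p:
                correct[t] += 1
        recalls = [correct[c] / total[c] for c in total]
        return sum(recalls) / len(recalls)
    ```

    On a skewed label set, this can diverge sharply from plain accuracy, which is why UA is the usual headline number for corpora like CEMO.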

    Multi-Scale Contextual Learning for Emotion Recognition in Emergency Call Center Conversations (in French).

    Emotion recognition in conversations is essential for advanced human-machine interaction. However, building robust and accurate emotion recognition systems for real-life use is challenging, mainly because of the scarcity of emotion datasets collected in the wild and the difficulty of taking the dialogue context into account. The CEMO dataset, composed of conversations between agents and patients during emergency calls to a French call center, fills this gap. The nature of these interactions highlights the role of the conversation's emotional flow in predicting patient emotions, as context can often make the difference in understanding observed emotional expressions. This paper presents a multi-scale conversational context learning approach for speech emotion recognition that builds on this hypothesis. We investigated the approach on both speech transcriptions and acoustic segments. Experimentally, our method uses information preceding or following the targeted segment. In the text domain, we tested context windows over a wide range of tokens (from 10 to 100) and at the speech-turn level, considering inputs from both the same and the opposing speaker. According to our tests, context derived from previous tokens influences accurate prediction more than the following tokens. Furthermore, taking the same speaker's last speech turn in the conversation appears useful. In the acoustic domain, we conducted an in-depth analysis of the impact of the surrounding emotions on the prediction. While multi-scale conversational context learning with Transformers can improve performance in the textual modality for emergency call recordings, incorporating acoustic context is more challenging.

    A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions

    Speech emotion recognition (SER) has received a great deal of attention in recent years in the context of spontaneous conversations. While there have been notable results on datasets like the well-known corpus of naturalistic dyadic conversations, IEMOCAP, for both categorical and dimensional emotions, few papers try to predict both paradigms at the same time. In this work, we therefore aim to highlight the performance contribution of multi-task learning by proposing a multi-task, multi-modal system that predicts categorical and dimensional emotions. The results emphasise the importance of cross-regularisation between the two paradigms.

    *Upon reviewing the fully released version of [20], it came to our attention that the proceedings version of this work did not include the speaker-independent results for [22]. We therefore include in brackets the speaker-independent results for a fair comparison with our model. Additionally, we were unable to ascertain any details regarding speaker-independence in [5], and therefore advise that these results be interpreted with caution.
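    One common way to realise the multi-task idea sketched in this abstract is a shared encoder with two heads, trained on a weighted sum of a categorical cross-entropy and a dimensional regression loss. The sketch below computes such a combined loss for one example; the weighting `alpha` and the use of plain MSE for the dimensional head are assumptions for illustration, not necessarily the paper's choices.

    ```python
    # Hypothetical combined loss for joint categorical + dimensional emotion
    # prediction. `probs` are softmax outputs over emotion classes;
    # `pred_dims`/`true_dims` are (valence, arousal, dominance) values.
    import math

    def cross_entropy(probs, target_idx):
        """Negative log-likelihood of the true categorical class."""
        return -math.log(probs[target_idx])

    def mse(pred_dims, true_dims):
        """Mean squared error over the dimensional predictions."""
        return sum((p - t) ** 2 for p, t in zip(pred_dims, true_dims)) / len(pred_dims)

    def multitask_loss(probs, target_idx, pred_dims, true_dims, alpha=0.5):
        """Weighted sum: alpha * categorical + (1 - alpha) * dimensional."""
        return alpha * cross_entropy(probs, target_idx) + (1 - alpha) * mse(pred_dims, true_dims)
    ```

    Because gradients from both heads flow through the shared encoder, each task regularises the other, which is the cross-regularisation effect the abstract highlights.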

    Deep learning ancient map segmentation to assess historical landscape changes

    Ancient geographical maps are our window into the past for understanding the spatial dynamics of the last centuries. This paper proposes a novel approach to this problem using deep learning. Convolutional neural networks (CNNs) are today the state-of-the-art methods for a wide variety of image-processing problems. The Cassini map, created in the eighteenth century, is used to illustrate our methodology. Our approach extracts the surfaces of four land classes from the Cassini map: forests, heaths, arboricultural areas, and hydrological features. The evolution of land use between the end of the eighteenth century and today was quantified by comparison with the Corine Land Cover (CLC) database. For the Rhone watershed, the results show that forests, arboriculture, and heaths are more extensive on the CLC map, in contrast to the hydrological network. These unprecedented results reveal major anthropo-climatic changes.
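    Once both maps are segmented into class rasters, the land-use comparison reduces to counting, per class, how much surface each map assigns and taking the difference. A minimal sketch, assuming label rasters as nested lists and an illustrative hectares-per-pixel scale (class names are examples, not the paper's exact labels):

    ```python
    # Per-class surface comparison between two segmentation rasters,
    # e.g. a segmented Cassini sheet vs the corresponding CLC extract.
    from collections import Counter

    def class_areas(raster, hectares_per_pixel=1.0):
        """Surface per class label in a 2-D label raster."""
        counts = Counter(label for row in raster for label in row)
        return {c: n * hectares_per_pixel for c, n in counts.items()}

    def area_change(old_raster, new_raster, hectares_per_pixel=1.0):
        """Per-class surface difference: new minus old (positive = expansion)."""
        old = class_areas(old_raster, hectares_per_pixel)
        new = class_areas(new_raster, hectares_per_pixel)
        return {c: new.get(c, 0.0) - old.get(c, 0.0)
                for c in set(old) | set(new)}
    ```

    A positive value for a class means it is more extensive on the newer map, which is how expansions such as the forest gain on the CLC map would show up.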
