
    Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks

    Emotion recognition has become an important field of research in the human-computer interaction domain. The latest advancements in the field show that combining visual with audio information leads to better results than using a single source of information on its own. From a visual point of view, a human emotion can be recognized by analyzing the facial expression of the person. More precisely, a human emotion can be described through a combination of several Facial Action Units. In this paper, we propose a system that is able to recognize emotions with a high accuracy rate and in real time, based on deep Convolutional Neural Networks. In order to increase the accuracy of the recognition system, we also analyze the speech data and fuse the information coming from both sources, i.e., visual and audio. Experimental results show the effectiveness of the proposed scheme for emotion recognition and the importance of combining visual with audio data.
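    A minimal sketch of the late audio-visual fusion idea described in this abstract, written in PyTorch: two small CNN branches encode a face crop and an audio spectrogram, and their embeddings are concatenated before classification. The layer sizes, input shapes, and number of emotion classes are illustrative assumptions, not the architecture reported in the paper.

```python
# Sketch of late audio-visual fusion for emotion recognition (illustrative only).
import torch
import torch.nn as nn

class BranchCNN(nn.Module):
    """Small convolutional encoder used by both modalities."""
    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (B, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.proj(self.features(x).flatten(1))

class AudioVisualEmotionNet(nn.Module):
    """Concatenates face and spectrogram embeddings before classification."""
    def __init__(self, num_emotions: int = 7, embed_dim: int = 128):
        super().__init__()
        self.visual = BranchCNN(in_channels=3, embed_dim=embed_dim)   # RGB face crop
        self.audio = BranchCNN(in_channels=1, embed_dim=embed_dim)    # log-spectrogram
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, face, spectrogram):
        fused = torch.cat([self.visual(face), self.audio(spectrogram)], dim=1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = AudioVisualEmotionNet()
    face = torch.randn(4, 3, 64, 64)          # batch of face crops (assumed shape)
    spec = torch.randn(4, 1, 64, 128)         # batch of spectrograms (assumed shape)
    print(model(face, spec).shape)            # -> torch.Size([4, 7])
```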

    Analysis of constant-Q filterbank based representations for speech emotion recognition

    This work analyzes constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). A constant-Q filterbank provides a non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how the increased low-frequency resolution benefits SER. A time-domain comparative analysis between short-term mel-frequency spectral coefficients (MFSCs) and constant-Q filterbank-based features, namely the constant-Q transform (CQT) and the continuous wavelet transform (CWT), reveals that constant-Q representations provide higher time-invariance at low frequencies. This provides increased robustness against emotion-irrelevant temporal variations in pitch, especially for low-arousal emotions. The corresponding frequency-domain analysis over different emotion classes shows better resolution of pitch harmonics in constant-Q-based time-frequency representations than in MFSCs. These advantages of constant-Q representations are further consolidated by SER performance in an extensive evaluation of the features over four publicly available databases with six advanced deep neural network architectures as back-end classifiers. Our inferences in this study hint at the suitability and potential of constant-Q features for SER.
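    A brief sketch of how the two front-ends compared in this abstract could be extracted with librosa: a log-magnitude constant-Q spectrogram (CQT) and log mel-frequency spectral coefficients (MFSC). The hop length, bin counts, and example audio are illustrative choices, not the exact settings evaluated in this work.

```python
# Illustrative extraction of CQT and MFSC-style features for an SER front-end.
import numpy as np
import librosa

def cqt_features(y, sr, hop_length=256, bins_per_octave=24, n_octaves=7):
    """Log-magnitude constant-Q spectrogram: finer resolution at low frequencies."""
    C = librosa.cqt(y, sr=sr, hop_length=hop_length,
                    n_bins=bins_per_octave * n_octaves,
                    bins_per_octave=bins_per_octave)
    return librosa.amplitude_to_db(np.abs(C), ref=np.max)

def mel_features(y, sr, hop_length=256, n_mels=80):
    """Log mel-frequency spectral coefficients (MFSC) for comparison."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

if __name__ == "__main__":
    # Uses librosa's example clip (fetched on first use); a real SER pipeline would
    # instead load utterances from one of the emotion databases mentioned above.
    y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
    print("CQT:", cqt_features(y, sr).shape, "MFSC:", mel_features(y, sr).shape)
```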

    Effective multi-modal conversational recommendation

    Conversational recommender systems have recently received much attention for addressing the information asymmetry problem in information seeking, by eliciting the dynamic preferences of users and taking actions based on their current needs through multi-turn, closed-loop interactions. Despite recent advances in uni-modal conversational recommender systems that use only natural-language interfaces for recommendations, leveraging both visual and textual information effectively for multi-modal conversational recommender systems has not yet been fully researched. In particular, multi-modal conversational recommender systems are expected to leverage multi-modal information (such as the natural-language feedback of users and textual/visual representations of recommendation items) during the communication between users and recommender systems. In this thesis, we aim to effectively track and estimate the users’ dynamic preferences from multi-modal conversational recommendations (in particular with vision-and-language-based interactions), so as to develop realistic and effective multi-modal conversational recommender systems.

    In particular, we are motivated to answer the following questions: (1) how to better understand the users’ natural-language feedback and the corresponding recommendations given the partial observability of the users’ preferences over time; (2) how to better track the users’ preferences over the sequences of the system’s visual recommendations and the users’ natural-language feedback; (3) how to decouple the recommendation policy (i.e. model) optimisation from the multi-modal composition representation learning; (4) how to effectively incorporate the users’ long-term and short-term interests for both cold-start and warm-start users; (5) how to ensure the realism of simulated conversations, such as positive/negative natural-language feedback. To address these five challenges, we propose to leverage recent advanced techniques (including multi-modal learning, deep learning, and reinforcement learning) to re-frame and develop more effective multi-modal conversational recommender systems. In particular, we introduce the framework of the multi-modal conversational recommendation task with cold-start or warm-start users, as well as how to measure the success of the task. Note that we also refer to multi-modal conversational recommendation as dialog-based interactive recommendation or multi-modal interactive recommendation throughout this thesis.

    The first challenge refers to the partial observability in natural-language feedback. For example, the users’ feedback, which takes the form of natural-language critiques about the displayed recommendation at each iteration, only allows the recommender system to obtain a partial portrayal of the users’ preferences. To alleviate this partial observation issue, we propose a novel dialog-based recommendation model, the Estimator-Generator-Evaluator (EGE) model, which uses Q-learning for a partially observable Markov decision process (POMDP), to effectively incorporate the users’ preferences over time. Specifically, we leverage an Estimator to track and estimate users’ preferences, a Generator to match the estimated preferences with the candidate items so as to rank the next recommendations, and an Evaluator to judge the quality of the estimated preferences considering the users’ historical feedback.

    The second challenge refers to the multi-modal sequence dependency issue in multi-modal dialog state tracking.
For instance, multi-modal dialog sequences (i.e. turns consisting of the system’s visual recommendations and the user’s natural-language feedback) make it challenging to correctly incorporate the users’ preferences across multiple turns. Indeed, the existing formulations of interactive recommender systems suffer from their inability to capture the multi-modal sequential dependencies of textual feedback and visual recommendations because of their use of recurrent neural network-based (i.e. RNN-based) or transformer-based models. To alleviate the multi-modal sequence dependency issue, we propose a novel multi-modal recurrent attention network (MMRAN) model to effectively incorporate the users’ preferences over the long visual dialog sequences of the users’ natural-language feedback and the system’s visual recommendations.

    The third challenge refers to the coupling of policy (i.e. recommendation model) optimisation and representation learning. For example, it is typically challenging and unstable to optimise a recommendation agent to improve the recommendation quality while implicitly learning the multi-modal representations in an end-to-end fashion with deep reinforcement learning (DRL). To address this coupling issue, we propose a novel goal-oriented multi-modal interactive recommendation model (GOMMIR) that uses both verbal and non-verbal relevance feedback to effectively incorporate the users’ preferences over time. Specifically, our GOMMIR model employs a multi-task learning approach (using goal-oriented reinforcement learning (GORL)) to explicitly learn the multi-modal representations with a multi-modal composition network when optimising the recommendation agent.

    The fourth challenge refers to personalisation for cold-start and warm-start users. For instance, it can be challenging to make satisfactory personalised recommendations across multiple interactions due to the difficulty of balancing the users’ past interests and current needs when generating the users’ state (i.e. current preferences) representations over time. To perform personalisation for both cold-start and warm-start users, we propose a novel personalised multi-modal interactive recommendation model (PMMIR) that uses hierarchical reinforcement learning (HRL) to more effectively incorporate the users’ preferences from both their past and real-time interactions.

    The final challenge refers to the realism of simulated conversations. In a real-world shopping scenario, users can express natural-language feedback when communicating with a shopping assistant by stating their satisfaction positively with “I like” or negatively with “I dislike” according to the quality of the recommended fashion products. A multi-modal conversational recommender system (using text and images in particular) aims to replicate this process by eliciting the dynamic preferences of users from their natural-language feedback and updating the visual recommendations so as to satisfy the users’ current needs through multi-turn interactions. However, the impact of positive and negative natural-language feedback on the effectiveness of multi-modal conversational recommendation has not yet been fully explored.
To further explore multi-modal conversational recommendation with positive and negative natural-language feedback, we investigate how effectively recent multi-modal conversational recommendation models incorporate the users’ preferences over time from both positively and negatively oriented natural-language feedback on the visual recommendations.

    Overall, we contribute an effective multi-modal conversational recommendation framework that makes accurate recommendations by leveraging visual and textual information. This framework includes models for tracking users’ preferences with partial observations, mitigating the multi-modal sequence dependency issue, decoupling the composition representation learning from policy optimisation, incorporating both the users’ long-term preferences and short-term needs for personalisation, and ensuring the realism of simulated conversations. These contributions advance the development of multi-modal conversational recommendation techniques and could inspire future research directions in recommender systems.
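    A hedged sketch of the Estimator-Generator-Evaluator (EGE) decomposition named in the abstract above, written in PyTorch: an Estimator updates a preference state from the displayed item and the encoded natural-language critique, a Generator ranks candidate items against that state, and an Evaluator scores the state as a Q-value. The GRU-based update and all dimensions are illustrative assumptions rather than the thesis' exact formulation.

```python
# Illustrative EGE-style interaction turn (assumed sizes; not the thesis' exact model).
import torch
import torch.nn as nn

class Estimator(nn.Module):
    """Tracks the partially observed user preference state across turns."""
    def __init__(self, item_dim=64, text_dim=64, state_dim=128):
        super().__init__()
        self.gru = nn.GRUCell(item_dim + text_dim, state_dim)

    def forward(self, state, item_emb, feedback_emb):
        return self.gru(torch.cat([item_emb, feedback_emb], dim=-1), state)

class Generator(nn.Module):
    """Ranks candidate items by similarity to the estimated preference state."""
    def __init__(self, state_dim=128, item_dim=64):
        super().__init__()
        self.proj = nn.Linear(state_dim, item_dim)

    def forward(self, state, candidates):          # candidates: (N, item_dim)
        query = self.proj(state)                   # (B, item_dim)
        return query @ candidates.t()              # (B, N) ranking scores

class Evaluator(nn.Module):
    """Q-value head judging the quality of the estimated preference state."""
    def __init__(self, state_dim=128):
        super().__init__()
        self.q = nn.Linear(state_dim, 1)

    def forward(self, state):
        return self.q(state)

if __name__ == "__main__":
    est, gen, ev = Estimator(), Generator(), Evaluator()
    state = torch.zeros(1, 128)                    # initial preference state
    item_emb = torch.randn(1, 64)                  # displayed recommendation embedding
    feedback_emb = torch.randn(1, 64)              # encoded natural-language critique
    candidates = torch.randn(100, 64)              # candidate item embeddings
    state = est(state, item_emb, feedback_emb)     # update estimated preferences
    scores = gen(state, candidates)                # rank the next recommendations
    print(scores.shape, ev(state).item())          # (1, 100) scores and a Q-value
```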