1,343 research outputs found

    A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews

    Despite the recent advances in opinion mining for written reviews, few works have tackled the problem on other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine the aspects of the item under review that are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video, and language transcriptions of its contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently provides increased performance over text-only baselines, providing evidence that these extra modalities are key to better understanding video reviews.
    Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 2020
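
    The fusion strategy described above can be illustrated with a minimal sketch: per-sentence feature vectors from the text, audio, and video streams are concatenated and fed to two classification heads, one for the aspect being discussed and one for its sentiment orientation. The feature dimensions, layer sizes, and class counts below are illustrative assumptions, not values from the paper.

        # Minimal sketch of sentence-level multi-modal fusion (PyTorch).
        # All dimensions and head sizes are assumed for illustration.
        import torch
        import torch.nn as nn

        class MultiModalSentenceClassifier(nn.Module):
            def __init__(self, text_dim=768, audio_dim=128, video_dim=256,
                         hidden=256, n_aspects=10, n_polarities=3):
                super().__init__()
                # Fuse the three modalities by concatenation, then project.
                self.fuse = nn.Sequential(
                    nn.Linear(text_dim + audio_dim + video_dim, hidden),
                    nn.ReLU(),
                    nn.Dropout(0.3),
                )
                self.aspect_head = nn.Linear(hidden, n_aspects)        # which aspect is discussed
                self.sentiment_head = nn.Linear(hidden, n_polarities)  # orientation towards it

            def forward(self, text_feats, audio_feats, video_feats):
                h = self.fuse(torch.cat([text_feats, audio_feats, video_feats], dim=-1))
                return self.aspect_head(h), self.sentiment_head(h)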

    Exploring Human attitude during Human-Robot Interaction

    The aim of this work is to provide an automatic analysis that assesses the user's attitude when interacting with a companion robot. In particular, our work focuses on defining which combination of social cues the robot should recognize, and how, in order to stimulate the ongoing conversation. The analysis is performed on video recordings of 9 elderly users. From each video, low-level descriptors of the user's behavior are extracted with open-source automatic tools that provide information on the voice, the body posture, and the face landmarks. The assessment of 3 types of attitude (neutral, positive, and negative) is performed with 3 machine learning algorithms: k-nearest neighbors, random decision forest, and support vector regression. Since intra- and inter-subject variability could affect the results of the assessment, this work shows the robustness of the classification models in both scenarios. Further analysis is performed on the type of representation used to describe the attitude: both a raw and an auto-encoded representation are applied to the descriptors. The results of the attitude assessment show high accuracy (>0.85) for both unimodal and multimodal data. The outcome of this work can be integrated into a robotic platform to automatically assess the quality of interaction and to modify the robot's behavior accordingly.
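
    A minimal sketch of this model comparison is given below, using scikit-learn stand-ins for the three algorithms (an SVC classifier is used here in place of the support vector regression variant) and cross-validation to probe robustness; the descriptor matrix and labels are random placeholders, since the actual voice, posture, and face-landmark features come from the open-source extraction tools mentioned above.

        # Sketch of the attitude-classification comparison (scikit-learn).
        # X and y are random placeholders for the extracted descriptors.
        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.random((90, 64))        # placeholder low-level descriptors
        y = rng.integers(0, 3, 90)      # 0: neutral, 1: positive, 2: negative

        models = {
            "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
            "random decision forest": RandomForestClassifier(n_estimators=100, random_state=0),
            "support vector machine": SVC(kernel="rbf"),
        }
        for name, model in models.items():
            scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
            print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")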

    Transformer-based Non-Verbal Emotion Recognition: Exploring Model Portability across Speakers' Genders

    Recognizing emotions in non-verbal audio tracks requires a deep understanding of their underlying features. Traditional classifiers relying on excitation, prosodic, and vocal tract features are not always capable of generalizing effectively across speakers' genders. In the ComParE 2022 vocalisation sub-challenge we explore the use of a Transformer architecture trained on contrastive audio examples. We leverage augmented data to learn robust non-verbal emotion classifiers. We also investigate the impact of different audio transformations, including neural voice conversion, on the classifier's ability to generalize across speakers' genders. The empirical findings indicate that neural voice conversion is beneficial in the pretraining phase, yielding improved model generality, whereas it is harmful at the fine-tuning stage, as it hinders model specialization for the task of non-verbal emotion recognition.
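
    The contrastive pretraining idea can be sketched as follows: two augmented views of the same audio clip (for instance, one of them passed through voice conversion) are encoded and pulled together by a contrastive loss, while views of different clips are pushed apart. The NT-Xent formulation below is a common choice and an assumption on our part, not necessarily the exact objective used in the paper.

        # Illustrative NT-Xent contrastive loss over paired audio embeddings.
        # z1[i] and z2[i] are embeddings of two augmentations of the same clip.
        import torch
        import torch.nn.functional as F

        def nt_xent_loss(z1, z2, temperature=0.1):
            z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
            z = torch.cat([z1, z2], dim=0)              # (2N, d)
            sim = z @ z.t() / temperature               # scaled cosine similarities
            n = z1.size(0)
            mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
            sim = sim.masked_fill(mask, float("-inf"))  # ignore self-similarity
            # The positive for row i is its counterpart in the other view.
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
            return F.cross_entropy(sim, targets)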

    The Emotional Impact of Audio-Visual Stimuli

    Induced affect is the emotional effect of an object on an individual. It can be quantified through two metrics: valence and arousal. Valence quantifies how positive or negative something is, while arousal quantifies the intensity from calm to exciting. These metrics enable researchers to study how people opine on various topics. Affective content analysis of visual media is a challenging problem due to differences in perceived reactions. Industry-standard machine learning classifiers such as Support Vector Machines can be used to help determine user affect. The best affect-annotated video datasets are often analyzed by feeding large amounts of visual and audio features through machine-learning algorithms, with the goal of maximizing accuracy in the hope that each feature brings useful information to the table. We depart from this approach to quantify how different modalities, such as visual, audio, and text-description information, can aid in understanding affect. To that end, we train independent models for the visual, audio, and text-description modalities; each is a convolutional neural network paired with a support vector machine to classify valence and arousal. We also train various ensemble models that combine multi-modal information in the hope that the information from independent modalities benefits each other. We find that our visual network alone achieves state-of-the-art valence classification accuracy and that our audio network, when paired with the visual one, achieves competitive results on arousal classification. Each network is much stronger on one metric than the other, which may lead to more sophisticated multimodal approaches to accurately identifying affect in video data. This work also contributes to induced emotion classification by augmenting existing sizable media datasets and providing a robust framework for classifying it.
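
    The per-modality pipeline described above, a convolutional network used as a feature extractor feeding a support vector machine, can be sketched as follows for the visual stream. The specific backbone (ResNet-18), the linear SVM, and the placeholder data are illustrative assumptions; an ensemble would combine the decision scores of the visual, audio, and text-description models.

        # Sketch of one modality's CNN-features + SVM pipeline (visual stream).
        # Backbone choice and placeholder data are assumptions for illustration.
        import numpy as np
        import torch
        import torchvision.models as models
        from sklearn.svm import LinearSVC

        # Pretrained CNN backbone with its classification head removed.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = torch.nn.Identity()
        backbone.eval()

        def extract_features(frames):
            """frames: (N, 3, 224, 224) float tensor -> (N, 512) feature array."""
            with torch.no_grad():
                return backbone(frames).numpy()

        # Placeholder clips and binary valence labels (low/high).
        frames = torch.rand(8, 3, 224, 224)
        labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
        valence_svm = LinearSVC().fit(extract_features(frames), labels)
        # A late-fusion ensemble would average or vote over the per-modality
        # SVM decision scores for valence and arousal.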