Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
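As a rough illustration of the fusion-and-classification stage described in this abstract (not the authors' implementation), the sketch below concatenates placeholder mid-level audio, visual, and motion features and trains a multi-class SVM to predict one of the four VA quadrants; all feature arrays, dimensions, and labels are invented for the example.

```python
# Minimal sketch: feature-level fusion of learned audio/visual mid-level
# features with dense-trajectory motion features, classified into the four
# Valence-Arousal quadrants by a multi-class SVM. Upstream CNN feature
# extraction (MFCC / HSV frames) is assumed; placeholders stand in for it.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_clips = 120
audio_feat = rng.normal(size=(n_clips, 128))   # CNN features learned from MFCC
visual_feat = rng.normal(size=(n_clips, 128))  # CNN features learned from HSV frames
motion_feat = rng.normal(size=(n_clips, 64))   # dense-trajectory motion descriptors
labels = rng.integers(0, 4, size=n_clips)      # four VA quadrants

# Early fusion: concatenate the per-clip modality features.
fused = np.hstack([audio_feat, visual_feat, motion_feat])

X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", decision_function_shape="ovr"))
clf.fit(X_tr, y_tr)
print("quadrant accuracy:", clf.score(X_te, y_te))
```

A decision-level variant of the same idea would train one SVM per modality and combine their outputs instead of concatenating features.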
Text-based Emotion Aware Recommender
We apply the concept of users' emotion vectors (UVECs) and movies' emotion vectors (MVECs) as building components of an Emotion Aware Recommender System. We built a comparative platform that consists of five recommenders based on content-based and collaborative filtering algorithms. We employed a Tweets Affective Classifier to derive movies' emotion profiles from movie overviews, and constructed MVECs from these profiles. We tracked users' movie-watching history to formulate UVECs by taking the average of the MVECs of all the movies a user has watched. With the MVECs, we built an Emotion Aware Recommender as one of the comparative platform's algorithms. We evaluated the top-N recommendation lists generated by these recommenders and found that the top-N list of the Emotion Aware Recommender yielded serendipitous recommendations.

Comment: 13 pages, 8 tables, International Conference on Natural Language Computing and AI (NLCAI2020), July 25-26, London, United Kingdom
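The UVEC/MVEC construction lends itself to a very small sketch: a user's vector is the mean of the emotion vectors of the movies they have watched, and candidate movies can be ranked by similarity to it. The emotion label set and random profiles below are assumptions for illustration, not outputs of the Tweets Affective Classifier.

```python
# Illustrative sketch of the UVEC/MVEC idea from the abstract.
import numpy as np

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise"]  # assumed label set

movie_mvecs = {f"movie_{i}": np.random.rand(len(EMOTIONS)) for i in range(50)}
watched = ["movie_3", "movie_17", "movie_42"]

# UVEC: the average of the MVECs of all movies the user has watched.
uvec = np.mean([movie_mvecs[m] for m in watched], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank unseen movies by emotional similarity to the user's profile (top-N list).
candidates = [m for m in movie_mvecs if m not in watched]
top_n = sorted(candidates, key=lambda m: cosine(uvec, movie_mvecs[m]), reverse=True)[:10]
print(top_n)
```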
MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild
Dynamic facial expression recognition (FER) databases provide important data support for affective computing and its applications. However, most FER databases are annotated with a few basic, mutually exclusive emotional categories and contain only one modality, e.g., videos. Such monotonous labels and single modalities cannot accurately reflect human emotions or meet the needs of real-world applications. In this paper, we propose MAFW, a large-scale multi-modal compound affective database with 10,045 video-audio clips in the wild. Each clip is annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of 11 widely-used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment. To ensure high label quality, we filter out unreliable annotations with an Expectation Maximization (EM) algorithm, obtaining 11 single-label emotion categories and 32 multi-label emotion categories. To the best of our knowledge, MAFW is the first in-the-wild multi-modal database annotated with compound emotion annotations and emotion-related captions. Additionally, we propose a novel Transformer-based expression snippet feature learning method that recognizes compound emotions by leveraging the expression-change relations among different emotions and modalities. Extensive experiments on the MAFW database show the advantages of the proposed method over other state-of-the-art methods for both uni- and multi-modal FER. Our MAFW database is publicly available at https://mafw-database.github.io/MAFW.

Comment: This paper has been accepted by ACM MM'22
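To make the compound-annotation setup concrete, here is a hedged sketch of how multi-label targets over the 11 MAFW emotion categories might be encoded and trained with a per-class sigmoid loss; the clips and label combinations are invented, and neither the EM-based filtering nor the Transformer-based snippet model from the paper is reproduced.

```python
# Sketch: encoding compound (multi-label) emotion annotations as multi-hot
# targets over the 11 MAFW categories and training with a sigmoid loss.
import torch

MAFW_EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness",
                 "surprise", "contempt", "anxiety", "helplessness", "disappointment"]
idx = {e: i for i, e in enumerate(MAFW_EMOTIONS)}

def multi_hot(labels):
    """Encode a (possibly compound) emotion annotation as an 11-dim multi-hot vector."""
    t = torch.zeros(len(MAFW_EMOTIONS))
    for lab in labels:
        t[idx[lab]] = 1.0
    return t

single = multi_hot(["sadness"])                     # single-label clip (invented)
compound = multi_hot(["anger", "disappointment"])   # compound-label clip (invented)

# Multi-label recognition is commonly trained with per-class binary cross-entropy.
logits = torch.randn(2, len(MAFW_EMOTIONS))         # placeholder model outputs
targets = torch.stack([single, compound])
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```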
Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interaction. Existing works are limited to trimmed video-level emotion classification, failing to locate the temporal window corresponding to the emotion. In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) the emotions have extremely varied temporal dynamics; 2) the emotion cues are embedded in both appearances and complex plots; 3) the fine-grained temporal annotations are complicated and labor-intensive to obtain. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves complex plot understanding by reasoning about the dependencies between the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitles to achieve weakly-supervised learning. We contribute a new testing set with 3,000 manually annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.

Comment: Accepted by ACM Multimedia 2022
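The coarse stream's multi-granularity temporal context idea can be illustrated, in spirit only, with a stack of parallel dilated 1-D convolutions over video segment features; the module below is a generic sketch rather than the authors' dilated context integrated network, and the dilation rates and feature sizes are assumptions.

```python
# Hedged sketch of multi-granularity temporal context via dilated 1-D convolutions.
import torch
import torch.nn as nn

class DilatedTemporalContext(nn.Module):
    def __init__(self, dim: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One branch per dilation rate -> one temporal granularity per branch.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations
        )
        self.proj = nn.Conv1d(dim * len(dilations), dim, kernel_size=1)

    def forward(self, x):                 # x: (batch, time, dim)
        x = x.transpose(1, 2)             # -> (batch, dim, time)
        ctx = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.proj(ctx).transpose(1, 2)  # back to (batch, time, dim)

feats = torch.randn(2, 100, 256)           # 100 video segments, 256-dim features
print(DilatedTemporalContext(256)(feats).shape)  # torch.Size([2, 100, 256])
```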
The Emotional Impact of Audio-Visual Stimuli
Induced affect is the emotional effect of an object on an individual. It can be quantified through two metrics: valence and arousal. Valence quantifies how positive or negative something is, while arousal quantifies the intensity from calm to exciting. These metrics enable researchers to study how people opine on various topics. Affective content analysis of visual media is a challenging problem due to differences in perceived reactions. Industry-standard machine learning classifiers such as Support Vector Machines can be used to help determine user affect. The best affect-annotated video datasets are often analyzed by feeding large amounts of visual and audio features through machine-learning algorithms. The goal is to maximize accuracy, with the hope that each feature will bring useful information to the table.
We depart from this approach to quantify how different modalities, such as visual, audio, and text description information, can aid in understanding affect. To that end, we train independent models for visual, audio, and text description. Each is a convolutional neural network paired with a support vector machine to classify valence and arousal. We also train various ensemble models that combine multi-modal information, with the hope that the information from the independent modalities will complement one another.
We find that our visual network alone achieves state-of-the-art valence classification accuracy and that our audio network, when paired with the visual network, achieves competitive results on arousal classification. Each network is much stronger on one metric than the other. This may lead to more sophisticated multimodal approaches to accurately identifying affect in video data. This work also contributes to induced emotion classification by augmenting existing sizable media datasets and providing a robust framework for the same classification task.
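A minimal sketch of the decision-level ensemble idea described above: one classifier per modality (plain SVMs standing in for the CNN+SVM pipelines), with predicted class probabilities averaged across modalities for a valence decision; the features, dimensions, and binary labels are placeholders, not the paper's data.

```python
# Sketch: late (decision-level) fusion of independent per-modality classifiers.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, size=n)  # placeholder binary valence labels (low / high)
modalities = {
    "visual": rng.normal(size=(n, 64)),
    "audio": rng.normal(size=(n, 32)),
    "text": rng.normal(size=(n, 48)),
}

train, test = np.arange(0, 150), np.arange(150, n)
probs = []
for name, X in modalities.items():
    clf = SVC(probability=True).fit(X[train], y[train])  # one model per modality
    probs.append(clf.predict_proba(X[test]))

# Decision-level fusion: average the per-modality probabilities, then take argmax.
fused_pred = np.mean(probs, axis=0).argmax(axis=1)
print("fused valence accuracy:", (fused_pred == y[test]).mean())
```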