14,990 research outputs found
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently from the chosen representation
Affect Recognition in Ads with Application to Computational Advertising
Advertisements (ads) often include strongly emotional content to leave a
lasting impression on the viewer. This work (i) compiles an affective ad
dataset capable of evoking coherent emotions across users, as determined from
the affective opinions of five experts and 14 annotators; (ii) explores the
efficacy of convolutional neural network (CNN) features for encoding emotions,
and observes that CNN features outperform low-level audio-visual emotion
descriptors upon extensive experimentation; and (iii) demonstrates how enhanced
affect prediction facilitates computational advertising, and leads to better
viewing experience while watching an online video stream embedded with ads
based on a study involving 17 users. We model ad emotions based on subjective
human opinions as well as objective multimodal features, and show how
effectively modeling ad emotions can positively impact a real-life application.Comment: Accepted at the ACM International Conference on Multimedia (ACM MM)
201
Generating Labels for Regression of Subjective Constructs using Triplet Embeddings
Human annotations serve an important role in computational models where the
target constructs under study are hidden, such as dimensions of affect. This is
especially relevant in machine learning, where subjective labels derived from
related observable signals (e.g., audio, video, text) are needed to support
model training and testing. Current research trends focus on correcting
artifacts and biases introduced by annotators during the annotation process
while fusing them into a single annotation. In this work, we propose a novel
annotation approach using triplet embeddings. By lifting the absolute
annotation process to relative annotations where the annotator compares
individual target constructs in triplets, we leverage the accuracy of
comparisons over absolute ratings by human annotators. We then build a
1-dimensional embedding in Euclidean space that is indexed in time and serves
as a label for regression. In this setting, the annotation fusion occurs
naturally as a union of sets of sampled triplet comparisons among different
annotators. We show that by using our proposed sampling method to find an
embedding, we are able to accurately represent synthetic hidden constructs in
time under noisy sampling conditions. We further validate this approach using
human annotations collected from Mechanical Turk and show that we can recover
the underlying structure of the hidden construct up to bias and scaling
factors.Comment: 9 pages, 5 figures, accepted journal pape
Deep fusion of multi-channel neurophysiological signal for emotion recognition and monitoring
How to fuse multi-channel neurophysiological signals for emotion recognition is emerging as a hot research topic in community of Computational Psychophysiology. Nevertheless, prior feature engineering based approaches require extracting various domain knowledge related features at a high time cost. Moreover, traditional fusion method cannot fully utilise correlation information between different channels and frequency components. In this paper, we design a hybrid deep learning model, in which the 'Convolutional Neural Network (CNN)' is utilised for extracting task-related features, as well as mining inter-channel and inter-frequency correlation, besides, the 'Recurrent Neural Network (RNN)' is concatenated for integrating contextual information from the frame cube sequence. Experiments are carried out in a trial-level emotion recognition task, on the DEAP benchmarking dataset. Experimental results demonstrate that the proposed framework outperforms the classical methods, with regard to both of the emotional dimensions of Valence and Arousal
Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition
This paper studies the emotion recognition from musical tracks in the
2-dimensional valence-arousal (V-A) emotional space. We propose a method based
on convolutional (CNN) and recurrent neural networks (RNN), having
significantly fewer parameters compared with the state-of-the-art method for
the same task. We utilize one CNN layer followed by two branches of RNNs
trained separately for arousal and valence. The method was evaluated using the
'MediaEval2015 emotion in music' dataset. We achieved an RMSE of 0.202 for
arousal and 0.268 for valence, which is the best result reported on this
dataset.Comment: Accepted for Sound and Music Computing (SMC 2017
Multimodal Content Analysis for Effective Advertisements on YouTube
The rapid advances in e-commerce and Web 2.0 technologies have greatly
increased the impact of commercial advertisements on the general public. As a
key enabling technology, a multitude of recommender systems exists which
analyzes user features and browsing patterns to recommend appealing
advertisements to users. In this work, we seek to study the characteristics or
attributes that characterize an effective advertisement and recommend a useful
set of features to aid the designing and production processes of commercial
advertisements. We analyze the temporal patterns from multimedia content of
advertisement videos including auditory, visual and textual components, and
study their individual roles and synergies in the success of an advertisement.
The objective of this work is then to measure the effectiveness of an
advertisement, and to recommend a useful set of features to advertisement
designers to make it more successful and approachable to users. Our proposed
framework employs the signal processing technique of cross modality feature
learning where data streams from different components are employed to train
separate neural network models and are then fused together to learn a shared
representation. Subsequently, a neural network model trained on this joint
feature embedding representation is utilized as a classifier to predict
advertisement effectiveness. We validate our approach using subjective ratings
from a dedicated user study, the sentiment strength of online viewer comments,
and a viewer opinion metric of the ratio of the Likes and Views received by
each advertisement from an online platform.Comment: 11 pages, 5 figures, ICDM 201
- …