Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text
Speech emotion recognition is challenging, yet it is an important step towards more natural human-computer interaction (HCI). A popular approach is multimodal emotion recognition based on model-level fusion, in which each modality's signal is encoded into an embedding and the embeddings are then concatenated for the final classification. However, due to noise or other factors, the individual modalities do not always point to the same emotional category, which hurts a model's generalization. In this paper, we propose a novel regularization method based on contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator that distinguishes pairs carrying the same emotion from pairs carrying different emotions, we explicitly constrain the latent code of each modality to contain the same emotional information, reducing noise interference and yielding more discriminative representations. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show significant improvements of 1.44% in weighted accuracy (WA) and 1.53% in unweighted accuracy (UA) over the baseline system.
Comment: Completed in October 2020 and submitted to ICASSP202
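The abstract gives no implementation details, but the following is a minimal PyTorch sketch of the general idea: two modality encoders, model-level fusion by concatenation, and a pair discriminator trained to tell same-emotion (audio, text) pairs from different-emotion ones, used as a regularizer alongside the classification loss. The layer sizes, pairing strategy, and loss weight are illustrative assumptions, not the authors' configuration.

import torch
import torch.nn as nn

class MultimodalEmotionModel(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, hidden=128, n_classes=4):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Model-level fusion: classify over the concatenated embeddings.
        self.classifier = nn.Linear(2 * hidden, n_classes)
        # Discriminator: does an (audio, text) embedding pair carry the same emotion?
        self.discriminator = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio, text):
        za, zt = self.audio_enc(audio), self.text_enc(text)
        logits = self.classifier(torch.cat([za, zt], dim=-1))
        return logits, za, zt

def contrastive_regularizer(model, za, zt, labels):
    """Pair each audio embedding with a (shuffled) text embedding; pairs drawn
    from the same emotion class are labelled 1, mismatched pairs 0."""
    perm = torch.randperm(za.size(0))
    pair_label = (labels == labels[perm]).float()
    score = model.discriminator(torch.cat([za, zt[perm]], dim=-1)).squeeze(-1)
    return nn.functional.binary_cross_entropy_with_logits(score, pair_label)

# Usage: total loss = cross-entropy on the fused logits + a weighted regularizer.
audio, text = torch.randn(8, 128), torch.randn(8, 300)
labels = torch.randint(0, 4, (8,))
model = MultimodalEmotionModel()
logits, za, zt = model(audio, text)
loss = nn.functional.cross_entropy(logits, labels) \
       + 0.1 * contrastive_regularizer(model, za, zt, labels)
loss.backward()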
Fusing Audio, Textual and Visual Features for Sentiment Analysis of News Videos
This paper presents a novel approach to sentiment analysis of news videos based on the fusion of audio, textual and visual clues extracted from their contents. The proposed approach aims to contribute to the semiodiscoursive study of how the ethos (identity) of this media universe, which has become a central part of the modern-day lives of millions of people, is constructed. To achieve this goal, we apply state-of-the-art computational methods for (1) automatic emotion recognition from facial expressions, (2) extraction of modulations in the participants' speech and (3) sentiment analysis of the closed captions associated with the videos of interest. More specifically, we compute features such as the visual intensities of recognized emotions, the field sizes of participants, voicing probability, sound loudness, speech fundamental frequencies and the sentiment scores (polarities) of the sentences in the closed captions. Experimental results on a dataset of 520 annotated news videos from three Brazilian and one American popular TV newscasts show that our approach achieves an accuracy of up to 84% in the sentiment (tension level) classification task, demonstrating its potential for use by media analysts in several applications, especially in the journalistic domain.
Comment: 5 pages, 1 figure, International AAAI Conference on Web and Social Media
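As a rough illustration of the feature-level fusion described above, here is a minimal sketch: hand-crafted per-modality features (facial emotion intensities, voicing probability, loudness, fundamental-frequency statistics, caption sentiment polarity) are concatenated into one vector per news segment and fed to an off-the-shelf classifier. The feature dimensions, the three tension levels, and the choice of an SVM are assumptions made for illustration, not the paper's exact pipeline.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(visual_emotions, voicing_prob, loudness, f0_stats, text_polarity):
    """Concatenate per-modality features for one news segment into a single vector."""
    return np.concatenate([visual_emotions, [voicing_prob, loudness], f0_stats, [text_polarity]])

# Toy data: 100 segments, each with 7 visual emotion intensities, 2 acoustic
# scalars, 3 F0 statistics (e.g. mean/std/range) and 1 caption polarity score.
rng = np.random.default_rng(0)
X = np.stack([
    fuse_features(rng.random(7), rng.random(), rng.random(), rng.random(3), rng.uniform(-1, 1))
    for _ in range(100)
])
y = rng.integers(0, 3, size=100)  # hypothetical low / medium / high tension labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))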
The Verbal and Non Verbal Signals of Depression -- Combining Acoustics, Text and Visuals for Estimating Depression Level
Depression is a serious medical condition that affects a large number of people around the world. It significantly changes the way one feels, causing a persistent lowering of mood. In this paper, we propose a novel attention-based deep neural network that facilitates the fusion of multiple modalities, and we use this network to regress the depression level. Acoustic, text and visual modalities are used to train the proposed network. Various experiments have been carried out on the benchmark dataset, the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ). From the results, we empirically show that fusing all three modalities gives the most accurate estimate of the depression level. Our proposed approach outperforms the state-of-the-art by 7.17% on root mean squared error (RMSE) and 8.08% on mean absolute error (MAE).
Comment: 10 pages including references, 2 figures
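The abstract only outlines the architecture, but a minimal sketch of attention-based fusion of the three modality embeddings followed by a regression head might look like the following. The embedding sizes, the form of the attention (a learned softmax over the modalities) and the PHQ-8-style target range are assumptions for illustration, not the authors' network.

import torch
import torch.nn as nn

class AttentionFusionRegressor(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Scores each modality embedding; softmax over modalities gives fusion weights.
        self.attn = nn.Linear(dim, 1)
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, modality_embeddings):
        # modality_embeddings: (batch, n_modalities, dim)
        scores = self.attn(modality_embeddings)             # (batch, n_modalities, 1)
        weights = torch.softmax(scores, dim=1)              # attention over modalities
        fused = (weights * modality_embeddings).sum(dim=1)  # (batch, dim)
        return self.head(fused).squeeze(-1)                 # predicted depression level

# Usage with dummy acoustic / text / visual embeddings for a batch of 4 interviews.
acoustic, text, visual = (torch.randn(4, 128) for _ in range(3))
model = AttentionFusionRegressor()
pred = model(torch.stack([acoustic, text, visual], dim=1))
target = torch.rand(4) * 24  # hypothetical depression scores, e.g. a 0-24 scale
loss = nn.functional.mse_loss(pred, target)
loss.backward()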