Conversational Emotion Analysis via Attention Mechanisms
Unlike emotion recognition in individual utterances, we propose a multimodal
learning framework that exploits the relations and dependencies among
utterances for conversational emotion analysis. An attention mechanism is
applied to fuse the acoustic and lexical features, and the fused
representations are then fed into a self-attention-based bi-directional gated
recurrent unit (GRU) layer to capture long-term contextual information. To
imitate the real interaction patterns of different speakers, speaker embeddings
are also utilized as additional inputs to distinguish speaker identities within
a dialog. To verify the effectiveness of the proposed method, we conduct
experiments on the IEMOCAP database. Experimental results demonstrate that our
method achieves an absolute 2.42% performance improvement over state-of-the-art
strategies.
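As a rough illustration of the pipeline described above, the following PyTorch sketch wires together attention-based acoustic-lexical fusion, speaker embeddings, a bi-directional GRU, and utterance-level self-attention. All dimensions, the module ordering, and the class count are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline (not the paper's code).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse acoustic and lexical utterance features with learned attention weights."""
    def __init__(self, acoustic_dim, lexical_dim, hidden_dim):
        super().__init__()
        self.proj_a = nn.Linear(acoustic_dim, hidden_dim)
        self.proj_l = nn.Linear(lexical_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, acoustic, lexical):            # (B, T, D_a), (B, T, D_l)
        a, l = self.proj_a(acoustic), self.proj_l(lexical)
        stacked = torch.stack([a, l], dim=2)          # (B, T, 2, H)
        weights = torch.softmax(self.score(stacked), dim=2)
        return (weights * stacked).sum(dim=2)         # (B, T, H) fused representation

class ConversationalEmotionModel(nn.Module):
    def __init__(self, acoustic_dim=100, lexical_dim=300, hidden_dim=128,
                 num_speakers=10, num_classes=4):     # assumed sizes, for illustration only
        super().__init__()
        self.fusion = AttentionFusion(acoustic_dim, lexical_dim, hidden_dim)
        self.speaker_emb = nn.Embedding(num_speakers, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.self_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, acoustic, lexical, speaker_ids):   # speaker_ids: (B, T)
        x = self.fusion(acoustic, lexical) + self.speaker_emb(speaker_ids)
        ctx, _ = self.gru(x)                              # long-term dialog context
        attn_out, _ = self.self_attn(ctx, ctx, ctx)       # self-attention over utterances
        return self.classifier(attn_out)                  # per-utterance emotion logits

# Dummy usage: a batch of 2 dialogs, each with 6 utterances.
logits = ConversationalEmotionModel()(torch.randn(2, 6, 100),
                                      torch.randn(2, 6, 300),
                                      torch.randint(0, 10, (2, 6)))
```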
MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition
Dynamic facial expression recognition (DFER) is essential to the development
of intelligent and empathetic machines. Prior efforts in this field mainly fall
into supervised learning paradigm, which is severely restricted by the limited
labeled data in existing datasets. Inspired by recent unprecedented success of
masked autoencoders (e.g., VideoMAE), this paper proposes MAE-DFER, a novel
self-supervised method which leverages large-scale self-supervised pre-training
on abundant unlabeled data to largely advance the development of DFER. Since
the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial
computation during fine-tuning, MAE-DFER develops an efficient local-global
interaction Transformer (LGI-Former) as the encoder. Moreover, in addition to
the standalone appearance content reconstruction in VideoMAE, MAE-DFER also
introduces explicit temporal facial motion modeling to encourage LGI-Former to
excavate both static appearance and dynamic motion information. Extensive
experiments on six datasets show that MAE-DFER consistently outperforms
state-of-the-art supervised methods by significant margins (e.g., +6.30% UAR
on DFEW and +8.34% UAR on MAFW), verifying that it can learn powerful dynamic
facial representations via large-scale self-supervised pre-training. Besides,
it has comparable or even better performance than VideoMAE, while largely
reducing the computational cost (to about 38% of the FLOPs). We believe MAE-DFER has
paved a new way for the advancement of DFER and can inspire more relevant
research in this field and even other related tasks. Codes and models are
publicly available at https://github.com/sunlicai/MAE-DFER.
Comment: ACM MM 2023 (camera ready).
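The core pre-training objective described above (masked reconstruction of both static appearance and temporal motion) can be sketched as follows. The encoder here is a plain Transformer standing in for LGI-Former, and the masking ratio, dimensions, and loss weighting are illustrative assumptions rather than MAE-DFER's released configuration.

```python
# Hypothetical sketch of joint appearance + temporal-motion masked reconstruction.
import torch
import torch.nn as nn

def random_token_mask(batch, num_tokens, mask_ratio=0.9, device="cpu"):
    """Randomly mark a fixed ratio of spatiotemporal tokens as masked per sample."""
    num_masked = int(num_tokens * mask_ratio)
    ids = torch.rand(batch, num_tokens, device=device).argsort(dim=1)
    mask = torch.zeros(batch, num_tokens, dtype=torch.bool, device=device)
    mask[torch.arange(batch, device=device).unsqueeze(1), ids[:, :num_masked]] = True
    return mask  # True = masked (to be reconstructed)

class MaskedVideoAutoencoder(nn.Module):
    def __init__(self, token_dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(token_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)      # placeholder for LGI-Former
        self.mask_token = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.appearance_head = nn.Linear(token_dim, token_dim)  # static appearance targets
        self.motion_head = nn.Linear(token_dim, token_dim)      # frame-difference (motion) targets

    def forward(self, tokens, appearance_tgt, motion_tgt, mask):
        x = torch.where(mask.unsqueeze(-1), self.mask_token.to(tokens.dtype), tokens)
        z = self.encoder(x)
        loss_app = ((self.appearance_head(z) - appearance_tgt) ** 2)[mask].mean()
        loss_mot = ((self.motion_head(z) - motion_tgt) ** 2)[mask].mean()
        return loss_app + loss_mot  # joint static-appearance + dynamic-motion objective

# Dummy run: 2 clips, 196 tokens each; in practice motion targets come from frame differences.
tok = torch.randn(2, 196, 768)
mask = random_token_mask(2, 196)
loss = MaskedVideoAutoencoder()(tok, tok.clone(), torch.randn(2, 196, 768), mask)
```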
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent
machines. Previous efforts in this area are dominated by the supervised
learning paradigm. Despite significant progress, supervised learning is hitting
a bottleneck due to the longstanding data scarcity issue in AVER. Motivated
by recent advances in self-supervised learning, we propose Hierarchical
Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that
leverages large-scale self-supervised pre-training on vast unlabeled
audio-visual data to promote the advancement of AVER. Following prior art in
self-supervised audio-visual representation learning, HiCMAE adopts two primary
forms of self-supervision for pre-training, namely masked data modeling and
contrastive learning. Unlike these methods, which focus exclusively on
top-layer representations and neglect explicit guidance for intermediate layers,
HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual
feature learning and improve the overall quality of learned representations. To
verify the effectiveness of HiCMAE, we conduct extensive experiments on 9
datasets covering both categorical and dimensional AVER tasks. Experimental
results show that our method significantly outperforms state-of-the-art
supervised and self-supervised audio-visual methods, which indicates that
HiCMAE is a powerful audio-visual emotion representation learner. Codes and
models will be publicly available at https://github.com/sunlicai/HiCMAE.
Comment: Accepted by Information Fusion.
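To make the two self-supervision signals concrete, here is a hypothetical sketch of the contrastive part: a symmetric InfoNCE loss that aligns audio and visual embeddings and is averaged over several encoder depths to mimic hierarchical guidance. The pooling, dimensions, and temperature are assumptions, and the masked-reconstruction terms are omitted; this is not HiCMAE's actual three-pronged strategy.

```python
# Sketch of audio-visual contrastive alignment applied at multiple encoder depths.
import torch
import torch.nn.functional as F

def info_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE: matching audio/video clips in the batch are positives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(audio_layers, video_layers):
    """Average the contrastive loss over several encoder depths, not just the top layer."""
    return sum(info_nce(a.mean(dim=1), v.mean(dim=1))      # mean-pool tokens per layer
               for a, v in zip(audio_layers, video_layers)) / len(audio_layers)

# Dummy run: 3 encoder depths, batch of 4, 50 audio / 196 video tokens, dim 256.
aud = [torch.randn(4, 50, 256) for _ in range(3)]
vid = [torch.randn(4, 196, 256) for _ in range(3)]
loss = hierarchical_contrastive_loss(aud, vid)  # masked-reconstruction terms would be added in practice
```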
Route Model and Interactive Map of Posyandu in Semarang City Using Geolocation and Haversine Based on Mobile Android
The number of mobile phone users with the Android operating system increases from year to year as devices become more affordable. This gives mobile phones potential as a means of disseminating information. As the capital of Central Java, Semarang has a variety of health care facilities, including Posyandu (neighborhood health posts) that are regularly used by the surrounding community. However, not all Posyandu locations are known to the public because of a lack of information about them, so a navigation application for Posyandu in Semarang is needed. The purpose of this research is to design an application model that locates Posyandu in real time using geolocation and the Haversine formula. The method used is the Systems Development Life Cycle. The system is modeled with Use Case Diagrams, Activity Diagrams, Sequence Diagrams, and Class Diagrams. The result of this research is an Android-based navigation map application that provides information about the location, route, and distance of Posyandu. The application is useful for people who need driving directions to a Posyandu location, and it is expected to help the public obtain information about and reach neighborhood health centers (Posyandu) in the city of Semarang.
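The Haversine formula referenced above computes great-circle distance from latitude/longitude pairs; in such an app, distances from the user's geolocation to each Posyandu would be computed this way and sorted to suggest the nearest one. A standard Python implementation is sketched below (the example coordinates are approximate points in Semarang, chosen only for illustration).

```python
# Standard Haversine distance, assuming coordinates in decimal degrees.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance between two latitude/longitude points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Example: distance in km between two approximate points in central Semarang.
print(haversine_km(-6.9903, 110.4229, -6.9837, 110.4090))
```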
SVFAP: Self-supervised Video Facial Affect Perceiver
Video-based facial affect analysis has recently attracted increasing
attention owing to its critical role in human-computer interaction. Previous
studies mainly focus on developing various deep learning architectures and
training them in a fully supervised manner. Although significant progress has
been achieved by these supervised methods, the longstanding lack of large-scale
high-quality labeled data severely hinders their further improvements.
Motivated by the recent success of self-supervised learning in computer vision,
this paper introduces a self-supervised approach, termed Self-supervised Video
Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised
methods. Specifically, SVFAP leverages masked facial video autoencoding to
perform self-supervised pre-training on massive unlabeled facial videos.
Considering that large spatiotemporal redundancy exists in facial videos, we
propose a novel temporal pyramid and spatial bottleneck Transformer as the
encoder of SVFAP, which not only enjoys low computational cost but also
achieves excellent performance. To verify the effectiveness of our method, we
conduct experiments on nine datasets spanning three downstream tasks, including
dynamic facial expression recognition, dimensional emotion recognition, and
personality recognition. Comprehensive results demonstrate that SVFAP can learn
powerful affect-related representations via large-scale self-supervised
pre-training and it significantly outperforms previous state-of-the-art methods
on all datasets. Codes will be available at https://github.com/sunlicai/SVFAP.
Comment: Submitted to IEEE Trans. on Affective Computing (February 8, 2023).
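The spatial-bottleneck idea (attending to a pooled, smaller set of spatial tokens, since facial videos are highly redundant) can be illustrated with the rough PyTorch sketch below. It is not SVFAP's actual temporal pyramid and spatial bottleneck Transformer; all shapes and the pooling factor are invented for the example.

```python
# Illustrative spatial-bottleneck attention block (not SVFAP's encoder).
import torch
import torch.nn as nn

class SpatialBottleneckBlock(nn.Module):
    """Cross-attention from full-resolution tokens to a pooled (bottlenecked) token grid."""
    def __init__(self, dim=512, heads=8, pool=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.AvgPool2d(pool)                         # shrink the spatial token grid
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B*T, h*w, C) patch tokens for every frame of a clip
        bt, n, c = x.shape
        xn = self.norm(x)
        grid = xn.transpose(1, 2).reshape(bt, c, h, w)
        kv = self.pool(grid).flatten(2).transpose(1, 2)        # fewer keys/values per frame
        out, _ = self.attn(xn, kv, kv)                         # queries stay at full resolution
        return x + out                                         # residual update

# Attention cost scales with (num_queries x num_keys); 2x2 pooling cuts the keys ~4x.
block = SpatialBottleneckBlock()
tokens = torch.randn(8 * 16, 14 * 14, 512)                     # 8 clips x 16 frames, 14x14 patches
out = block(tokens, 14, 14)
```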
Explainable Multimodal Emotion Reasoning
Multimodal emotion recognition is an active research topic in artificial
intelligence. Its primary objective is to integrate multi-modalities (such as
acoustic, visual, and lexical clues) to identify human emotional states.
Current works generally assume accurate emotion labels for benchmark datasets
and focus on developing more effective architectures. But due to the inherent
subjectivity of emotions, existing datasets often lack high annotation
consistency, resulting in potentially inaccurate labels. Consequently, models
built on these datasets may struggle to meet the demands of practical
applications. To address this issue, it is crucial to enhance the reliability
of emotion annotations. In this paper, we propose a novel task called
"Explainable Multimodal Emotion Reasoning (EMER)". In contrast to
previous works that primarily focus on predicting emotions, EMER takes a step
further by providing explanations for these predictions. The prediction is
considered correct as long as the reasoning process behind the predicted
emotion is plausible. This paper presents our initial efforts on EMER, where we
introduce a benchmark dataset, establish baseline models, and define evaluation
metrics. Meanwhile, we observe the necessity of integrating multi-faceted
capabilities to deal with EMER. Therefore, we propose the first multimodal
large language model (LLM) in affective computing, called AffectGPT.
We aim to tackle the long-standing challenge of label ambiguity and chart a
path toward more reliable techniques. Furthermore, EMER offers an opportunity
to evaluate the audio-video-text understanding capabilities of recent
multimodal LLMs. To facilitate further research, we make the code and data
available at: https://github.com/zeroQiaoba/AffectGPT.