334 research outputs found
Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition
Various types of sensors have been considered to develop human action
recognition (HAR) models. Robust HAR performance can be achieved by fusing
multimodal data acquired by different sensors. In this paper, we introduce a
new multimodal fusion architecture, referred to as Unified Contrastive Fusion
Transformer (UCFFormer), designed to integrate data with diverse distributions
to enhance HAR performance. Based on the embedding features extracted from each
modality, UCFFormer employs the Unified Transformer to capture the
inter-dependency among embeddings in both time and modality domains. We present
the Factorized Time-Modality Attention to perform self-attention efficiently
for the Unified Transformer. UCFFormer also incorporates contrastive learning
to reduce the discrepancy in feature distributions across various modalities,
thus generating semantically aligned features for information fusion.
Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU
RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance,
outperforming competing methods by considerable margins.
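A minimal sketch of the factorized attention idea follows, assuming input
embeddings laid out as (batch, time, modality, dim): self-attention is applied
along the time axis per modality, then along the modality axis per time step,
instead of jointly over all time-modality tokens. Class and argument names are
our own assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class FactorizedTimeModalityAttention(nn.Module):
    """Sketch: attend over time and modality axes separately."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.modality_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, modality, dim)
        b, t, m, d = x.shape

        # Attention along time: fold modalities into the batch axis.
        xt = x.permute(0, 2, 1, 3).reshape(b * m, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, m, t, d).permute(0, 2, 1, 3)

        # Attention along modality: fold time into the batch axis.
        xm = x.reshape(b * t, m, d)
        xm, _ = self.modality_attn(xm, xm, xm)
        return xm.reshape(b, t, m, d)

# Usage: x = torch.randn(2, 16, 3, 64); y = FactorizedTimeModalityAttention(64)(x)
```

This factorization reduces the attention cost from O((TM)^2) token pairs to
O(T^2 M + M^2 T). For the contrastive part, a generic symmetric InfoNCE loss
between two modalities' pooled embeddings would look like the following; the
abstract does not specify the paper's exact objective, so this is an assumed
stand-in.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(za, zb, temperature=0.1):
    """Symmetric InfoNCE between two modalities' pooled embeddings.
    za, zb: (batch, dim); paired rows are positives, others negatives."""
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```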
Cross-Attention is Not Enough: Incongruity-Aware Multimodal Sentiment Analysis and Emotion Recognition
Fusing multiple modalities for affective computing tasks has proven effective
for performance improvement. However, how multimodal fusion works is not well
understood, and its use in the real world usually results in large model sizes.
In this work, on sentiment and emotion analysis, we first analyze how the
salient affective information in one modality can be affected by the other in
crossmodal attention. We find that inter-modal incongruity exists at the latent
level due to crossmodal attention. Based on this finding, we propose a
lightweight model via Hierarchical Crossmodal Transformer with Modality Gating
(HCT-MG), which determines a primary modality according to its contribution to
the target task and then hierarchically incorporates auxiliary modalities to
alleviate inter-modal incongruity and reduce information redundancy. The
experimental evaluation on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and
IEMOCAP) verifies the efficacy of our approach, showing that it: 1) outperforms
major prior work, achieving competitive results, and successfully recognizes
hard samples; 2) mitigates the inter-modal incongruity at the latent
level when modalities have mismatched affective tendencies; 3) reduces model
size to less than 1M parameters while outperforming existing models of similar
sizes.
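The gating mechanism described above can be sketched as follows: score each
modality's pooled representation, treat the highest-scoring one as primary,
and fold the auxiliary modalities in one at a time via cross-attention
weighted by their scores. This illustrates the mechanism the abstract
describes, not the authors' actual HCT-MG code; all names are assumptions.

```python
import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    """Sketch: pick a primary modality, cross-attend auxiliaries onto it."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # per-modality contribution score
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, modalities):
        # modalities: list of (batch, seq_len, dim) tensors, one per modality
        pooled = torch.stack([m.mean(dim=1) for m in modalities], dim=1)  # (b, M, d)
        scores = self.gate(pooled).squeeze(-1)                            # (b, M)
        primary_idx = scores.mean(dim=0).argmax().item()  # batch-level choice

        fused = modalities[primary_idx]
        weights = scores.softmax(dim=-1)
        for i, aux in enumerate(modalities):
            if i == primary_idx:
                continue
            # Query the primary stream with each auxiliary modality,
            # scaled by its soft contribution weight.
            attended, _ = self.cross_attn(fused, aux, aux)
            fused = fused + weights[:, i, None, None] * attended
        return fused
```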
Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition
RGB-D action and gesture recognition remains an interesting topic in
human-centered scene understanding, primarily due to the multiple granularities
and large variation in human motion. Although many RGB-D based action and
gesture recognition approaches have demonstrated remarkable results by
utilizing highly integrated spatio-temporal representations across multiple
modalities (i.e., RGB and depth data), they still encounter several challenges.
Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion
differences between local clips under different modalities. Secondly, the
intricate nature of highly integrated spatio-temporal modeling can lead to
optimization difficulties. Thirdly, duplicate and unnecessary information adds
complexity and further entangles spatio-temporal modeling. To address
the above issues, we propose an innovative heuristic architecture called
Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture
recognition. The proposed MFST model comprises a 3D Central Difference
Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal
stages. The CDC-Stem enriches fine-grained temporal perception, and the
multiple hierarchical spatio-temporal stages construct dimension-independent
higher-order semantic primitives. Specifically, the CDC-Stem module captures
bottom-level spatio-temporal features and passes them successively to the
following spatio-temporal factored stages to capture the hierarchical spatial
and temporal features through the Multi-Scale Convolution and Transformer
(MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans)
block. The seamless integration of these innovative designs results in a robust
spatio-temporal representation that outperforms state-of-the-art approaches on
RGB-D action and gesture recognition datasets.
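The CDC-Stem builds on central difference convolution. Below is a minimal 3D
version following the widely used formulation (a vanilla convolution minus a
theta-weighted central-difference term); whether the MFST stem matches this
exactly is an assumption on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    """Sketch of 3D central difference convolution."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=kernel_size // 2)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)  # vanilla 3D convolution
        if self.theta == 0:
            return out
        # Central-difference term: the kernel summed over its spatio-temporal
        # extent, applied to the centre voxel as a 1x1x1 convolution.
        w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        diff = F.conv3d(x, w_sum, bias=None, stride=self.conv.stride)
        return out - self.theta * diff
```

With theta = 0 this reduces to a plain 3D convolution; larger theta emphasizes
local spatio-temporal gradients, consistent with the abstract's claim that the
stem "enriches fine-grained temporal perception".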