MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Previous research has studied the task of segmenting cinematic videos into
scenes and into narrative acts. However, these studies have overlooked the
essential task of multimodal alignment and fusion for effectively and
efficiently processing long-form videos (>60min). In this paper, we introduce
Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic
long-video segmentation. MEGA tackles the challenge by leveraging multiple
media modalities. The method coarsely aligns inputs of variable lengths and
different modalities with alignment positional encoding. To maintain temporal
synchronization while reducing computation, we further introduce an enhanced
bottleneck fusion layer which uses temporal alignment. Additionally, MEGA
employs a novel contrastive loss to synchronize and transfer labels across
modalities, enabling act segmentation from labeled synopsis sentences on video
shots. Our experimental results show that MEGA outperforms state-of-the-art
methods on the MovieNet dataset for scene segmentation (with an Average Precision
improvement of +1.19%) and on the TRIPOD dataset for act segmentation (with a Total
Agreement improvement of +5.51%).
Comment: Accepted at ICCV 2023
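As a rough illustration of the abstract's idea of synchronizing and transferring labels across modalities, the following is a minimal, hypothetical sketch of a symmetric cross-modal contrastive loss between synopsis-sentence and video-shot embeddings; the function name, temperature, and one-to-one pairing scheme are assumptions, not the exact MEGA formulation.

```python
# Hypothetical sketch of a cross-modal contrastive loss (InfoNCE-style),
# NOT the exact MEGA loss: it pulls together synopsis-sentence and video-shot
# embeddings that are assumed to be aligned, and pushes apart all other pairs.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, shot_emb, temperature=0.07):
    """text_emb: (N, d) synopsis-sentence embeddings;
       shot_emb: (N, d) video-shot embeddings, row i assumed aligned with text row i."""
    text_emb = F.normalize(text_emb, dim=-1)
    shot_emb = F.normalize(shot_emb, dim=-1)
    logits = text_emb @ shot_emb.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric objective: text-to-shot and shot-to-text retrieval.
    loss_t2s = F.cross_entropy(logits, targets)
    loss_s2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2s + loss_s2t)
```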
Deep Impression: Audiovisual Deep Residual Networks for Multimodal Apparent Personality Trait Recognition
Here, we develop an audiovisual deep residual network for multimodal apparent
personality trait recognition. The network is trained end-to-end for predicting
the Big Five personality traits of people from their videos. That is, the
network does not require any feature engineering or visual analysis such as
face detection, face landmark alignment or facial expression recognition.
The network recently won third place in the ChaLearn First Impressions
Challenge with a test accuracy of 0.9109.
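For concreteness, here is a hedged sketch of late audiovisual fusion for Big Five regression; the branch sizes, the simple audio MLP, and the sigmoid head are illustrative assumptions and do not reproduce the authors' residual architecture.

```python
# A minimal sketch (not the authors' network) of audiovisual late fusion:
# a residual visual branch and a small audio branch are concatenated and
# mapped to five trait scores in [0, 1].
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AudioVisualTraits(nn.Module):
    def __init__(self, audio_dim=128, hidden=256):
        super().__init__()
        self.visual = resnet18(weights=None)      # residual visual branch
        self.visual.fc = nn.Identity()            # expose 512-d frame features
        self.audio = nn.Sequential(               # assumed simple audio branch
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(512 + hidden, 5)    # Big Five outputs

    def forward(self, frames, audio_feats):
        v = self.visual(frames)                   # (B, 512)
        a = self.audio(audio_feats)                # (B, hidden)
        return torch.sigmoid(self.head(torch.cat([v, a], dim=-1)))
```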
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Multimodal alignment facilitates the retrieval of instances from one modality
when queried using another. In this paper, we consider a novel setting where
such an alignment is between (i) instruction steps that are depicted as
assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video
segments from in-the-wild videos that show the assembly actions being carried
out in the real world. To learn this alignment, we introduce a
novel supervised contrastive learning method that learns to align videos with
the subtle details in the assembly diagrams, guided by a set of novel losses.
To study this problem and demonstrate the effectiveness of our method, we
introduce a novel dataset, IAW (Ikea Assembly in the Wild), consisting of 183
hours of videos from diverse furniture assembly collections and nearly 8,300
illustrations from the associated instruction manuals, annotated with
ground-truth alignments. We define two tasks on this dataset: first, nearest
neighbor retrieval between video segments and illustrations, and, second,
alignment of instruction steps and the segments for each video. Extensive
experiments on IAW demonstrate superior performances of our approach against
alternatives.
Comment: Project website:
https://academic.davidz.cn/en/publication/zhang-cvpr-2023
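The first task defined on IAW, nearest-neighbor retrieval between video segments and illustrations, can be pictured as a plain cosine-similarity lookup over precomputed embeddings; the embeddings and the top-k choice below are assumptions, and the paper's encoders and contrastive losses are not reproduced here.

```python
# A hedged sketch of nearest-neighbour retrieval between video-segment and
# illustration embeddings (either side can act as the query or the gallery).
import numpy as np

def retrieve(query_emb, gallery_emb, k=5):
    """query_emb: (Q, d), gallery_emb: (G, d); returns top-k gallery indices per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                 # cosine similarities, shape (Q, G)
    return np.argsort(-sims, axis=1)[:, :k]       # indices of the k most similar items
```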
Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
Multimedia summarization with multimodal output (MSMO) is a recently explored
application in language grounding. It plays an essential role in real-world
applications, e.g., automatically generating cover images and titles for news
articles or providing introductions to online videos. However, existing methods
extract features from the whole video and article and use fusion methods to
select representative content, thus usually ignoring the critical structure and
varying semantics. In this work, we propose a Semantics-Consistent Cross-domain
Summarization (SCCS) model based on optimal transport alignment with visual and
textual segmentation. Specifically, our method first decomposes both the video
and the article into segments to capture their structural semantics. Then SCCS
follows a cross-domain alignment objective with optimal
transport distance, which leverages multimodal interaction to match and select
the visual and textual summary. We evaluated SCCS on three recent multimodal
datasets and demonstrated its effectiveness in producing high-quality
multimodal summaries.
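To make the optimal-transport alignment concrete, the following is a minimal Sinkhorn-Knopp sketch that computes a transport plan between video-segment and text-segment embeddings under uniform marginals; the cosine cost, the regularization epsilon, and the iteration count are illustrative assumptions rather than the SCCS configuration.

```python
# Entropic optimal transport between two sets of segment embeddings via
# Sinkhorn-Knopp scaling; a high transport mass P[i, j] indicates that video
# segment i and text segment j should be matched.
import numpy as np

def sinkhorn_alignment(video_emb, text_emb, eps=0.1, n_iter=100):
    """video_emb: (m, d), text_emb: (n, d); returns an (m, n) transport plan."""
    v_n = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t_n = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - v_n @ t_n.T                      # cosine distance cost matrix
    K = np.exp(-cost / eps)                       # Gibbs kernel
    a = np.full(video_emb.shape[0], 1.0 / video_emb.shape[0])   # uniform marginals
    b = np.full(text_emb.shape[0], 1.0 / text_emb.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iter):                       # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]            # plan P = diag(u) K diag(v)
```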
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social
interactions. It is produced in a multimodal manner, through the usage of words
(text), gestures (vision) and prosodic cues (acoustic). Understanding humor
from these three modalities falls within the boundaries of multimodal language, a
recent research trend in natural language processing that models natural
language as it happens in face-to-face communication. Although humor detection
is an established research area in NLP, it remains understudied in a multimodal
context. This paper presents a diverse multimodal dataset, called
UR-FUNNY, to open the door to understanding multimodal language used in
expressing humor. The dataset and accompanying studies present a framework for
multimodal humor detection for the natural language processing community.
UR-FUNNY is publicly available for research.