MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Previous research has studied the task of segmenting cinematic videos into
scenes and into narrative acts. However, these studies have overlooked the
essential task of multimodal alignment and fusion for effectively and
efficiently processing long-form videos (>60min). In this paper, we introduce
Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic
long-video segmentation. MEGA tackles the challenge by leveraging multiple
media modalities. The method coarsely aligns inputs of variable lengths and
different modalities with alignment positional encoding. To maintain temporal
synchronization while reducing computation, we further introduce an enhanced
bottleneck fusion layer which uses temporal alignment. Additionally, MEGA
employs a novel contrastive loss to synchronize and transfer labels across
modalities, enabling act segmentation from labeled synopsis sentences on video
shots. Our experimental results show that MEGA outperforms state-of-the-art
methods on the MovieNet dataset for scene segmentation (with an Average Precision
improvement of +1.19%) and on the TRIPOD dataset for act segmentation (with a Total
Agreement improvement of +5.51%).
Comment: Accepted at ICCV 2023
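As a rough illustration of the abstract's idea of synchronizing and transferring labels across modalities, the following is a minimal, hypothetical sketch of a symmetric cross-modal contrastive loss between synopsis-sentence and video-shot embeddings; the function name, temperature, and one-to-one pairing scheme are assumptions, not the exact MEGA formulation.

```python
# Hypothetical sketch of a cross-modal contrastive loss (InfoNCE-style),
# NOT the exact MEGA loss: it pulls together synopsis-sentence and video-shot
# embeddings that are assumed to be aligned, and pushes apart all other pairs.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, shot_emb, temperature=0.07):
    """text_emb: (N, d) synopsis-sentence embeddings;
       shot_emb: (N, d) video-shot embeddings, row i assumed aligned with text row i."""
    text_emb = F.normalize(text_emb, dim=-1)
    shot_emb = F.normalize(shot_emb, dim=-1)
    logits = text_emb @ shot_emb.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric objective: text-to-shot and shot-to-text retrieval.
    loss_t2s = F.cross_entropy(logits, targets)
    loss_s2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2s + loss_s2t)
```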
Deep Impression: Audiovisual Deep Residual Networks for Multimodal Apparent Personality Trait Recognition
Here, we develop an audiovisual deep residual network for multimodal apparent
personality trait recognition. The network is trained end-to-end for predicting
the Big Five personality traits of people from their videos. That is, the
network does not require any feature engineering or visual analysis such as
face detection, face landmark alignment or facial expression recognition.
The network recently won third place in the ChaLearn First Impressions
Challenge with a test accuracy of 0.9109.
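For concreteness, here is a hedged sketch of late audiovisual fusion for Big Five regression; the branch sizes, the simple audio MLP, and the sigmoid head are illustrative assumptions and do not reproduce the authors' residual architecture.

```python
# A minimal sketch (not the authors' network) of audiovisual late fusion:
# a residual visual branch and a small audio branch are concatenated and
# mapped to five trait scores in [0, 1].
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AudioVisualTraits(nn.Module):
    def __init__(self, audio_dim=128, hidden=256):
        super().__init__()
        self.visual = resnet18(weights=None)      # residual visual branch
        self.visual.fc = nn.Identity()            # expose 512-d frame features
        self.audio = nn.Sequential(               # assumed simple audio branch
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(512 + hidden, 5)    # Big Five outputs

    def forward(self, frames, audio_feats):
        v = self.visual(frames)                   # (B, 512)
        a = self.audio(audio_feats)                # (B, hidden)
        return torch.sigmoid(self.head(torch.cat([v, a], dim=-1)))
```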
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Multimodal alignment facilitates the retrieval of instances from one modality
when queried using another. In this paper, we consider a novel setting where
such an alignment is between (i) instruction steps that are depicted as
assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video
segments from in-the-wild videos that show the assembly actions being carried
out in the real world. To learn this alignment, we introduce a
novel supervised contrastive learning method that learns to align videos with
the subtle details in the assembly diagrams, guided by a set of novel losses.
To study this problem and demonstrate the effectiveness of our method, we
introduce a novel dataset, IAW (Ikea Assembly in the Wild), consisting of 183
hours of videos from diverse furniture assembly collections and nearly 8,300
illustrations from the associated instruction manuals, annotated with
ground-truth alignments. We define two tasks on this dataset: first, nearest
neighbor retrieval between video segments and illustrations, and, second,
alignment of instruction steps and the segments for each video. Extensive
experiments on IAW demonstrate superior performances of our approach against
alternatives.
Comment: Project website:
https://academic.davidz.cn/en/publication/zhang-cvpr-2023
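The first task defined on IAW, nearest-neighbor retrieval between video segments and illustrations, can be pictured as a plain cosine-similarity lookup over precomputed embeddings; the embeddings and the top-k choice below are assumptions, and the paper's encoders and contrastive losses are not reproduced here.

```python
# A hedged sketch of nearest-neighbour retrieval between video-segment and
# illustration embeddings (either side can act as the query or the gallery).
import numpy as np

def retrieve(query_emb, gallery_emb, k=5):
    """query_emb: (Q, d), gallery_emb: (G, d); returns top-k gallery indices per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                 # cosine similarities, shape (Q, G)
    return np.argsort(-sims, axis=1)[:, :k]       # indices of the k most similar items
```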
Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
Multimedia summarization with multimodal output (MSMO) is a recently explored
application in language grounding. It plays an essential role in real-world
applications, e.g., automatically generating cover images and titles for news
articles or providing introductions to online videos. However, existing methods
extract features from the whole video and article and use fusion methods to
select representative content, thus usually ignoring the critical structure and
varying semantics. In this work, we propose a Semantics-Consistent Cross-domain
Summarization (SCCS) model based on optimal transport alignment with visual and
textual segmentation. Specifically, our method first decomposes both the video
and the article into segments to capture their structural semantics. Then SCCS
follows a cross-domain alignment objective with optimal
transport distance, which leverages multimodal interaction to match and select
the visual and textual summary. We evaluated SCCS on three recent multimodal
datasets and demonstrated its effectiveness in producing high-quality
multimodal summaries.
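To make the optimal-transport alignment concrete, the following is a minimal Sinkhorn-Knopp sketch that computes a transport plan between video-segment and text-segment embeddings under uniform marginals; the cosine cost, the regularization epsilon, and the iteration count are illustrative assumptions rather than the SCCS configuration.

```python
# Entropic optimal transport between two sets of segment embeddings via
# Sinkhorn-Knopp scaling; a high transport mass P[i, j] indicates that video
# segment i and text segment j should be matched.
import numpy as np

def sinkhorn_alignment(video_emb, text_emb, eps=0.1, n_iter=100):
    """video_emb: (m, d), text_emb: (n, d); returns an (m, n) transport plan."""
    v_n = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t_n = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - v_n @ t_n.T                      # cosine distance cost matrix
    K = np.exp(-cost / eps)                       # Gibbs kernel
    a = np.full(video_emb.shape[0], 1.0 / video_emb.shape[0])   # uniform marginals
    b = np.full(text_emb.shape[0], 1.0 / text_emb.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iter):                       # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]            # plan P = diag(u) K diag(v)
```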
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social
interactions. It is produced in a multimodal manner, through the usage of words
(text), gestures (vision) and prosodic cues (acoustic). Understanding humor
from these three modalities falls within the boundaries of multimodal language, a
recent research trend in natural language processing that models natural
language as it happens in face-to-face communication. Although humor detection
is an established research area in NLP, it remains understudied in a multimodal
context. This paper presents a diverse multimodal dataset, called
UR-FUNNY, to open the door to understanding multimodal language used in
expressing humor. The dataset and accompanying studies present a framework for
multimodal humor detection for the natural language processing community.
UR-FUNNY is publicly available for research.