MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Previous research has studied the task of segmenting cinematic videos into
scenes and into narrative acts. However, these studies have overlooked the
essential task of multimodal alignment and fusion for effectively and
efficiently processing long-form videos (>60min). In this paper, we introduce
Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic
long-video segmentation. MEGA tackles the challenge by leveraging multiple
media modalities. The method coarsely aligns inputs of variable lengths and
different modalities with alignment positional encoding. To maintain temporal
synchronization while reducing computation, we further introduce an enhanced
bottleneck fusion layer which uses temporal alignment. Additionally, MEGA
employs a novel contrastive loss to synchronize and transfer labels across
modalities, enabling act segmentation from labeled synopsis sentences on video
shots. Our experimental results show that MEGA outperforms state-of-the-art
methods on the MovieNet dataset for scene segmentation (with an Average
Precision improvement of +1.19%) and on the TRIPOD dataset for act segmentation
(with a Total Agreement improvement of +5.51%).
Comment: ICCV 2023 accepted.
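A minimal sketch of what the alignment positional encoding could look like, assuming it amounts to encoding normalized temporal position rather than absolute token index, so that streams of very different lengths (e.g., shot features and synopsis sentences) share one time axis; the paper's actual formulation may differ.

```python
# Hypothetical sketch (not the authors' code): encode *normalized* time so that
# a long shot sequence and a short synopsis sequence receive comparable codes.
import math
import torch

def alignment_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of normalized positions t in [0, 1]; returns (seq_len, dim)."""
    t = torch.linspace(0.0, 1.0, steps=seq_len).unsqueeze(1)                 # (L, 1)
    freqs = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))  # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(t * freqs)
    pe[:, 1::2] = torch.cos(t * freqs)
    return pe

# Shots and synopsis sentences now live on the same [0, 1] time axis.
video_pe = alignment_positional_encoding(seq_len=1200, dim=256)  # shot features
text_pe = alignment_positional_encoding(seq_len=40, dim=256)     # synopsis sentences
```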
Scene Consistency Representation Learning for Video Scene Segmentation
A long-term video, such as a movie or TV show, is composed of various scenes,
each of which represents a series of shots sharing the same semantic story.
Spotting the correct scene boundary from the long-term video is a challenging
task, since a model must understand the storyline of the video to figure out
where a scene starts and ends. To this end, we propose an effective
Self-Supervised Learning (SSL) framework to learn better shot representations
from unlabeled long-term videos. More specifically, we present an SSL scheme to
achieve scene consistency, while exploring considerable data augmentation and
shuffling methods to boost the model generalizability. Instead of explicitly
learning the scene boundary features as in the previous methods, we introduce a
vanilla temporal model with less inductive bias to verify the quality of the
shot features. Our method achieves state-of-the-art performance on the task of
Video Scene Segmentation. Additionally, we suggest a fairer and more reasonable
benchmark to evaluate the performance of Video Scene Segmentation methods. The
code is made available.
Comment: Accepted to CVPR 2022.
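A minimal sketch of a scene-consistency objective of the kind described, assuming temporally adjacent shots are treated as positive pairs in an NT-Xent-style contrastive loss; the paper's exact sampling strategy and loss may differ.

```python
# Hypothetical sketch: shots sampled from the same temporal neighborhood are
# treated as positives; all other shots in the batch act as negatives.
import torch
import torch.nn.functional as F

def scene_consistency_loss(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """anchor/positive: (B, D) shot embeddings from nearby shots."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positives
    return F.cross_entropy(logits, targets)
```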
A Multi-Stream Approach for Video Understanding
The automatic annotation of higher-level semantic information in long-form
video content is still a challenging task. The Deep Video Understanding (DVU)
Challenge aims at catalyzing progress in this area by offering common data and
tasks. In this paper, we present our contribution to the 3rd DVU challenge. Our
approach consists of multiple information streams extracted from the visual and
audio modalities. The streams can build on information generated by previous
streams to increase their semantic descriptiveness. Finally, the outputs of all
streams are aggregated into a graph representation of the input movie that
captures the semantic relationships between the relevant characters.
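A minimal sketch of the final aggregation step, assuming each stream emits (character, character, relation, score) tuples that are merged into a single graph; the tuple format, stream names, and the use of networkx are illustrative assumptions, not the authors' interface.

```python
# Hypothetical sketch: merge per-stream relation predictions into one movie graph.
import networkx as nx

def aggregate_streams(stream_outputs):
    """stream_outputs: list of lists of (char_a, char_b, relation, score) tuples."""
    graph = nx.MultiGraph()
    for stream in stream_outputs:
        for char_a, char_b, relation, score in stream:
            graph.add_edge(char_a, char_b, relation=relation, score=score)
    return graph

movie_graph = aggregate_streams([
    [("Alice", "Bob", "colleague_of", 0.82)],   # e.g., visual stream
    [("Alice", "Carol", "sister_of", 0.67)],    # e.g., dialogue/audio stream
])
```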
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Video-grounded dialogue understanding is a challenging problem that requires a
machine to perceive, parse, and reason over situated semantics extracted from
weakly aligned video and dialogues. Most existing benchmarks treat both
modalities the same way, as a frame-independent visual understanding task, while
neglecting the intrinsic attributes in multimodal dialogues, such as scene and
topic transitions. In this paper, we present Video-grounded Scene&Topic AwaRe
dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding
dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for
video-grounded dialogue understanding: scene segmentation and topic
segmentation, and one benchmark for video-grounded dialogue generation.
Comprehensive experiments are performed on these benchmarks to demonstrate the
importance of multimodal information and segments in video-grounded dialogue
understanding and generation.
Comment: To appear at ACL 2023.
Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies
Movie highlights stand out from the full screenplay, enable efficient browsing,
and play a crucial role on social media platforms. Building on existing efforts,
this work makes two observations: (1) highlight labeling varies across
annotators, which leads to inaccurate and time-consuming annotations; (2) beyond
previous supervised or unsupervised settings, some existing video corpora, e.g.,
trailers, can be useful, but they are often noisy and too incomplete to cover
all highlights. In this work, we study a more practical and
promising setting, i.e., reformulating highlight detection as "learning with
noisy labels". This setting does not require time-consuming manual annotations
and can fully utilize existing abundant video corpora. First, based on movie
trailers, we leverage scene segmentation to obtain complete shots, which are
regarded as noisy labels. Then, we propose a Collaborative noisy Label Cleaner
(CLC) framework to learn from noisy highlight moments. CLC consists of two
modules: augmented cross-propagation (ACP) and multi-modality cleaning (MMC).
The former aims to exploit the closely related audio-visual signals and fuse
them to learn unified multi-modal representations. The latter aims to achieve
cleaner highlight labels by observing the changes in losses among different
modalities. To verify the effectiveness of CLC, we further collect a
large-scale highlight dataset named MovieLights. Comprehensive experiments on
MovieLights and YouTube Highlights datasets demonstrate the effectiveness of
our approach. Code has been made available at:
https://github.com/TencentYoutuResearch/HighlightDetection-CLC
Comment: Accepted to CVPR 2023.
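A minimal sketch of loss-based label cleaning in the spirit of the MMC module, assuming a common small-loss heuristic: a trailer-derived shot is kept as a highlight label only when both modality branches fit it with low loss. The actual MMC module is more involved.

```python
# Hypothetical sketch: keep labels that both modalities find easy to fit.
import torch

def clean_labels(visual_loss: torch.Tensor,
                 audio_loss: torch.Tensor,
                 keep_ratio: float = 0.7) -> torch.Tensor:
    """Per-shot losses (N,) from each modality branch -> boolean keep mask (N,)."""
    def small_loss_mask(loss: torch.Tensor) -> torch.Tensor:
        k = max(1, int(keep_ratio * loss.numel()))
        threshold = loss.sort().values[k - 1]   # loss value of the k-th easiest shot
        return loss <= threshold
    return small_loss_mask(visual_loss) & small_loss_mask(audio_loss)
```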
Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation
Temporal video segmentation and classification have been advanced greatly by
public benchmarks in recent years. However, such research still mainly focuses
on human actions, failing to describe videos in a holistic view. In addition,
previous research tends to pay much attention to visual information yet ignores
the multi-modal nature of videos. To fill this gap, we construct the Tencent
'Ads Video Segmentation' (TAVS) dataset in the ads domain to escalate
multi-modal video analysis to a new level. TAVS describes videos from three
independent perspectives as 'presentation form', 'place', and 'style', and
contains rich multi-modal information such as video, audio, and text. TAVS is
organized hierarchically in semantic aspects for comprehensive temporal video
segmentation with three levels of categories for multi-label classification,
e.g., 'place' - 'working place' - 'office'. Therefore, TAVS is distinguished
from previous temporal segmentation datasets due to its multi-modal
information, holistic view of categories, and hierarchical granularities. It
includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500
labels. Along with TAVS, we also present a strong multi-modal video
segmentation baseline coupled with multi-label class prediction. Extensive
experiments are conducted to evaluate our proposed method as well as existing
representative methods to reveal key challenges of our dataset TAVS.
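A small data-structure sketch of how a TAVS-style segment annotation could be represented, with three independent perspectives and three-level label paths; field names and example values are illustrative assumptions, not the released schema.

```python
# Illustrative only: one way to hold a segment's hierarchical multi-label
# annotations across the three perspectives. Not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SegmentAnnotation:
    start_shot: int
    end_shot: int
    # Each perspective holds one or more coarse->fine label paths (multi-label).
    presentation_form: List[Tuple[str, str, str]] = field(default_factory=list)
    place: List[Tuple[str, str, str]] = field(default_factory=list)
    style: List[Tuple[str, str, str]] = field(default_factory=list)

segment = SegmentAnnotation(
    start_shot=12,
    end_shot=18,
    place=[("place", "working place", "office")],
)
```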