SELF-VS: Self-supervised Encoding Learning For Video Summarization
Despite its wide range of applications, video summarization is still held
back by the scarcity of extensive datasets, largely due to the labor-intensive
and costly nature of frame-level annotations. As a result, existing video
summarization methods are prone to overfitting. To mitigate this challenge, we
propose a novel self-supervised video representation learning method using
knowledge distillation to pre-train a transformer encoder. Our method matches
its semantic video representation, which is constructed with respect to frame
importance scores, to a representation derived from a CNN trained on video
classification. Empirical evaluations on correlation-based metrics, such as
Kendall's τ and Spearman's ρ, demonstrate the superiority of our
approach over existing state-of-the-art methods in assigning relative
scores to the input frames. Comment: 9 pages, 5 figures
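The abstract above describes matching an importance-weighted video representation from the transformer student to a representation from a pretrained CNN teacher. A minimal sketch of that distillation objective, using NumPy stand-ins for both encoders (the weighting scheme, feature dimensions, and cosine-based loss are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def importance_weighted_pool(frame_feats, scores):
    """Pool per-frame features into one semantic video representation,
    weighting each frame by its (softmax-normalized) importance score."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()                          # softmax over frames
    return (w[:, None] * frame_feats).sum(axis=0)

def distillation_loss(student_repr, teacher_repr):
    """1 - cosine similarity between the student (transformer) video
    representation and the teacher (CNN) representation; 0 when aligned."""
    s = student_repr / np.linalg.norm(student_repr)
    t = teacher_repr / np.linalg.norm(teacher_repr)
    return 1.0 - float(s @ t)

# Toy example: 8 frames with 16-dim features and hypothetical scores.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))            # student frame features
scores = rng.normal(size=8)                  # predicted importance scores
video_repr = importance_weighted_pool(frames, scores)
teacher_repr = rng.normal(size=16)           # stand-in CNN representation
loss = distillation_loss(video_repr, teacher_repr)
```

Minimizing such a loss pushes the student's importance-weighted summary toward the teacher's semantics, which is the general shape of the pre-training signal the abstract describes.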
MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization
Video summarization intends to produce a concise video summary by effectively
capturing and combining the most informative parts of the whole content.
Existing approaches for video summarization regard the task as a frame-wise
keyframe selection problem and generally construct the frame-wise
representation by combining the long-range temporal dependency with the
unimodal or bimodal information. However, an optimal video summary must
reflect both the value of each keyframe in its own right and its semantic
relation to the whole content. Thus, it is critical to construct a more
powerful and robust frame-wise representation and to predict the frame-level
importance score in a fair and comprehensive manner. To tackle the above
issues, we propose a multimodal hierarchical shot-aware convolutional network,
denoted as MHSCNet, to enhance the frame-wise representation via combining the
comprehensive available multimodal information. Specifically, we design a
hierarchical ShotConv network to incorporate the adaptive shot-aware
frame-level representation by considering the short-range and long-range
temporal dependency. Based on the learned shot-aware representations, MHSCNet
can predict the frame-level importance score in the local and global view of
the video. Extensive experiments on two standard video summarization datasets
demonstrate that our proposed method consistently outperforms state-of-the-art
baselines. Source code will be made publicly available.
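The abstract above describes scoring each frame from both a local (shot-aware) and a global (whole-video) view. A minimal sketch of that idea, replacing the hierarchical ShotConv network with simple mean-pooled contexts (the window size, similarity scoring, and normalization are illustrative assumptions, not MHSCNet's architecture):

```python
import numpy as np

def shot_aware_scores(frame_feats, shot_len=4):
    """Score each frame by its similarity to a local shot-level context
    (short-range view) and to a global video context (long-range view),
    then normalize the scores to [0, 1]."""
    n, _ = frame_feats.shape
    global_ctx = frame_feats.mean(axis=0)            # global view
    scores = np.empty(n)
    for i in range(n):
        lo = max(0, i - shot_len // 2)
        hi = min(n, i + shot_len // 2 + 1)
        local_ctx = frame_feats[lo:hi].mean(axis=0)  # local shot view
        scores[i] = frame_feats[i] @ local_ctx + frame_feats[i] @ global_ctx
    span = np.ptp(scores)                            # max - min
    return (scores - scores.min()) / (span + 1e-8)

# Toy example: 12 frames with 8-dim features.
rng = np.random.default_rng(1)
scores = shot_aware_scores(rng.normal(size=(12, 8)))
```

Combining the two views lets a frame score highly either because it is central to its own shot or because it is representative of the video overall, mirroring the local/global prediction the abstract describes.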