An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos
In this work, we present an integrated system for spatiotemporal
summarization of 360-degrees videos. The video summary production mainly
involves the detection of salient events and their synopsis into a concise
summary. The analysis relies on state-of-the-art methods for saliency detection
in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It
also contains a mechanism that classifies a 360-degrees video based on the use
of static or moving camera during recording and decides which saliency
detection method will be used, as well as a 2D video production component that
is responsible to create a conventional 2D video containing the salient events
in the 360-degrees video. Quantitative evaluations using two datasets for
360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the
accuracy and positive impact of the developed decision mechanism, and justify
our choice to use two different methods for detecting the salient events. A
qualitative analysis using content from these datasets gives further insights
about the functionality of the decision mechanism, shows the pros and cons of
each saliency detection method used, and demonstrates the improved performance
of the trained summarization method compared to a more conventional approach.
Comment: Accepted for publication, 30th Int. Conf. on MultiMedia Modeling (MMM
2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript"
version.
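A minimal illustration of the kind of decision mechanism described above, assuming it relies on global camera motion; the abstract does not specify how static and moving recordings are distinguished, so the optical-flow heuristic and the threshold below are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def classify_camera_motion(frames, flow_threshold=1.0):
    """Label a recording as made with a 'static' or 'moving' camera, using the
    mean dense optical-flow magnitude between consecutive frames
    (an illustrative heuristic; the threshold value is an assumption)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    magnitudes = []
    for prev, curr in zip(grays, grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    return "moving" if np.mean(magnitudes) > flow_threshold else "static"
```

The returned label would then route the video to one of the two saliency detectors (ATSal or SST-Sal) before CA-SUM produces the summary; which detector handles which camera mode is not stated in the abstract above.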
Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention
We propose a method to detect individualized highlights for users on given
target videos based on their preferred highlight clips marked on previous
videos they have watched. Our method explicitly leverages the contents of both
the preferred clips and the target videos using pre-trained features for the
objects and the human activities. We design a multi-head attention mechanism to
adaptively weigh the preferred clips based on their object- and
human-activity-based contents, and fuse them using these weights into a single
feature representation for each user. We compute similarities between these
per-user feature representations and the per-frame features computed from the
desired target videos to estimate the user-specific highlight clips from the
target videos. We test our method on a large-scale highlight detection dataset
containing the annotated highlights of individual users. Compared to current
baselines, we observe an absolute improvement of 2-4% in the mean average
precision of the detected highlights. We also perform extensive ablation
experiments on the number of preferred highlight clips associated with each
user as well as on the object- and human-activity-based feature representations
to validate that our method is indeed both content-based and user-specific.
Comment: 14 pages, 5 figures, 7 tables
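As a rough sketch of the content-based fusion and scoring the abstract describes (not the authors' code): features of a user's preferred clips are fused with multi-head attention into a single per-user representation, and the frames of a target video are scored by similarity to it. The feature dimension, head count, and learned user query token below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserHighlightScorer(nn.Module):
    """Illustrative sketch: fuse a user's preferred highlight clips into one
    representation via multi-head attention, then score target-video frames
    by cosine similarity to that representation."""

    def __init__(self, feat_dim=512, num_heads=4):
        super().__init__()
        # A learned query token stands in for the "user" slot; the attention
        # weights over the preferred clips provide the adaptive weighing.
        self.user_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, clip_feats, frame_feats):
        # clip_feats:  (B, n_clips, feat_dim)  object/activity features of the
        #              highlight clips the user marked on previous videos
        # frame_feats: (B, n_frames, feat_dim) per-frame features of the target video
        query = self.user_query.expand(clip_feats.size(0), -1, -1)
        user_repr, _ = self.attn(query, clip_feats, clip_feats)  # (B, 1, feat_dim)
        # Per-frame highlight scores for this user.
        return F.cosine_similarity(user_repr, frame_feats, dim=-1)  # (B, n_frames)
```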