Personalized video summarization by highest quality frames
In this work, a user-centered approach forms the basis for generating personalized video summaries. First, video experts score and annotate the video frames during an enrichment phase. The frame scores for the different video segments are then updated according to the priorities that end users (who are distinct from the video experts) express towards the existing video scenes. Finally, given a pre-defined skimming time, the highest-scored video frames are extracted and included in the personalized video summary. To evaluate the effectiveness of the proposed model, we compared the video summaries generated by our system against the results of four other summarization tools that use different modalities
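A minimal sketch of the scoring pipeline this abstract describes, assuming a simple data layout (per-frame expert scores, per-scene end-user priority weights); the function and parameter names such as `personalized_summary`, `user_priority`, and `skim_frames` are illustrative, not the authors' implementation:

```python
def personalized_summary(expert_scores, scene_of_frame, user_priority,
                         skim_frames):
    """Select the highest-scored frames for a personalized summary.

    expert_scores  : list[float], one score per frame from the experts
    scene_of_frame : list[int], scene id for each frame
    user_priority  : dict[int, float], end-user weight per scene id
    skim_frames    : int, frame budget implied by the skimming time
    """
    # Update each frame score by the end user's priority for its scene.
    adjusted = [s * user_priority.get(scene_of_frame[i], 1.0)
                for i, s in enumerate(expert_scores)]
    # Keep the top-scoring frames, then restore temporal order.
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i],
                 reverse=True)[:skim_frames]
    return sorted(top)

# Example: 6 frames in 2 scenes; this user cares more about scene 1,
# so scene 1 frames dominate the three-frame budget.
frames = personalized_summary(
    expert_scores=[0.2, 0.9, 0.4, 0.8, 0.5, 0.3],
    scene_of_frame=[0, 0, 0, 1, 1, 1],
    user_priority={0: 0.5, 1: 1.5},
    skim_frames=3)
print(frames)  # indices of the frames to include in the summary
```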
Video summarization by group scoring
In this paper a new model for user-centered video summarization is presented. Its main use case is the involvement of more than one expert in generating the final video summary. The approach consists of three major steps. First, the video frames are scored by a group of operators. Next, the assigned scores are averaged to produce a single value for each frame. Lastly, the highest-scored video frames, along with the corresponding audio and textual content, are extracted and inserted into the summary. The effectiveness of this approach has been evaluated by comparing the video summaries generated by this system against the results of a number of automatic summarization tools that use different modalities for abstraction
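A hedged sketch of the three steps (operators score each frame, scores are averaged per frame, top frames are kept); the names and array layout are assumptions for illustration:

```python
import numpy as np

def group_summary(operator_scores, summary_length):
    """operator_scores: 2-D array, shape (num_operators, num_frames)."""
    mean_scores = np.asarray(operator_scores).mean(axis=0)  # step 2: average
    # Step 3: indices of the highest-scored frames, in temporal order;
    # the corresponding audio/text segments would be extracted alongside.
    top = np.argsort(mean_scores)[::-1][:summary_length]
    return np.sort(top)

scores = [[0.1, 0.7, 0.3, 0.9],   # operator A
          [0.2, 0.8, 0.4, 0.6]]   # operator B
print(group_summary(scores, summary_length=2))  # -> [1 3]
```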
Quality-Oriented Perceptual HEVC Based on the Spatiotemporal Saliency Detection Model
Perceptual video coding (PVC) can achieve a lower bitrate at the same visual quality than traditional H.265/High Efficiency Video Coding (HEVC). In this work, a novel H.265/HEVC-compliant PVC framework is proposed based on a video saliency model. First, an effective and efficient spatiotemporal saliency model generates a video saliency map. Second, a perceptual coding scheme is developed on top of the saliency map, with a saliency-based quantization control algorithm that reduces the bitrate. Finally, simulation results demonstrate the superiority of the proposed perceptual coding scheme in both objective and subjective tests, achieving up to a 9.46% bitrate reduction with negligible subjective and objective quality loss. The method's high quality makes it well suited to high-definition video applications
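An illustrative sketch of saliency-driven quantization control: raise the QP (coarser quantization, fewer bits) in low-saliency coding blocks and keep or lower it in salient ones. The linear offset rule and `max_offset` value are assumptions, not the paper's exact algorithm; only the 0-51 QP range is standard HEVC:

```python
import numpy as np

def saliency_qp_map(saliency, base_qp=32, max_offset=6):
    """saliency: 2-D map in [0, 1], one value per coding block (CTU)."""
    s = np.asarray(saliency, dtype=float)
    # Low saliency -> positive offset (coarser), high saliency -> negative.
    offset = np.round(max_offset * (0.5 - s) * 2).astype(int)
    return np.clip(base_qp + offset, 0, 51)  # valid HEVC QP range

saliency = np.array([[0.9, 0.1],
                     [0.5, 0.2]])
print(saliency_qp_map(saliency))  # salient blocks get the lower QPs
```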
Panoramic Vision Transformer for Saliency Detection in 360° Videos
Saliency detection in 360° video is a challenging benchmark for 360° video understanding, since non-negligible distortion and discontinuity occur in the projection of any 360° video format, and the capture-worthy viewpoint on the omnidirectional sphere is inherently ambiguous. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using a Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or finetuning, but also to perform the geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features, outperforming state-of-the-art models on the Wild360 benchmark by large margins without supervision or auxiliary information such as class activation. We demonstrate the utility of our saliency prediction model on the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement. Comment: Published at ECCV 2022
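A very loose illustration of the underlying idea of scoring patch saliency from relations among patch features, here reduced to how much each patch deviates from the global context; this is an assumption-level sketch of the concept, not PAVER's actual three relative relations:

```python
import torch
import torch.nn.functional as F

def patch_saliency(patch_feats):
    """patch_feats: (num_patches, dim) tokens from a ViT-style encoder."""
    ctx = patch_feats.mean(dim=0, keepdim=True)          # global context
    sim = F.cosine_similarity(patch_feats, ctx, dim=-1)  # agreement with it
    return 1.0 - sim  # patches unlike the global context score as salient

feats = torch.randn(16, 768)        # e.g. 16 patch tokens of dimension 768
print(patch_saliency(feats).shape)  # torch.Size([16])
```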
Video Salient Object Detection via Fully Convolutional Networks
This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: 1) training a deep video saliency model in the absence of sufficiently large, pixel-wise annotated video data, and 2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules that capture spatial and temporal saliency information, respectively. The dynamic saliency model, which explicitly incorporates saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, which enables our network to learn diverse saliency information and prevents overfitting given the limited number of training videos. Leveraging our synthetic video data (150K video sequences) together with real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, producing accurate spatiotemporal saliency estimates. We advance the state of the art on the densely annotated video segmentation dataset (MAE of .06) and the Freiburg-Berkeley Motion Segmentation dataset (MAE of .07), and do so at much improved speed (2 fps with all steps)
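A rough sketch of the two-module design the abstract describes: a static branch predicts per-frame saliency, and a dynamic branch consumes a frame pair plus the static estimate to output spatiotemporal saliency without optical flow. Layer shapes and class names are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class StaticBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())  # per-pixel static saliency

    def forward(self, frame):
        return self.net(frame)

class DynamicBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: two RGB frames (6 ch) + the static saliency map (1 ch).
        self.net = nn.Sequential(
            nn.Conv2d(7, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, prev_frame, cur_frame, static_map):
        x = torch.cat([prev_frame, cur_frame, static_map], dim=1)
        return self.net(x)  # spatiotemporal saliency, no optical flow

static, dynamic = StaticBranch(), DynamicBranch()
f0, f1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
out = dynamic(f0, f1, static(f1))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```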