Towards Semantic Fast-Forward and Stabilized Egocentric Videos
The emergence of low-cost personal mobile devices and wearable cameras,
together with the increasing storage capacity of video-sharing websites, has
fueled a growing interest in first-person videos. Since most recorded videos
are long-running streams of unedited content, they are tedious and
unpleasant to watch. State-of-the-art fast-forward methods face the
challenge of balancing the smoothness of the video against the emphasis on
relevant frames for a given speed-up rate. In this work, we present a methodology
capable of summarizing and stabilizing egocentric videos by extracting the
semantic information from the frames. This paper also describes a dataset
collection with several semantically labeled videos and introduces a new
smoothness evaluation metric for egocentric videos that is used to test our
method.

Comment: Accepted for publication and presented in the First International
Workshop on Egocentric Perception, Interaction and Computing at European
Conference on Computer Vision (EPIC@ECCV) 201
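The abstract mentions a new smoothness evaluation metric for egocentric fast-forward videos but does not define it. A toy proxy for such a metric (purely illustrative, not the paper's definition) might score how evenly spaced the selected frames are, since uneven jumps read as visual jitter:

```python
def jitter_score(selected):
    """Toy smoothness proxy (NOT the paper's metric): variance of the gaps
    between consecutive selected frame indexes. Lower values mean more
    evenly spaced frames, i.e. a steadier fast-forward."""
    gaps = [b - a for a, b in zip(selected, selected[1:])]
    mean = sum(gaps) / len(gaps)
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)
```

A perfectly regular selection such as every other frame scores 0, while irregular jumps score higher.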
Long-Term Human Video Generation of Multiple Futures Using Poses
Predicting future human behavior from an input human video is a useful task
for applications such as autonomous driving and robotics. While most previous
works predict a single future, multiple futures with different behavior can
potentially occur. Moreover, if the predicted future is too short (e.g., less
than one second), it may not be fully usable by a human or other systems. In
this paper, we propose a novel method for future human pose prediction capable
of predicting multiple long-term futures. This makes the predictions more
suitable for real applications. Also, from the input video and the predicted
human behavior, we generate future videos. First, from an input human video, we
generate sequences of future human poses (i.e., the image coordinates of their
body-joints) via adversarial learning. Adversarial learning suffers from mode
collapse, which makes it difficult to generate a variety of multiple poses. We
solve this problem by utilizing two additional inputs to the generator to make
the outputs diverse, namely, a latent code (to reflect various behaviors) and
an attraction point (to reflect various trajectories). In addition, we generate
long-term future human poses using a novel approach based on unidimensional
convolutional neural networks. Last, we generate an output video based on the
generated poses for visualization. We evaluate the generated future poses and
videos using three criteria (i.e., realism, diversity and accuracy), and show
that our proposed method outperforms state-of-the-art methods.
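The two extra generator inputs described above, a latent code for behavioral diversity and an attraction point for trajectory diversity, can be sketched with a toy generator. This is a hypothetical stand-in for the paper's adversarially trained model, not its actual architecture:

```python
import numpy as np

def generate_future_poses(last_pose, latent_code, attraction_point, n_frames=30):
    """Toy sketch of multi-future pose generation (illustrative only).
    The trajectory drifts toward the attraction point (trajectory control),
    while the latent code seeds the noise (behavioral diversity), so
    different codes yield different futures from the same input pose."""
    pose = np.asarray(last_pose, dtype=float)
    target = np.asarray(attraction_point, dtype=float)
    # seed the noise from the latent code so each code gives a distinct future
    rng = np.random.default_rng(abs(hash(tuple(latent_code))) % (2**32))
    poses = []
    for _ in range(n_frames):
        # move a fraction of the way toward the attraction point ...
        pose = pose + 0.1 * (target - pose)
        # ... perturbed by latent-code-driven noise
        pose = pose + 0.01 * rng.standard_normal(pose.shape)
        poses.append(pose.copy())
    return np.stack(poses)
```

Feeding the same input pose with two different latent codes yields two distinct long-term futures, which is the mode-collapse countermeasure the abstract describes.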
Fast-Forward Video Based on Semantic Extraction
Thanks to the low operational cost and large storage capacity of smartphones
and wearable devices, people are recording many hours of daily activities,
sport actions and home videos. These videos, also known as egocentric videos,
are generally long-running streams of unedited content, which makes them
boring and visually unpalatable, raising the challenge of making egocentric
videos more appealing. In this work, we propose a novel methodology to compose
the new fast-forward video by selecting frames based on semantic information
extracted from images. The experiments show that our approach outperforms the
state-of-the-art as far as semantic information is concerned, and that it
also produces videos that are more pleasant to watch.

Comment: Accepted for publication and presented in 2016 IEEE International
Conference on Image Processing (ICIP)
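Selecting fast-forward frames by semantic content can be illustrated with a minimal greedy sketch: pick the most semantically relevant frame within each window of `speedup` consecutive frames. This is a simplified stand-in for the paper's method, which additionally optimizes for smooth transitions:

```python
def semantic_fast_forward(scores, speedup):
    """Toy frame selection for semantic fast-forward (illustrative only).
    `scores` holds one semantic-relevance score per frame; within each
    window of `speedup` frames, keep the highest-scoring frame, so the
    output is roughly `speedup` times shorter than the input."""
    selected = []
    for start in range(0, len(scores), speedup):
        window = scores[start:start + speedup]
        # index (in the full video) of the best frame in this window
        best = start + max(range(len(window)), key=window.__getitem__)
        selected.append(best)
    return selected
```

A real system would score frames with a detector or classifier (e.g. the presence of people or objects of interest) and trade semantic gain against jump length rather than using fixed windows.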
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Video generation necessitates both global coherence and local realism. This
work presents a novel non-autoregressive method GLOBER, which first generates
global features to obtain comprehensive global guidance and then synthesizes
video frames based on the global features to generate coherent videos.
Specifically, we propose a video auto-encoder, where a video encoder encodes
videos into global features, and a video decoder, built on a diffusion model,
decodes the global features and synthesizes video frames in a
non-autoregressive manner. To achieve maximum flexibility, our video decoder
perceives temporal information through normalized frame indexes, which enables
it to synthesize arbitrary video sub-clips with predetermined starting and
ending frame indexes. Moreover, a novel adversarial loss is introduced to
improve the global coherence and local realism between the synthesized video
frames. Finally, we employ a diffusion-based video generator to fit the global
features output by the video encoder for video generation. Extensive
experimental results demonstrate the effectiveness and efficiency of our
proposed method, and new state-of-the-art results have been achieved on
multiple benchmarks.
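The normalized frame-index conditioning described above can be sketched in a few lines: each requested frame is addressed by its index scaled into [0, 1], so the decoder can be asked for any sub-clip non-autoregressively. The function name and signature here are illustrative, not the paper's API:

```python
import numpy as np

def normalized_frame_indexes(start, end, total):
    """Toy sketch of normalized frame-index conditioning (illustrative).
    Maps the absolute indexes of a requested sub-clip [start, end] into
    [0, 1] by dividing by the last index of the full clip, giving the
    decoder one normalized temporal position per frame to synthesize."""
    return np.arange(start, end + 1) / (total - 1)
```

Because the conditioning depends only on these normalized positions, requesting frames 2..3 of a 4-frame clip yields the same positions the decoder would see for the tail of the full clip, which is what enables arbitrary sub-clip synthesis.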