Context-aware Synthesis for Video Frame Interpolation
Video frame interpolation algorithms typically estimate optical flow or its
variations and then use it to guide the synthesis of an intermediate frame
between two consecutive original frames. To handle challenges like occlusion,
bidirectional flow between the two input frames is often estimated and used to
warp and blend the input frames. However, effectively blending the two
warped frames remains a challenging problem. This paper presents a
context-aware synthesis approach that warps not only the input frames but also
their pixel-wise contextual information and uses them to interpolate a
high-quality intermediate frame. Specifically, we first use a pre-trained
neural network to extract per-pixel contextual information for input frames. We
then employ a state-of-the-art optical flow algorithm to estimate bidirectional
flow between them and pre-warp both input frames and their context maps.
Finally, unlike common approaches that blend the pre-warped frames, our method
feeds them and their context maps to a video frame synthesis neural network to
produce the interpolated frame in a context-aware fashion. Our neural network
is fully convolutional and is trained end to end. Our experiments show that our
method can handle challenging scenarios such as occlusion and large motion and
outperforms representative state-of-the-art approaches.
Comment: CVPR 2018, http://graphics.cs.pdx.edu/project/ctxsy
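The pre-warping step described in this abstract can be sketched roughly as follows; the nearest-neighbor sampling, toy flow field, and variable names are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

def backward_warp(img, flow):
    """Warp img (H, W, C) by sampling at positions displaced by
    flow (H, W, 2); nearest-neighbor sampling for brevity."""
    h, w = img.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

# Toy frame and a per-pixel context map (stand-in for features
# extracted by a pre-trained network).
frame = np.arange(16, dtype=float).reshape(4, 4, 1)
context = frame * 0.5

# Uniform flow shifting everything one pixel to the right.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0

# Both the frame and its context map are pre-warped with the same flow;
# in the paper they are then fed jointly to the synthesis network.
warped_frame = backward_warp(frame, flow)
warped_context = backward_warp(context, flow)
synth_input = np.concatenate([warped_frame, warped_context], axis=-1)
print(synth_input.shape)  # (4, 4, 2)
```

The key point the abstract makes is that the context map travels through the same warp as the pixels, so the synthesis network sees aligned contextual features rather than raw blended colors.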
Understanding Deformable Alignment in Video Super-Resolution
Deformable convolution, originally proposed for the adaptation to geometric
variations of objects, has recently shown compelling performance in aligning
multiple frames and is increasingly adopted for video super-resolution. Despite
its remarkable performance, its underlying mechanism for alignment remains
unclear. In this study, we carefully investigate the relation between
deformable alignment and the classic flow-based alignment. We show that
deformable convolution can be decomposed into a combination of spatial warping
and convolution. This decomposition reveals the commonality of deformable
alignment and flow-based alignment in formulation, but with a key difference in
their offset diversity. We further demonstrate through experiments that the
increased diversity in deformable alignment yields better-aligned features, and
hence significantly improves the quality of video super-resolution output.
Based on our observations, we propose an offset-fidelity loss that guides the
offset learning with optical flow. Experiments show that our loss successfully
avoids the overflow of offsets and alleviates the instability problem of
deformable alignment. Aside from the contributions to deformable alignment, our
formulation inspires a more flexible approach to introduce offset diversity to
flow-based alignment, improving its performance.
Comment: Tech report, 15 pages, 19 figures
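The decomposition of deformable convolution into spatial warping plus convolution can be illustrated with a minimal sketch; the integer offsets, a single shared offset per kernel tap, and nearest-neighbor sampling are simplifying assumptions, since real deformable convolution learns fractional, per-pixel offsets:

```python
import numpy as np

def warp_nn(x, offset):
    """Spatially warp x (H, W) by one integer offset (dy, dx),
    nearest-neighbor, with border clipping."""
    h, w = x.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sy = np.clip(ys + offset[0], 0, h - 1)
    sx = np.clip(xs + offset[1], 0, w - 1)
    return x[sy, sx]

def deformable_conv_as_warp_plus_conv(x, weights, offsets):
    """Deformable convolution decomposed as: (1) warp the input once
    per kernel tap using that tap's offset, (2) weight and sum the
    warped maps. Here every pixel shares the offset per tap, i.e.
    offset diversity of one, which is exactly flow-based alignment."""
    out = np.zeros_like(x, dtype=float)
    for w_k, off in zip(weights, offsets):
        out += w_k * warp_nn(x, off)  # warp, then convolve
    return out

x = np.arange(9, dtype=float).reshape(3, 3)
# A three-tap kernel whose taps are displaced by (hypothetical) offsets.
weights = [0.25, 0.5, 0.25]
offsets = [(0, -1), (0, 0), (0, 1)]
y = deformable_conv_as_warp_plus_conv(x, weights, offsets)
# Center pixel: 0.25*3 + 0.5*4 + 0.25*5 = 4.0
print(y[1, 1])
```

Allowing the offsets to vary per pixel (rather than sharing one flow field) is the "offset diversity" that the study identifies as the key difference from flow-based alignment.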
Uncertainty-Guided Spatial Pruning Architecture for Efficient Frame Interpolation
Video frame interpolation (VFI) models apply the convolution operation
at all spatial locations, leading to redundant computation in regions with easy
motion. Dynamic spatial pruning can skip this redundant computation, but
without supervision it cannot properly identify easy regions in VFI
tasks. In this paper, we develop an Uncertainty-Guided Spatial
Pruning (UGSP) architecture that dynamically skips redundant computation for
efficient frame interpolation. Specifically, pixels with low uncertainty indicate
easy regions, where computation can be reduced without introducing undesirable
visual artifacts. Therefore, we utilize uncertainty-generated mask labels to
guide UGSP in properly locating easy regions. Furthermore, we propose a
self-contrast training strategy that leverages an auxiliary non-pruning branch
to improve the performance of our UGSP. Extensive experiments show that UGSP
maintains performance while reducing FLOPs by 34%/52%/30% compared to a baseline
without pruning on the Vimeo90K/UCF101/Middlebury datasets. In addition, our method
achieves state-of-the-art performance with lower FLOPs on multiple benchmarks.
Comment: ACM Multimedia 202
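The core pruning idea, skipping computation where uncertainty is low, might be sketched as follows; the threshold, toy uncertainty map, and `heavy_op` stand-in are hypothetical illustrations, not the UGSP architecture itself:

```python
import numpy as np

def prune_by_uncertainty(feat, uncertainty, thresh, heavy_op):
    """Apply the expensive operation only where uncertainty is high;
    in easy (low-uncertainty) regions, pass features through unchanged."""
    hard = uncertainty > thresh       # mask of "hard" locations
    out = feat.copy()                 # easy regions: computation skipped
    out[hard] = heavy_op(feat[hard])  # hard regions: full computation
    fraction_pruned = 1.0 - hard.mean()
    return out, fraction_pruned

feat = np.linspace(0.0, 1.0, 16).reshape(4, 4)

# Toy per-pixel uncertainty: high in the right half of the frame,
# as if that region contained difficult motion.
uncertainty = np.zeros((4, 4))
uncertainty[:, 2:] = 0.9

out, pruned = prune_by_uncertainty(feat, uncertainty, 0.5, lambda v: v * 2.0)
print(pruned)  # 0.5: half the locations were skipped
```

The fraction of pruned locations is where the reported FLOP savings come from: the larger the easy region, the more computation the mask lets the model skip.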
Shallow Features Guide Unsupervised Domain Adaptation for Semantic Segmentation at Class Boundaries
Although deep neural networks have achieved remarkable results for the task of semantic segmentation, they usually fail to generalize towards new domains, especially when performing synthetic-to-real adaptation. Such domain shift is particularly noticeable along class boundaries, invalidating one of the main goals of semantic segmentation that consists in obtaining sharp segmentation masks.
In this work, we specifically address this core problem in the context of Unsupervised Domain Adaptation and present a novel low-level adaptation strategy that allows us to obtain sharp predictions. Moreover, inspired by recent self-training techniques, we introduce an effective data augmentation that alleviates the noise typically present at semantic boundaries when employing pseudo-labels for self-training. Our contributions can be easily integrated into other popular adaptation frameworks, and extensive experiments show that they effectively improve performance along class boundaries.
Task Agnostic Restoration of Natural Video Dynamics
In many video restoration/translation tasks, image processing operations are
na\"ively extended to the video domain by processing each frame independently,
disregarding the temporal connection of the video frames. This disregard for
the temporal connection often leads to severe temporal inconsistencies.
State-Of-The-Art (SOTA) techniques that address these inconsistencies rely on
the availability of unprocessed videos to implicitly siphon consistent video
dynamics and restore the temporal consistency of frame-wise processed videos,
which often jeopardizes the translation effect. We propose a
general framework for this task that learns to infer and utilize consistent
motion dynamics from inconsistent videos to mitigate the temporal flicker while
preserving the perceptual quality for both the temporally neighboring and
relatively distant frames without requiring the raw videos at test time. The
proposed framework produces SOTA results on two benchmark datasets, DAVIS and
videvo.net, processed by numerous image processing applications. The code and
the trained models are available at
\url{https://github.com/MKashifAli/TARONVD}
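Temporal flicker of frame-wise processed videos is commonly quantified by differences between consecutive frames; the toy metric below omits the flow-based alignment that real warped-difference metrics use, so it only applies cleanly to static content:

```python
import numpy as np

def temporal_inconsistency(frames):
    """Mean absolute difference between consecutive frames; a crude
    stand-in for warped-difference metrics, which first align frames
    with optical flow before differencing."""
    diffs = [np.abs(a - b).mean() for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs))

base = np.ones((4, 4))
# A temporally consistent clip vs. one with frame-wise brightness flicker,
# mimicking an image operator applied independently per frame.
steady = [base * 0.5 for _ in range(4)]
flicker = [base * (0.5 + 0.2 * (i % 2)) for i in range(4)]

print(temporal_inconsistency(steady))                 # 0.0
print(round(temporal_inconsistency(flicker), 3))      # 0.2
```

A framework like the one described above aims to drive such a flicker measure down on processed videos while leaving the per-frame translation effect intact.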
Model based methods for locating, enhancing and recognising low resolution objects in video
Visual perception is our most important sense, enabling us to detect and recognise objects even in low-detail video scenes. While humans are able to perform such object detection and recognition tasks reliably, most computer vision algorithms struggle with wide-angle surveillance videos, where automatic processing is difficult due to low resolution and poorly detailed objects. Additional problems arise from varying pose and lighting conditions as well as non-cooperative subjects. All these constraints pose problems for automatic scene interpretation of surveillance video, including object detection, tracking and object recognition.
Therefore, the aim of this thesis is to detect, enhance and recognise objects by incorporating a priori information and by using model-based approaches. Motivated by the increasing demand for automatic methods for object detection, enhancement and recognition in video surveillance, different aspects of the video processing task are investigated, with a focus on human faces. In particular, the challenge of fully automatic face pose and shape estimation by fitting a deformable 3D generic face model under varying pose and lighting conditions is tackled. Principal Component Analysis (PCA) is utilised to build an appearance model that is then used within a particle filter based approach to fit the 3D face mask to the image. This recovers face pose and person-specific shape information simultaneously. Experiments demonstrate its use at different resolutions and under varying pose and lighting conditions.
Following that, a combined tracking and super resolution approach enhances the quality of poor-detail video objects. A 3D object mask is subdivided such that every mask triangle is smaller than a pixel when projected into the image, and the mask is then used for model based tracking. The mask subdivision then allows for super resolution of the object by combining several video frames.
This approach achieves better results than traditional super resolution methods without the use of interpolation or deblurring.
Lastly, object recognition is performed in two different ways. The first recognition method is applied to characters and used for license plate recognition. A novel character model is proposed to create different appearances, which are then matched with the image of unknown characters for recognition. This allows for simultaneous character segmentation and recognition, and high recognition rates are achieved for low resolution characters down to only five pixels in size. While this approach is only feasible for objects with a limited number of different appearances, like characters, the second recognition method is applicable to any object, including human faces. Therefore, a generic 3D face model is automatically fitted to an image of a human face and recognition is performed on a mask level rather than image level. This approach requires neither an initial pose estimation nor the selection of feature points; the face alignment is provided implicitly by the mask fitting process.
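A minimal sketch of building a PCA appearance model, as used for the mask-fitting step above; the toy data and its dimensionality are illustrative, whereas the thesis applies this to vectorized face appearances:

```python
import numpy as np

def pca_appearance_model(samples, k):
    """Build a PCA appearance model: a mean appearance plus the top-k
    principal components. samples: (n, d) rows of vectorized appearances."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # SVD of the centered data; rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(sample, mean, components):
    """Express a new appearance as coefficients in the model space."""
    return components @ (sample - mean)

rng = np.random.default_rng(0)
# 20 toy 6-D "appearance" vectors varying mostly along one direction.
direction = np.array([1.0, 0, 0, 0, 0, 0])
samples = rng.normal(0, 0.01, (20, 6)) + rng.normal(0, 1.0, (20, 1)) * direction

mean, comps = pca_appearance_model(samples, k=1)
coeff = project(samples[0], mean, comps)
print(comps.shape, coeff.shape)  # (1, 6) (1,)
```

Within a particle filter, each particle's pose hypothesis would be scored by how well the model-space reconstruction matches the observed image region.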