Video Super-resolution with Temporal Group Attention
Video super-resolution, which aims at producing a high-resolution video from
its corresponding low-resolution version, has recently drawn increasing
attention. In this work, we propose a novel method that can effectively
incorporate temporal information in a hierarchical way. The input sequence is
divided into several groups, with each one corresponding to a kind of frame
rate. These groups provide complementary information to recover missing details
in the reference frame, which is further integrated with an attention module
and a deep intra-group fusion module. In addition, a fast spatial alignment is
proposed to handle videos with large motion. Extensive results demonstrate the capability of the proposed model in handling videos with various motions. It achieves favorable performance against state-of-the-art methods on several benchmark datasets.
Comment: CVPR 202
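As a toy illustration of the hierarchical grouping idea above, the sketch below splits a 7-frame window into groups whose temporal strides mimic different frame rates. The exact grouping rule used here (the reference frame paired with its neighbours at distance r) is our assumption for illustration, not necessarily the paper's; `group_by_frame_rate` is a hypothetical helper.

```python
import numpy as np

def group_by_frame_rate(frames, n_groups=3):
    """Split a frame sequence into groups, each mimicking a frame rate.

    Assumed grouping rule: for a sequence of 2*n_groups + 1 frames centred
    on the reference, group r pairs the reference frame with its neighbours
    at temporal distance r (larger r = lower effective frame rate).
    """
    frames = np.asarray(frames)
    ref = len(frames) // 2  # centre frame is the reference
    groups = []
    for r in range(1, n_groups + 1):
        # neighbours at stride r around the reference
        groups.append(frames[[ref - r, ref, ref + r]])
    return groups

# 7 dummy frames, each a 2x2 "image" filled with its index
seq = [np.full((2, 2), i) for i in range(7)]
groups = group_by_frame_rate(seq)
print([g[:, 0, 0].tolist() for g in groups])  # [[2, 3, 4], [1, 3, 5], [0, 3, 6]]
```

Each group then sees the reference frame alongside progressively more distant (and thus complementary) temporal context, which is what the attention and intra-group fusion modules consume.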
Group-based Bi-Directional Recurrent Wavelet Neural Networks for Video Super-Resolution
Video super-resolution (VSR) aims to estimate a high-resolution (HR) frame from low-resolution (LR) frames. The key challenge for VSR lies in the effective exploitation of intra-frame spatial correlation and temporal dependency between consecutive frames. However, most previous methods treat different types of spatial features identically and extract spatial and temporal features with separate modules, which makes it difficult to obtain meaningful information and enhance fine details. In VSR, there are three types of temporal modeling frameworks: 2D convolutional neural networks (CNN), 3D CNN, and recurrent neural networks (RNN). Among them, the RNN-based approach is well suited to sequential data, so SR performance can be greatly improved by using the hidden states of adjacent frames. However, at each time step of a recurrent structure, previous RNN-based works make only restricted use of neighboring features. Since the range of motion accessible per time step is narrow, their ability to restore missing details under dynamic or large motion remains limited. In this paper, we propose a group-based bi-directional recurrent wavelet neural network (GBR-WNN) to exploit sequential data and spatio-temporal information effectively for VSR. The proposed group-based bi-directional RNN (GBR) temporal modeling framework is built on a well-structured process over a group of pictures (GOP). We also propose a temporal wavelet attention (TWA) module, in which attention is applied to both spatial and temporal features. Experimental results demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods in both quantitative and qualitative evaluations.
Comment: 10 pages, 5 figure
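The wavelet side of the TWA module can be sketched with a one-level Haar decomposition: a feature map is split into four subbands, and each subband receives an attention weight. The energy-softmax weighting below is our stand-in assumption for the paper's learned attention; only the Haar transform itself is standard.

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar decomposition of an even-sized feature map.
    Returns the four subbands (LL, LH, HL, HH)."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # low-frequency approximation
    lh = (a + b - c - d) / 4.0   # horizontal detail
    hl = (a - b + c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

def wavelet_attention(x):
    """Toy attention: weight each subband by its softmaxed energy — an
    assumption standing in for the paper's learned TWA module."""
    bands = haar2d(x)
    energy = np.array([float((b ** 2).mean()) for b in bands])
    w = np.exp(energy - energy.max())
    w /= w.sum()
    return [wi * b for wi, b in zip(w, bands)], w

x = np.arange(16.0).reshape(4, 4)
_, w = wavelet_attention(x)
print(np.round(w, 3))  # the smooth LL band dominates for this ramp image
```

Note that the four subbands sum back to the even-grid samples of the input, so no information is discarded before the weighting.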
EDVR: Video Restoration with Enhanced Deformable Convolutional Networks
Video restoration tasks, including super-resolution, deblurring, etc., are drawing increasing attention in the computer vision community. A challenging
benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark
challenges existing methods from two aspects: (1) how to align multiple frames
given large motions, and (2) how to effectively fuse different frames with
diverse motion and blur. In this work, we propose a novel Video Restoration
framework with Enhanced Deformable networks, termed EDVR, to address these
challenges. First, to handle large motions, we devise a Pyramid, Cascading and
Deformable (PCD) alignment module, in which frame alignment is done at the
feature level using deformable convolutions in a coarse-to-fine manner. Second,
we propose a Temporal and Spatial Attention (TSA) fusion module, in which
attention is applied both temporally and spatially, so as to emphasize
important features for subsequent restoration. Thanks to these modules, our EDVR wins first place and outperforms the second-place entry by a large margin in all four tracks of the NTIRE19 video restoration and enhancement challenges. EDVR also demonstrates superior performance to state-of-the-art published methods on video super-resolution and deblurring. The code is available at https://github.com/xinntao/EDVR.
Comment: To appear in CVPR 2019 Workshop. The winners in all four tracks in the NTIRE 2019 video restoration and enhancement challenges. Project page: https://xinntao.github.io/projects/EDVR , Code: https://github.com/xinntao/EDV
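The temporal half of TSA fusion can be sketched as follows: each neighbouring frame's features are weighted by a sigmoid of their per-pixel correlation with the reference frame before fusion. We assume identity embeddings for brevity; EDVR uses learned convolutional embeddings, and this sketch omits the spatial-attention stage entirely.

```python
import numpy as np

def temporal_attention_fuse(feats, ref_idx):
    """Sketch of the temporal stage of EDVR-style TSA fusion: weight each
    frame's feature map by a sigmoid of its per-pixel correlation with the
    reference frame (identity embeddings assumed)."""
    feats = np.asarray(feats, dtype=float)   # (T, C, H, W)
    ref = feats[ref_idx]
    # per-pixel correlation over the channel dimension
    corr = (feats * ref).sum(axis=1)         # (T, H, W)
    attn = 1.0 / (1.0 + np.exp(-corr))       # sigmoid -> soft frame weights
    return feats * attn[:, None]             # broadcast weights over channels

T, C, H, W = 3, 2, 4, 4
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, C, H, W))
fused = temporal_attention_fuse(feats, ref_idx=1)
print(fused.shape)  # (3, 2, 4, 4)
```

Frames that correlate well with the reference at a given pixel keep most of their signal there; poorly aligned or blurred frames are suppressed before the fusion convolution.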
3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks
In video super-resolution, the spatio-temporal coherence between and among frames must be exploited appropriately for accurate prediction of the high-resolution frames. Although 2D convolutional neural networks (CNNs) are powerful in modelling images, 3D-CNNs are more suitable for spatio-temporal feature extraction because they can preserve temporal information. To this end, we propose an effective 3D-CNN for video super-resolution, called 3DSRnet, that does not require motion alignment as preprocessing. Our 3DSRnet maintains the temporal depth of spatio-temporal feature maps to maximally capture the temporally nonlinear characteristics between low- and high-resolution frames, and adopts residual learning in conjunction with sub-pixel outputs. It outperforms the best state-of-the-art method by an average of 0.45 and 0.36 dB in PSNR for scales 3 and 4, respectively, on the Vidset4 benchmark. Our 3DSRnet is also the first to deal with the performance drop due to scene change, which is important in practice but has not been previously considered.
Comment: Extension of our paper accepted at ICIP 201
Learning Parallax Attention for Stereo Image Super-Resolution
Stereo image pairs can be used to improve the performance of super-resolution (SR) since additional information is provided from a second viewpoint. However, it is challenging to incorporate this information for SR since disparities between stereo images vary significantly. In this paper, we propose a parallax-attention stereo super-resolution network (PASSRnet) to integrate the information from a stereo image pair for SR. Specifically, we introduce a parallax-attention mechanism with a global receptive field along the epipolar line to handle stereo images with large disparity variations. We also propose a new dataset for stereo image SR, namely Flickr1024, the largest to date. Extensive experiments demonstrate that the parallax-attention mechanism can capture correspondence between stereo images to improve SR performance at a small computational and memory cost. Comparative results show that our PASSRnet achieves state-of-the-art performance on the Middlebury, KITTI 2012 and KITTI 2015 datasets.
Comment: To appear in CVPR 201
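For rectified stereo, the epipolar line of a left-image pixel is the same row of the right image, so parallax attention reduces to a row-wise attention. The toy sketch below makes that concrete; identity query/key projections are assumed for brevity (PASSRnet learns these, and uses the attention map for more than feature warping).

```python
import numpy as np

def parallax_attention(left, right):
    """Toy parallax attention: for each pixel of the left feature map,
    attend over all pixels in the SAME ROW of the right map (the epipolar
    line for rectified stereo) and return the attention-weighted right
    features. Identity query/key projections are assumed."""
    C, H, W = left.shape
    out = np.empty_like(left)
    for y in range(H):
        q = left[:, y, :].T            # (W, C) queries for this row
        k = right[:, y, :]             # (C, W) keys along the epipolar line
        score = q @ k                  # (W, W) left-to-right correlation
        score -= score.max(axis=1, keepdims=True)
        attn = np.exp(score)
        attn /= attn.sum(axis=1, keepdims=True)  # softmax over disparities
        out[:, y, :] = (attn @ k.T).T  # weighted sum of right-view features
    return out

rng = np.random.default_rng(1)
left = rng.standard_normal((4, 3, 5))
right = rng.standard_normal((4, 3, 5))
warped = parallax_attention(left, right)
print(warped.shape)  # (4, 3, 5)
```

Because the softmax runs over a whole row, the mechanism has a global receptive field along the epipolar line and needs no fixed maximum disparity.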
Temporal Gaussian Mixture Layer for Videos
We introduce a new convolutional layer named the Temporal Gaussian Mixture
(TGM) layer and present how it can be used to efficiently capture longer-term
temporal information in continuous activity videos. The TGM layer is a temporal
convolutional layer governed by a much smaller set of parameters (e.g.,
location/variance of Gaussians) that are fully differentiable. We present our
fully convolutional video models with multiple TGM layers for activity
detection. Extensive experiments on multiple datasets, including Charades and MultiTHUMOS, confirm the effectiveness of TGM layers, which significantly outperform the state of the art.
Comment: ICML 201
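The parameter economy of a TGM layer can be illustrated directly: each temporal filter is fully described by a Gaussian centre and width, from which the actual kernel over the time axis is generated. The normalisation below (each filter sums to 1) is our simplifying assumption; the paper's exact parameterisation may differ.

```python
import numpy as np

def tgm_kernel(centers, sigmas, length):
    """Build a Temporal Gaussian Mixture-style kernel bank: each temporal
    filter is a Gaussian over `length` time steps, parameterised only by a
    centre and a width, and normalised to sum to 1."""
    t = np.arange(length, dtype=float)                 # (L,) time axis
    centers = np.asarray(centers, dtype=float)[:, None]
    sigmas = np.asarray(sigmas, dtype=float)[:, None]
    k = np.exp(-((t - centers) ** 2) / (2.0 * sigmas ** 2))
    return k / k.sum(axis=1, keepdims=True)            # (M, L) filters

# Three temporal filters over a window of 9 steps: 2 parameters each,
# versus 9 free weights each for an ordinary temporal convolution.
K = tgm_kernel(centers=[1.0, 4.0, 7.0], sigmas=[1.0, 2.0, 1.0], length=9)
print(K.shape)               # (3, 9)

# Applying a filter is just a temporal inner product
signal = np.arange(9, dtype=float)
print(float(K[1] @ signal))  # 4.0: symmetric weights centred at t=4
```

Since the kernel is a differentiable function of the centres and widths, gradients flow to those two scalars per filter, which is what keeps the layer so compact.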
Down-Scaling with Learned Kernels in Multi-Scale Deep Neural Networks for Non-Uniform Single Image Deblurring
Multi-scale approaches have been used for blind image/video deblurring problems, yielding excellent performance for both conventional and recent deep-learning-based state-of-the-art methods. Bicubic down-sampling is a typical choice in multi-scale approaches to reduce the spatial dimension after filtering with a fixed kernel. However, this fixed kernel may be sub-optimal, since it may destroy information important for reliable deblurring, such as strong edges. We propose convolutional neural network (CNN)-based down-scaling methods for multi-scale deep-learning-based non-uniform single-image deblurring. We argue that our CNN-based down-scaling effectively reduces the spatial dimension of the original image, while its learned multi-channel kernels preserve the details necessary for deblurring. For each scale, we adopt RCAN (Residual Channel Attention Networks) as a backbone network to further improve performance. Our proposed method yielded state-of-the-art performance on the GoPro dataset by a large margin, achieving 2.59 dB higher PSNR than the current state-of-the-art method by Tao et al. Our proposed CNN-based down-scaling was the key factor in this performance: without it, our network's performance decreased by 1.98 dB. The same networks trained on the GoPro set were also evaluated on the large-scale Su dataset, where our proposed method yielded 1.15 dB better PSNR than Tao et al.'s method. Qualitative comparisons on the Lai dataset also confirmed the superior performance of our proposed method over other state-of-the-art methods.
Comment: 10 pages, 7 figures, 4 table
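The mechanics of down-scaling by strided convolution can be shown with a fixed box kernel standing in for both cases: with fixed weights it behaves like classical filtered down-sampling, while in the paper's setting the kernel weights would be learned (and multi-channel) so that edge information useful for deblurring survives the reduction. The helper below is illustrative only.

```python
import numpy as np

def downscale_conv(img, kernel, stride=2):
    """Downscale a single-channel image by a strided 2D correlation.
    A fixed averaging kernel mimics classical down-sampling; a learned
    kernel could instead preserve deblurring-relevant detail."""
    kh, kw = kernel.shape
    H, W = img.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh,
                        j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
box = np.full((2, 2), 0.25)          # fixed 2x2 box kernel (fixed-filter case)
print(downscale_conv(img, box))      # [[ 2.5  4.5] [10.5 12.5]]
```

Swapping `box` for trained weights is exactly the degree of freedom the paper exploits: the spatial dimension still halves, but what each output pixel summarises becomes task-driven.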
VORNet: Spatio-temporally Consistent Video Inpainting for Object Removal
Video object removal is a challenging task in video processing that often
requires massive human efforts. Given the mask of the foreground object in each
frame, the goal is to complete (inpaint) the object region and generate a video
without the target object. While deep-learning-based methods have recently achieved great success on the image inpainting task, they often produce inconsistent results between frames when applied to videos. In this work, we propose a novel learning-based Video Object Removal Network (VORNet) to solve the video object removal task in a spatio-temporally consistent manner by combining optical flow warping with an image-based inpainting model. Experiments are conducted on our Synthesized Video Object Removal (SVOR) dataset, built on the YouTube-VOS video segmentation dataset, and both objective and subjective evaluations demonstrate that our VORNet generates more spatially and temporally consistent videos than existing methods.
Comment: Accepted to CVPRW 201
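The flow-warping half of such a pipeline can be sketched as backward warping: each pixel of the current frame samples the previous frame (or its inpainted content) at the location the flow points to. Nearest-neighbour sampling is assumed here for brevity; real systems use bilinear interpolation and a validity mask.

```python
import numpy as np

def warp_backward(img, flow):
    """Backward-warp `img` with a per-pixel flow field (nearest-neighbour
    sampling). Output pixel (y, x) reads img[y + flow_y, x + flow_x];
    samples falling outside the frame are left at zero."""
    H, W = img.shape
    out = np.zeros_like(img)
    for y in range(H):
        for x in range(W):
            sy = int(round(y + flow[0, y, x]))
            sx = int(round(x + flow[1, y, x]))
            if 0 <= sy < H and 0 <= sx < W:
                out[y, x] = img[sy, sx]
    return out

img = np.arange(9, dtype=float).reshape(3, 3)
flow = np.zeros((2, 3, 3))
flow[1] = 1.0                    # every pixel reads one column to the right
print(warp_backward(img, flow))  # last column has no source -> stays 0
```

The zero-filled holes are where warping alone cannot help; those are exactly the regions an image-based inpainting model must fill, which is the division of labour the abstract describes.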
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models have been proposed, video fixation detection still requires more exploration.
Different from image analysis, motion and temporal information is a crucial
factor affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider inter-frame motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
Super-Resolution via Deep Learning
The recent phenomenal interest in convolutional neural networks (CNNs) must
have made it inevitable for the super-resolution (SR) community to explore its
potential. The response has been immense and in the last three years, since the advent of the pioneering work, so many works have appeared that a comprehensive survey is warranted. This paper surveys the SR literature in the context of deep learning. We focus on three important aspects of multimedia, namely image, video and multi-dimensional data, especially depth maps. In each case, the relevant benchmarks are first introduced in the form of datasets and state-of-the-art SR methods, excluding deep learning. Next is a detailed analysis of the individual works, each including a short description of the method and a critique of the results with special reference to the benchmarking done. This is followed by a minimal overall benchmarking in the form of a comparison on a common dataset, relying on the results reported in the various works
- …