Perceptual Video Super Resolution with Enhanced Temporal Consistency
With the advent of perceptual loss functions, new possibilities in
super-resolution have emerged, and we currently have models that successfully
generate near-photorealistic high-resolution images from their low-resolution
observations. Up to now, however, such approaches have been exclusively limited
to single image super-resolution. The application of perceptual loss functions
to video processing still entails several challenges, mostly related to the
lack of temporal consistency of the generated images, i.e., flickering
artifacts. In this work, we present a novel adversarial recurrent network for
video upscaling that is able to produce realistic textures in a temporally
consistent way. The proposed architecture naturally leverages information from
previous frames due to its recurrent design, i.e., the input to the
generator is composed of the low-resolution image and, additionally, the warped
output of the network at the previous step. Together with a video
discriminator, we also propose additional loss functions to further reinforce
temporal consistency in the generated sequences. The experimental validation of
our algorithm shows the effectiveness of our approach, which obtains images with
high perceptual quality and improved temporal consistency.
Comment: Major revision and improvement of the manuscript: new network
architecture, new loss function, and extended experiments.
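
As an illustration of the recurrent input described in this abstract, below is a minimal PyTorch-style sketch; the warp helper, the source of the flow field, and the generator itself are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with a dense pixel-displacement flow (B,2,H,W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img.device)  # (2,H,W) pixel grid
    coords = base.unsqueeze(0) + flow                           # displaced coordinates
    # normalize to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def recurrent_step(generator, lr_t, prev_sr, flow_hr, scale=4):
    """One step: the generator sees the current LR frame plus the warped
    previous output, folded back to LR resolution (space-to-depth)."""
    prev_warped = warp(prev_sr, flow_hr)                # align previous SR output to frame t
    prev_folded = F.pixel_unshuffle(prev_warped, scale) # (B, C*scale^2, h, w)
    return generator(torch.cat([lr_t, prev_folded], dim=1))
```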
EDVR: Video Restoration with Enhanced Deformable Convolutional Networks
Video restoration tasks, including super-resolution, deblurring, etc., are
drawing increasing attention in the computer vision community. A challenging
benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark
challenges existing methods from two aspects: (1) how to align multiple frames
given large motions, and (2) how to effectively fuse different frames with
diverse motion and blur. In this work, we propose a novel Video Restoration
framework with Enhanced Deformable networks, termed EDVR, to address these
challenges. First, to handle large motions, we devise a Pyramid, Cascading and
Deformable (PCD) alignment module, in which frame alignment is done at the
feature level using deformable convolutions in a coarse-to-fine manner. Second,
we propose a Temporal and Spatial Attention (TSA) fusion module, in which
attention is applied both temporally and spatially, so as to emphasize
important features for subsequent restoration. Thanks to these modules, our
EDVR wins first place and outperforms the second-best entry by a large margin
in all four tracks of the NTIRE 2019 video restoration and enhancement challenges.
EDVR also demonstrates superior performance to state-of-the-art published
methods on video super-resolution and deblurring. The code is available at
https://github.com/xinntao/EDVR.
Comment: To appear in CVPR 2019 Workshop. Winner of all four tracks in the
NTIRE 2019 video restoration and enhancement challenges. Project page:
https://xinntao.github.io/projects/EDVR , Code: https://github.com/xinntao/EDVR
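
The temporal half of TSA fusion can be sketched as follows, assuming PyTorch; layer shapes and names here are guesses, and the official repository above is the authoritative reference.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Weights each aligned frame by its per-pixel similarity to the reference."""
    def __init__(self, nf=64, nframes=5, center=2):
        super().__init__()
        self.center = center
        self.emb_ref = nn.Conv2d(nf, nf, 3, padding=1)  # embeds the reference frame
        self.emb_nbr = nn.Conv2d(nf, nf, 3, padding=1)  # embeds each neighbor frame
        self.fuse = nn.Conv2d(nframes * nf, nf, 1)      # 1x1 fusion conv

    def forward(self, feats):                 # feats: (B, N, C, H, W) aligned features
        b, n, c, h, w = feats.shape
        ref = self.emb_ref(feats[:, self.center])
        weighted = []
        for i in range(n):
            emb = self.emb_nbr(feats[:, i])
            # per-pixel correlation with the reference -> temporal attention map
            attn = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))
            weighted.append(feats[:, i] * attn)
        return self.fuse(torch.cat(weighted, dim=1))
```

Spatial attention would then be applied on the fused features; it is omitted here for brevity.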
Adapting Image Super-Resolution State-of-the-arts and Learning Multi-model Ensemble for Video Super-Resolution
Recently, image super-resolution has been widely studied and achieved
significant progress by leveraging the power of deep convolutional neural
networks. However, there has been limited advancement in video super-resolution
(VSR) due to the complex temporal patterns in videos. In this paper, we
investigate how to adapt state-of-the-art methods of image super-resolution for
video super-resolution. The proposed adaptation is straightforward:
information among successive frames is well exploited, while the overhead on
the original image super-resolution method is negligible. Furthermore, we
propose a learning-based method to ensemble the outputs from multiple
super-resolution models. Our methods show superior performance and rank second
in the NTIRE 2019 Video Super-Resolution Challenge Track 1.
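
One plausible form of the learning-based ensemble is sketched below, assuming a small CNN that predicts per-pixel softmax weights over the outputs of K super-resolution models; the paper's actual design may differ.

```python
import torch
import torch.nn as nn

class PixelwiseEnsemble(nn.Module):
    """Fuses K model outputs with learned per-pixel weights."""
    def __init__(self, k_models, nf=32):
        super().__init__()
        self.k = k_models
        self.weight_net = nn.Sequential(             # sees all K outputs stacked
            nn.Conv2d(3 * k_models, nf, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nf, k_models, 3, padding=1),    # one weight map per model
        )

    def forward(self, outputs):                       # outputs: list of K (B,3,H,W) images
        stacked = torch.cat(outputs, dim=1)           # (B, 3K, H, W)
        w = torch.softmax(self.weight_net(stacked), dim=1)  # (B, K, H, W)
        return sum(outputs[i] * w[:, i:i + 1] for i in range(self.k))
```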
Fast Spatio-Temporal Residual Network for Video Super-Resolution
Recently, deep learning based video super-resolution (SR) methods have
achieved promising performance. To simultaneously exploit the spatial and
temporal information of videos, employing 3-dimensional (3D) convolutions is a
natural approach. However, directly utilizing 3D convolutions may lead to
excessively high computational complexity, which restricts the depth of video SR
models and thus undermines performance. In this paper, we present a novel
fast spatio-temporal residual network (FSTRN) to adopt 3D convolutions for the
video SR task in order to enhance the performance while maintaining a low
computational load. Specifically, we propose a fast spatio-temporal residual
block (FRB) that divides each 3D filter into the product of two 3D filters of
considerably lower dimensions. Furthermore, we design a cross-space
residual learning scheme that directly links the low-resolution space and the
high-resolution space, which can greatly relieve the computational burden on
the feature fusion and up-scaling parts. Extensive evaluations and comparisons
on benchmark datasets validate the strengths of the proposed approach and
demonstrate that the proposed network significantly outperforms the current
state-of-the-art methods.
Comment: To appear in CVPR 2019.
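
A minimal sketch of the factorization behind the FRB, assuming the k x k x k filter is split into a 1 x k x k (spatial) and a k x 1 x 1 (temporal) convolution; the ordering and activations are assumptions based on the abstract.

```python
import torch.nn as nn

class FastResidualBlock(nn.Module):
    """Residual block with a factorized spatio-temporal 3D convolution."""
    def __init__(self, nf=64, k=3):
        super().__init__()
        p = k // 2
        self.spatial = nn.Conv3d(nf, nf, (1, k, k), padding=(0, p, p))  # 1 x k x k
        self.temporal = nn.Conv3d(nf, nf, (k, 1, 1), padding=(p, 0, 0)) # k x 1 x 1
        self.act = nn.PReLU()

    def forward(self, x):                  # x: (B, C, T, H, W)
        return x + self.temporal(self.act(self.spatial(self.act(x))))
```

The two factors touch k^2 + k weights per channel pair instead of k^3, which is what allows deeper video SR models at a fixed computational budget.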
Recurrent Back-Projection Network for Video Super-Resolution
We propose a novel architecture for the problem of video super-resolution.
We integrate spatial and temporal contexts from continuous video frames using a
recurrent encoder-decoder module that fuses multi-frame information with the
more traditional, single frame super-resolution path for the target frame. In
contrast to most prior work where frames are pooled together by stacking or
warping, our model, the Recurrent Back-Projection Network (RBPN), treats each
context frame as a separate source of information. These sources are combined
in an iterative refinement framework inspired by the idea of back-projection in
multiple-image super-resolution. This is aided by explicitly representing
estimated inter-frame motion with respect to the target, rather than explicitly
aligning frames. We propose a new video super-resolution benchmark, allowing
evaluation at a larger scale and considering videos in different motion
regimes. Experimental results demonstrate that our RBPN is superior to existing
methods on several datasets.
Comment: To appear in CVPR 2019.
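
The iterative refinement loop can be sketched as follows; sisr_net, projection, and recon_net are placeholder modules, and the exact inputs per step are assumptions based on the abstract.

```python
import torch

def rbpn_forward(sisr_net, projection, recon_net, lr_target, neighbors, flows):
    """Refine an HR estimate with one context frame at a time.
    projection(h, m) is a placeholder returning the refined hidden state
    and an HR feature map used for final reconstruction."""
    h = sisr_net(lr_target)                    # initial estimate from the target alone
    hr_feats = []
    for nbr, flow in zip(neighbors, flows):
        # each context frame, paired with its flow to the target, is a
        # separate source of information (no explicit frame alignment)
        m = torch.cat([lr_target, nbr, flow], dim=1)
        h, feat = projection(h, m)             # back-projection style refinement
        hr_feats.append(feat)
    return recon_net(torch.cat(hr_feats, dim=1))  # reconstruct the SR frame
```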
Learning for Video Super-Resolution through HR Optical Flow Estimation
Video super-resolution (SR) aims to generate a sequence of high-resolution
(HR) frames with plausible and temporally consistent details from their
low-resolution (LR) counterparts. The generation of accurate correspondence
plays a significant role in video SR. Traditional video SR methods have
demonstrated that simultaneous SR of both images and optical flows can provide
accurate correspondences and better SR results. However, existing
deep-learning-based methods use LR optical flows for correspondence generation. In
this paper, we propose an end-to-end trainable video SR framework to
super-resolve both images and optical flows. Specifically, we first propose an
optical flow reconstruction network (OFRnet) to infer HR optical flows in a
coarse-to-fine manner. Then, motion compensation is performed according to the
HR optical flows. Finally, compensated LR inputs are fed to a super-resolution
network (SRnet) to generate the SR results. Extensive experiments demonstrate
that HR optical flows provide more accurate correspondences than their LR
counterparts and improve both accuracy and temporal consistency. Comparative
results on the Vid4 and DAVIS-10 datasets show that our framework achieves the
state-of-the-art performance.
Comment: To appear in ACCV 2018.
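
A hedged sketch of motion compensation with an HR flow: the flow is folded back to LR resolution via space-to-depth, yielding several compensated LR views per neighbor frame. It reuses the backward-warping helper sketched earlier in this list, and whether this matches the paper's exact formulation is an assumption.

```python
import torch
import torch.nn.functional as F

def compensate_with_hr_flow(lr_neighbor, flow_hr, scale=4):
    """Fold an HR flow (B,2,H,W) into scale^2 LR flow fields and warp the
    LR neighbor once per field; `warp` is the helper defined earlier."""
    b, _, H, W = flow_hr.shape
    h, w = H // scale, W // scale
    # divide by scale so displacements are expressed in LR pixel units
    views = F.pixel_unshuffle(flow_hr / scale, scale).view(b, 2, scale * scale, h, w)
    drafts = [warp(lr_neighbor, views[:, :, i]) for i in range(scale * scale)]
    return torch.cat(drafts, dim=1)  # stacked compensated LR inputs for SRnet
```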
Image Super-Resolution via Dual-State Recurrent Networks
Advances in image super-resolution (SR) have recently benefited significantly
from rapid developments in deep neural networks. Inspired by these recent
discoveries, we note that many state-of-the-art deep SR architectures can be
reformulated as a single-state recurrent neural network (RNN) with finite
unfoldings. In this paper, we explore new structures for SR based on this
compact RNN view, leading us to a dual-state design, the Dual-State Recurrent
Network (DSRN). Compared to its single state counterparts that operate at a
fixed spatial resolution, DSRN exploits both low-resolution (LR) and
high-resolution (HR) signals jointly. Recurrent signals are exchanged between
these states in both directions (both LR to HR and HR to LR) via delayed
feedback. Extensive quantitative and qualitative evaluations on benchmark
datasets and on a recent challenge demonstrate that the proposed DSRN performs
favorably against state-of-the-art algorithms in terms of both memory
consumption and predictive accuracy.
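
A minimal sketch of one dual-state recurrent step, with signals exchanged in both directions between an LR-resolution and an HR-resolution state; the specific operators (strided and transposed convolutions) are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualStateCell(nn.Module):
    """One recurrent step over an LR-resolution and an HR-resolution state."""
    def __init__(self, nf=64, scale=2):
        super().__init__()
        self.f_lr = nn.Conv2d(2 * nf, nf, 3, padding=1)            # LR state update
        self.up = nn.ConvTranspose2d(nf, nf, scale, stride=scale)  # LR -> HR signal
        self.down = nn.Conv2d(nf, nf, scale, stride=scale)         # HR -> LR feedback
        self.f_hr = nn.Conv2d(2 * nf, nf, 3, padding=1)            # HR state update
        self.act = nn.ReLU(inplace=True)

    def forward(self, s_lr, s_hr):
        # the LR state sees the previous HR state (delayed feedback), and the
        # HR state sees the freshly updated LR state, upsampled
        s_lr = self.act(self.f_lr(torch.cat([s_lr, self.down(s_hr)], dim=1)))
        s_hr = self.act(self.f_hr(torch.cat([s_hr, self.up(s_lr)], dim=1)))
        return s_lr, s_hr
```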
NTIRE 2020 Challenge on Image and Video Deblurring
Motion blur is one of the most common degradation artifacts in dynamic scene
photography. This paper reviews the NTIRE 2020 Challenge on Image and Video
Deblurring. In this challenge, we present the evaluation results from 3
competition tracks as well as the proposed solutions. Track 1 aims to develop
single-image deblurring methods focusing on restoration quality. On Track 2,
the image deblurring methods are executed on a mobile platform to find the
balance of the running speed and the restoration accuracy. Track 3 targets
developing video deblurring methods that exploit the temporal relation between
input frames. The three tracks had 163, 135, and 102 registered participants,
respectively, and in the final testing phase, 9, 4, and 7 teams competed. The
winning methods demonstrate state-of-the-art performance on image and video
deblurring tasks.
Comment: To be published in CVPR 2020 Workshop (New Trends in Image
Restoration and Enhancement).
Recurrent Convolutions for Causal 3D CNNs
Recently, three dimensional (3D) convolutional neural networks (CNNs) have
emerged as dominant methods to capture spatiotemporal representations in
videos, by adding to pre-existing 2D CNNs a third, temporal dimension. Such 3D
CNNs, however, are anti-causal (i.e., they exploit information from both the
past and the future frames to produce feature representations, thus preventing
their use in online settings), constrain the temporal reasoning horizon to the
size of the temporal convolution kernel, and are not temporal
resolution-preserving for video sequence-to-sequence modelling, as, for
instance, in action detection. To address these serious limitations, here we
present a new 3D CNN architecture for the causal/online processing of videos.
Namely, we propose a novel Recurrent Convolutional Network (RCN), which
relies on recurrence to capture the temporal context across frames at each
network level. Our network decomposes 3D convolutions into (1) a 2D spatial
convolution component, and (2) an additional hidden state
convolution, applied across time. The hidden state at any time t is assumed
to depend on the hidden state at t-1 and on the current output of the spatial
convolution component. As a result, the proposed network: (i) produces causal
outputs, (ii) provides flexible temporal reasoning, (iii) preserves temporal
resolution. Our experiments on the large-scale Kinetics and MultiThumos
datasets show that the proposed method performs comparably to anti-causal 3D
CNNs, while being causal and using fewer parameters.
Comment: Workshop on Large Scale Holistic Video Understanding, ICCVW, 2019.
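
The decomposition described above lends itself to a compact sketch: a per-frame 2D spatial convolution plus a hidden-state convolution applied across time, so that the state at step t depends only on the state at t-1 and the current spatial output. This is a minimal causal cell, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class RecurrentConvCell(nn.Module):
    """Causal replacement for a 3D convolution: 2D spatial conv + state conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.spatial = nn.Conv2d(c_in, c_out, 3, padding=1)   # per-frame 2D conv
        self.hidden = nn.Conv2d(c_out, c_out, 3, padding=1)   # hidden-state conv
        self.act = nn.ReLU(inplace=True)

    def forward(self, clip):                 # clip: (B, T, C, H, W)
        b, t, c, h, w = clip.shape
        state, outputs = None, []
        for i in range(t):                   # only past frames influence step i
            x = self.spatial(clip[:, i])
            state = self.act(x if state is None else x + self.hidden(state))
            outputs.append(state)
        return torch.stack(outputs, dim=1)   # temporal resolution preserved
```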
Down-Scaling with Learned Kernels in Multi-Scale Deep Neural Networks for Non-Uniform Single Image Deblurring
The multi-scale approach has been used in blind image/video deblurring
problems to yield excellent performance for both conventional and recent
deep-learning-based state-of-the-art methods. Bicubic down-sampling is a
typical choice in multi-scale approaches to reduce spatial dimensions after
filtering with a fixed kernel. However, this fixed kernel may be sub-optimal
since it may destroy important information for reliable deblurring such as
strong edges. We propose convolutional neural network (CNN)-based down-scaling
methods for multi-scale deep-learning-based non-uniform single image
deblurring. We argue that our CNN-based down-scaling effectively reduces the
spatial dimension of the original image, while learned kernels with multiple
channels may well preserve details necessary for deblurring tasks. At each
scale, we adopt RCAN (Residual Channel Attention Networks) as a backbone
network to further improve performance. Our proposed method yielded
state-of-the-art performance on the GoPro dataset by a large margin, achieving
2.59dB higher PSNR than the current state-of-the-art method of Tao et al. Our
proposed CNN-based down-scaling was the key factor in this performance:
without it, the PSNR of our network decreased by 1.98dB. The same networks
trained on the GoPro set were also evaluated on the large-scale Su dataset,
where our proposed method yielded 1.15dB better PSNR than the method of Tao
et al. Qualitative comparisons on the Lai dataset also
confirmed the superior performance of our proposed method over other
state-of-the-art methods.
Comment: 10 pages, 7 figures, 4 tables.
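
A minimal sketch of the learned down-scaling idea: a strided convolution with several output channels replaces fixed bicubic filtering at each pyramid level, letting the network keep deblurring-relevant detail such as strong edges. The kernel size and channel count are assumptions.

```python
import torch.nn as nn

class LearnedDownscale(nn.Module):
    """Learned replacement for bicubic down-sampling in a multi-scale pyramid."""
    def __init__(self, c_in=3, c_out=12, factor=2):
        super().__init__()
        # stride = factor halves each spatial dimension; the extra output
        # channels let the learned kernels keep information a single fixed
        # kernel would discard
        self.down = nn.Conv2d(c_in, c_out, kernel_size=factor * 2,
                              stride=factor, padding=factor // 2)

    def forward(self, x):
        return self.down(x)
```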