Handheld Multi-Frame Super-Resolution
Compared to DSLR cameras, smartphone cameras have smaller sensors, which limit
their spatial resolution; smaller apertures, which limit their light-gathering
ability; and smaller pixels, which reduce their signal-to-noise ratio. The use
of color filter arrays (CFAs) requires demosaicing, which
further degrades resolution. In this paper, we supplant the use of traditional
demosaicing in single-frame and burst photography pipelines with a multi-frame
super-resolution algorithm that creates a complete RGB image directly from a
burst of CFA raw images. We harness natural hand tremor, typical in handheld
photography, to acquire a burst of raw frames with small offsets. These frames
are then aligned and merged to form a single image with red, green, and blue
values at every pixel site. This approach, which includes no explicit
demosaicing step, serves to both increase image resolution and boost the
signal-to-noise ratio. Our algorithm is robust to challenging scene conditions: local
motion, occlusion, or scene changes. It runs at 100 milliseconds per
12-megapixel RAW input burst frame on mass-produced mobile phones.
Specifically, the algorithm is the basis of the Super-Res Zoom feature, as well
as the default merge method in Night Sight mode (whether zooming or not) on
Google's flagship phone.
Comment: 24 pages, accepted to the SIGGRAPH 2019 Technical Papers program.
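The core align-and-merge idea can be illustrated with a minimal sketch, assuming grayscale frames and purely translational hand-tremor motion (the paper instead merges raw CFA samples with robust kernel regression); the phase-correlation alignment and all names below are illustrative assumptions, not the authors' pipeline.

```python
# Minimal align-and-merge sketch: estimate the small offset of each burst
# frame against a reference, shift it back, and average the aligned stack.
import numpy as np
from scipy.ndimage import shift as translate

def phase_correlation_offset(ref, frame):
    """Estimate the (dy, dx) translation of `frame` relative to `ref`."""
    cross = np.fft.fft2(ref) * np.conj(np.fft.fft2(frame))
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-8)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    dy = dy - ref.shape[0] if dy > ref.shape[0] // 2 else dy
    dx = dx - ref.shape[1] if dx > ref.shape[1] // 2 else dx
    return dy, dx

def merge_burst(frames):
    """Align each frame to the first one and average the aligned stack."""
    ref = frames[0].astype(np.float64)
    aligned = [ref]
    for f in frames[1:]:
        dy, dx = phase_correlation_offset(ref, f)
        aligned.append(translate(f.astype(np.float64), (dy, dx)))
    return np.mean(aligned, axis=0)  # SNR improves roughly with sqrt(len(frames))
```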
A New Adaptive Video Super-Resolution Algorithm With Improved Robustness to Innovations
In this paper, a new video super-resolution reconstruction (SRR) method with
improved robustness to outliers is proposed. Although the R-LMS is one of the
SRR algorithms with the best reconstruction quality for its computational cost,
and is naturally robust to registration inaccuracies, its performance is known
to degrade severely in the presence of innovation outliers. By studying the
proximal point cost function representation of the R-LMS iterative equation, a
better understanding of its performance under different situations is attained.
Using statistical properties of typical innovation outliers, a new cost
function is then proposed and two new algorithms are derived, which present
improved robustness to outliers while maintaining computational costs
comparable to those of the R-LMS. Monte Carlo simulation results illustrate that the
proposed method outperforms the traditional and regularized versions of LMS and is
competitive with state-of-the-art SRR methods at a much smaller computational cost.
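As a rough illustration of what an LMS-style SRR recursion looks like (a simplified sketch, not the paper's exact R-LMS update or its robust variants), each new low-resolution frame pulls the high-resolution estimate along the gradient of a data-fit term plus a smoothness regularizer; the blur model, step size, and names below are assumptions.

```python
# One LMS-style gradient step on ||y_lr - D H x_hr||^2 plus a Laplacian prior.
import numpy as np
from scipy.ndimage import gaussian_filter, laplace, zoom

def lms_sr_step(x_hr, y_lr, scale=2, mu=0.5, alpha=0.01):
    """One gradient step pulling the HR estimate toward the new LR frame y_lr.
    Assumes the dimensions of x_hr are divisible by `scale`."""
    blurred = gaussian_filter(x_hr, sigma=1.0)        # H: optical blur
    simulated_lr = blurred[::scale, ::scale]          # D: decimation
    residual = y_lr - simulated_lr                    # innovation / data misfit
    back_proj = gaussian_filter(zoom(residual, scale, order=1), sigma=1.0)  # H^T D^T
    return x_hr + mu * back_proj - alpha * laplace(laplace(x_hr))           # + prior
```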
EDVR: Video Restoration with Enhanced Deformable Convolutional Networks
Video restoration tasks, including super-resolution, deblurring, etc., are
drawing increasing attention in the computer vision community. A challenging
benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark
challenges existing methods from two aspects: (1) how to align multiple frames
given large motions, and (2) how to effectively fuse different frames with
diverse motion and blur. In this work, we propose a novel Video Restoration
framework with Enhanced Deformable networks, termed EDVR, to address these
challenges. First, to handle large motions, we devise a Pyramid, Cascading and
Deformable (PCD) alignment module, in which frame alignment is done at the
feature level using deformable convolutions in a coarse-to-fine manner. Second,
we propose a Temporal and Spatial Attention (TSA) fusion module, in which
attention is applied both temporally and spatially, so as to emphasize
important features for subsequent restoration. Thanks to these modules, our
EDVR wins first place in all four tracks of the NTIRE19 video restoration and
enhancement challenges, outperforming the second-place entries by a large margin.
EDVR also demonstrates superior performance to state-of-the-art published
methods on video super-resolution and deblurring. The code is available at
https://github.com/xinntao/EDVR.
Comment: To appear in CVPR 2019 Workshop. Winner of all four tracks in
the NTIRE 2019 video restoration and enhancement challenges. Project page:
https://xinntao.github.io/projects/EDVR, Code: https://github.com/xinntao/EDVR
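A simplified sketch of the temporal half of a TSA-style fusion is given below; channel sizes, layer choices, and names are assumptions, and EDVR's actual module additionally applies spatial attention and pyramid processing.

```python
# Temporal attention: weight each frame's features by their per-pixel
# similarity to the reference (center) frame before fusing them.
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    def __init__(self, channels=64, num_frames=5):
        super().__init__()
        self.emb_ref = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_nbr = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(num_frames * channels, channels, 1)

    def forward(self, feats):                  # feats: (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        ref = self.emb_ref(feats[:, t // 2])   # center frame as reference
        weighted = []
        for i in range(t):
            nbr = self.emb_nbr(feats[:, i])
            # per-pixel similarity with the reference -> temporal weight
            attn = torch.sigmoid((ref * nbr).sum(dim=1, keepdim=True))
            weighted.append(feats[:, i] * attn)
        return self.fuse(torch.cat(weighted, dim=1))
```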
MEMC-Net: Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement
Motion estimation (ME) and motion compensation (MC) have been widely used for
classical video frame interpolation systems over the past decades. Recently, a
number of data-driven frame interpolation methods based on convolutional neural
networks have been proposed. However, existing learning-based methods typically
estimate either flow or compensation kernels, thereby limiting performance on
both computational efficiency and interpolation accuracy. In this work, we
propose a motion estimation and compensation driven neural network for video
frame interpolation. A novel adaptive warping layer is developed to integrate
both optical flow and interpolation kernels to synthesize target frame pixels.
This layer is fully differentiable such that both the flow and kernel
estimation networks can be optimized jointly. The proposed model benefits from
the advantages of motion estimation and compensation methods without using
hand-crafted features. Compared to existing methods, our approach is
computationally efficient and able to generate more visually appealing results.
Furthermore, the proposed MEMC-Net can be seamlessly adapted to several video
enhancement tasks, e.g., super-resolution, denoising, and deblocking. Extensive
quantitative and qualitative evaluations demonstrate that the proposed method
performs favorably against the state-of-the-art video frame interpolation and
enhancement algorithms on a wide range of datasets.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence.
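The combination of flow-based warping with per-pixel kernel blending can be sketched as below; the shapes, the flow convention, and all names are illustrative assumptions in the spirit of an adaptive warping layer, not the authors' code.

```python
# Warp a frame by optical flow, then blend each warped pixel's neighborhood
# with learned per-pixel kernels.
import torch
import torch.nn.functional as F

def adaptive_warp(frame, flow, kernels, ksize=3):
    """frame: (B,C,H,W); flow: (B,2,H,W) with (dx, dy) channels;
    kernels: (B, ksize*ksize, H, W) of per-pixel blending weights."""
    b, c, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                              # follow the flow
    grid_x = 2 * coords[:, 0] / (w - 1) - 1                        # normalize to [-1, 1]
    grid_y = 2 * coords[:, 1] / (h - 1) - 1
    warped = F.grid_sample(frame, torch.stack((grid_x, grid_y), dim=-1),
                           align_corners=True)
    # blend the warped neighborhood with softmax-normalized per-pixel kernels
    patches = F.unfold(warped, ksize, padding=ksize // 2).view(b, c, ksize * ksize, h, w)
    weights = torch.softmax(kernels, dim=1).unsqueeze(1)           # (B,1,k*k,H,W)
    return (patches * weights).sum(dim=2)
```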
FastDVDnet: Towards Real-Time Deep Video Denoising Without Flow Estimation
In this paper, we propose a state-of-the-art video denoising algorithm based
on a convolutional neural network architecture. Until recently, video denoising
with neural networks had been a largely underexplored domain, and existing
methods could not compete with the performance of the best patch-based methods.
The approach we introduce in this paper, called FastDVDnet, shows similar or
better performance than other state-of-the-art competitors with significantly
lower computing times. In contrast to other existing neural network denoisers,
our algorithm exhibits several desirable properties such as fast runtimes, and
the ability to handle a wide range of noise levels with a single network model.
The characteristics of its architecture make it possible to avoid using a
costly motion compensation stage while achieving excellent performance. The
combination of its denoising performance and low computational load
makes this algorithm attractive for practical denoising applications. We
compare our method with different state-of-the-art algorithms, both visually and
with respect to objective quality metrics.
Comment: Code and results for this algorithm can be found at
https://github.com/m-tassano/fastdvdnet. arXiv admin note: text overlap with
arXiv:1906.1189
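The single-model, multi-noise-level behavior typically relies on noise-map conditioning: the assumed noise standard deviation is tiled into an extra input channel. The block below is a toy placeholder illustrating only that idea, not the actual two-stage cascaded FastDVDnet architecture.

```python
# Noise-map conditioning: one non-blind denoiser covers many noise levels.
import torch
import torch.nn as nn

class DenoisingBlock(nn.Module):
    def __init__(self, num_frames=3, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_frames * 3 + 1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, frames, sigma):
        """frames: (B, num_frames*3, H, W) of stacked RGB frames; sigma: scalar."""
        noise_map = torch.full_like(frames[:, :1], sigma)   # constant-sigma channel
        return self.net(torch.cat((frames, noise_map), dim=1))
```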
SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection
Data-driven saliency detection has attracted strong interest as a result of
applying convolutional neural networks to the detection of eye fixations.
Although a number of image-based salient object and fixation detection models
have been proposed, video fixation detection still requires more exploration.
Different from image analysis, motion and temporal information are crucial
factors affecting human attention when viewing video sequences. Although
existing models based on local contrast and low-level features have been
extensively researched, they fail to simultaneously consider inter-frame
motion and temporal information across neighboring video frames, leading to
unsatisfactory performance when handling complex scenes. To this end, we
propose a novel and efficient video eye fixation detection model to improve the
saliency detection performance. By simulating the memory mechanism and visual
attention mechanism of human beings when watching a video, we propose a
step-gained fully convolutional network by combining the memory information on
the time axis with the motion information on the space axis while storing the
saliency information of the current frame. The model is obtained through
hierarchical training, which ensures the accuracy of the detection. Extensive
experiments in comparison with 11 state-of-the-art methods are carried out, and
the results show that our proposed model outperforms all 11 methods across a
number of publicly available datasets.
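The "memory plus motion" idea can be sketched by feeding the previous frame's saliency map (memory) and a frame-difference map (motion) alongside the current frame; the tiny network below is a placeholder for illustration, not the SG-FCN architecture.

```python
# Combine the current frame, a motion cue, and the previous saliency map.
import torch
import torch.nn as nn

class MemoryMotionSaliency(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1 + 1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, frame, prev_frame, prev_saliency):
        motion = (frame - prev_frame).abs().mean(dim=1, keepdim=True)  # (B,1,H,W)
        return self.net(torch.cat((frame, motion, prev_saliency), dim=1))
```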
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
in order to obtain better performance in visual feature learning from images or
videos for computer vision applications. To avoid the extensive cost of collecting
and annotating large-scale datasets, self-supervised learning methods, a subset of
unsupervised learning, have been proposed to learn general image and video features
from large-scale unlabeled data without using any human-annotated labels. This
paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images or videos. First, the motivation, general pipeline, and terminologies of
this field are described. Then the common deep neural network architectures
used for self-supervised learning are summarized. Next, the main
components and evaluation metrics of self-supervised learning methods are
reviewed, followed by the commonly used image and video datasets and the
existing self-supervised visual feature learning methods. Finally, quantitative
performance comparisons of the reviewed methods on benchmark datasets are
summarized and discussed for both image and video feature learning. Lastly, the
paper concludes with a set of promising future directions for self-supervised
visual feature learning.
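One concrete example of the pretext tasks such surveys cover is rotation prediction (Gidaris et al.), where the supervisory signal, the rotation index, is generated from the image itself; the backbone below is a toy placeholder used only to make the pipeline concrete.

```python
# Self-supervised pretext task: predict which of four rotations was applied.
import torch
import torch.nn as nn

def rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the rotation index is the label."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))
x, y = rotation_batch(torch.randn(8, 3, 32, 32))
loss = nn.CrossEntropyLoss()(backbone(x), y)   # the self-supervised objective
```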
On Variational Methods for Motion Compensated Inpainting
We develop in this paper a generic Bayesian framework for the joint
estimation of motion and recovery of missing data in a damaged video sequence.
Following the standard rationale from maximum a posteriori estimation to a variational formulation, we
derive generic minimum energy formulations for the estimation of a
reconstructed sequence as well as motion recovery. We instantiate these energy
formulations and, from their Euler-Lagrange equations, we propose full
multiresolution algorithms to compute good local minimizers for our
energies and discuss their numerical implementations, focusing on the missing
data recovery part, i.e., inpainting. Experimental results for synthetic as well
as real sequences are presented. Image sequences and extra material are
available at http://image.diku.dk/francois/seqinp.php.
Comment: DIKU Technical Report 2009 with some small corrections.
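To make the "minimum energy formulation" concrete, an illustrative joint energy over the recovered sequence u and motion field v (not the paper's exact functional) might couple a motion-compensated brightness-constancy data term with smoothness regularizers on both unknowns, with u reconstructed only over the damaged region for the inpainting part:

```latex
% Illustrative joint energy, assuming a spatial domain \Omega and time range [0, T]
E(u, v) = \int_{\Omega \times [0,T]} \big( u(x + v(x,t),\, t+1) - u(x,t) \big)^2 \, dx\, dt
        + \alpha \int_{\Omega \times [0,T]} \lVert \nabla u \rVert^2 \, dx\, dt
        + \beta  \int_{\Omega \times [0,T]} \lVert \nabla v \rVert^2 \, dx\, dt
```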
Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision
Researchers have developed excellent feed-forward models that learn to map
images to desired outputs, such as to the images' latent factors, or to other
images, using supervised learning. Learning such mappings from unlabelled data,
or improving upon supervised models by exploiting unlabelled data, remains
elusive. We argue that there are two important parts to learning without
annotations: (i) matching the predictions to the input observations, and (ii)
matching the predictions to known priors. We propose Adversarial Inverse
Graphics Networks (AIGNs): weakly supervised neural network models that combine
feedback from rendering their predictions, with distribution matching between
their predictions and a collection of ground-truth factors. We apply AIGNs to
3D human pose estimation and 3D structure and egomotion estimation, and
outperform models supervised by only paired annotations. We further apply AIGNs
to facial image transformation using super-resolution and inpainting renderers,
while deliberately adding biases in the ground-truth datasets. Our model
seamlessly incorporates such biases, rendering input faces towards young, old,
feminine, masculine or Tom Cruise-like equivalents (depending on the chosen
bias), or adding lip and nose augmentations while inpainting concealed lips and
noses.
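The two ingredients the abstract names, feedback from rendering and distribution matching, can be sketched as a pair of losses; `predictor`, `renderer`, and `discriminator` are user-supplied placeholders, and the separate discriminator training step on the unpaired factor pool is not shown.

```python
# AIGN-style objective: reconstruct the observation through a renderer and
# push predictions toward the distribution of unpaired ground-truth factors.
import torch
import torch.nn.functional as F

def aign_losses(image, predictor, renderer, discriminator):
    pred = predictor(image)                         # e.g. 3D pose / latent factors
    recon = F.mse_loss(renderer(pred), image)       # feedback from rendering
    logits = discriminator(pred)                    # distribution matching term
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return recon, adv
```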
EVA²: Exploiting Temporal Redundancy in Live Computer Vision
Hardware support for deep convolutional neural networks (CNNs) is critical to
advanced computer vision in mobile and embedded devices. Current designs,
however, accelerate generic CNNs; they do not exploit the unique
characteristics of real-time vision. We propose to use the temporal redundancy
in natural video to avoid unnecessary computation on most frames. A new
algorithm, activation motion compensation, detects changes in the visual input
and incrementally updates a previously-computed output. The technique takes
inspiration from video compression and applies well-known motion estimation
techniques to adapt to visual changes. We use an adaptive key frame rate to
control the trade-off between efficiency and vision quality as the input
changes. We implement the technique in hardware as an extension to existing
state-of-the-art CNN accelerator designs. The new unit reduces the average
energy per frame by 54.2%, 61.7%, and 87.6% for three CNNs with less than 1%
loss in vision accuracy.
Comment: Appears in ISCA 2018.
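The key-frame / incremental-update control flow can be sketched as below: the expensive CNN prefix runs only on key frames, while intermediate frames estimate motion against the key frame and shift the cached activations instead of recomputing them. Real designs use block-wise motion vectors and hardware motion estimation; everything here, including the global-shift search and the supplied `prefix_cnn`/`suffix_cnn` callables, is illustrative.

```python
# Activation motion compensation, toy version with a single global offset.
import numpy as np

def global_offset(ref, cur, search=4):
    """Coarse motion estimate: exhaustive search over small global shifts."""
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err = np.abs(np.roll(cur, (dy, dx), axis=(0, 1)) - ref).mean()
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

def process_frame(frame, state, prefix_cnn, suffix_cnn, key_interval=8):
    if state["count"] % key_interval == 0:          # key frame: full computation
        state["key"], state["acts"] = frame, prefix_cnn(frame)
    else:                                           # predicted frame: cheap update
        dy, dx = global_offset(state["key"], frame)
        state["acts"] = np.roll(state["acts"], (dy, dx), axis=(0, 1))
    state["count"] += 1
    return suffix_cnn(state["acts"])
```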