Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
We propose a novel feed-forward network for video inpainting. We use a set of
sampled video frames as references, taking visible content from them to fill
the holes of a target frame. Our video inpainting network consists of two
stages. The first stage is an alignment module that computes homographies
between the reference frames and the target frame. Visible patches are then
aggregated based on frame similarity to roughly fill in the target holes. The second
stage is a non-local attention module that matches the generated patches with
known reference patches (in space and time) to refine the previous global
alignment stage. Both stages use a large spatial-temporal window over the
references, enabling the model to capture long-range correlations between
distant information and the hole regions. As a result, our method handles even
challenging scenes with large or slowly moving holes, which existing
flow-based approaches can hardly model. Our network also includes a recurrent
propagation stream to encourage temporal consistency in video results.
Experiments on video object removal demonstrate that our method inpaints the
holes with globally and locally coherent contents.
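The first (alignment) stage described above can be sketched in miniature. The following is a hedged numpy illustration of warping a reference frame into the target view through a given homography by inverse mapping; the function names and the nearest-neighbour sampling are our own simplifications, not the paper's implementation.

```python
import numpy as np

def apply_homography(H, pts):
    """Map N x 2 pixel coordinates through a 3 x 3 homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coords
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]              # perspective divide

def align_reference(ref, H, shape):
    """Warp a reference frame into the target view by inverse mapping
    (nearest-neighbour, no interpolation -- illustration only)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    tgt = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src = apply_homography(np.linalg.inv(H), tgt)
    sx = np.clip(np.round(src[:, 0]).astype(int), 0, ref.shape[1] - 1)
    sy = np.clip(np.round(src[:, 1]).astype(int), 0, ref.shape[0] - 1)
    return ref[sy, sx].reshape(h, w)

# A pure translation expressed as a homography: shift right by 2 pixels.
H = np.array([[1.0, 0, 2], [0, 1.0, 0], [0, 0, 1.0]])
ref = np.arange(16.0).reshape(4, 4)
aligned = align_reference(ref, H, (4, 4))
```

In the full method, patches from several such aligned references would then be aggregated by frame similarity before the attention-based refinement.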
Deep multi-frame face super-resolution
Face verification and recognition have seen rapid progress in recent years;
however, recognition from small-sized images remains a challenging task that is
inherently intertwined with the task of face super-resolution. Tackling
this problem using multiple frames is an attractive idea, yet requires solving
the alignment problem that is also challenging for low-resolution faces. Here
we present a holistic system for multi-frame recognition, alignment, and
super-resolution of faces. Our neural network architecture restores the central
frame of each input sequence, additionally taking into account a number of
adjacent frames and making use of sub-pixel movements. We present our results
using the popular dataset for video face recognition (YouTube Faces). We show a
notable improvement in identification score compared to several baselines,
including one based on single-image super-resolution.
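As a toy illustration of why multiple frames with sub-pixel movements help, consider classic shift-and-add super-resolution (a much older technique than the neural architecture above, shown here only to convey the intuition; all names are ours): each low-resolution sample is placed onto a finer grid at its known sub-pixel offset.

```python
import numpy as np

def shift_and_add(frames, shifts, scale):
    """Toy multi-frame super-resolution: place each low-resolution sample
    onto a `scale`-times finer grid at its known sub-pixel (dy, dx) offset
    and average where samples overlap."""
    h, w = frames[0].shape
    hi = np.zeros((h * scale, w * scale))
    cnt = np.zeros_like(hi)
    for frame, (dy, dx) in zip(frames, shifts):
        ys = (np.arange(h) * scale + round(dy * scale)) % (h * scale)
        xs = (np.arange(w) * scale + round(dx * scale)) % (w * scale)
        hi[np.ix_(ys, xs)] += frame
        cnt[np.ix_(ys, xs)] += 1
    cnt[cnt == 0] = 1                      # avoid division by zero in gaps
    return hi / cnt

# Four constant frames at the four half-pixel offsets tile the full 2x grid.
frames = [np.full((3, 3), 7.0)] * 4
shifts = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5)]
hr = shift_and_add(frames, shifts, scale=2)
```

A learned system replaces the known shifts with estimated alignment and the averaging with a trained restoration network, but the extra information it exploits is the same.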
Perceptual Video Super Resolution with Enhanced Temporal Consistency
With the advent of perceptual loss functions, new possibilities in
super-resolution have emerged, and we currently have models that successfully
generate near-photorealistic high-resolution images from their low-resolution
observations. Up to now, however, such approaches have been limited to
single-image super-resolution. The application of perceptual loss functions
on video processing still entails several challenges, mostly related to the
lack of temporal consistency of the generated images, i.e., flickering
artifacts. In this work, we present a novel adversarial recurrent network for
video upscaling that is able to produce realistic textures in a temporally
consistent way. The proposed recurrent architecture naturally leverages
information from previous frames: the input to the generator is composed of the
low-resolution image and, additionally, the warped
output of the network at the previous step. Together with a video
discriminator, we also propose additional loss functions to further reinforce
temporal consistency in the generated sequences. The experimental validation of
our algorithm shows the effectiveness of our approach which obtains images with
high perceptual quality and improved temporal consistency.
Comment: Major revision and improvement of the manuscript: new network
architecture, new loss function and extended experiments
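The recurrent input construction described above can be sketched as follows. This is a hedged numpy illustration, not the paper's model: the flow is assumed given and integer-valued, warping is nearest-neighbour, and the resolution mismatch between the low-resolution frame and the high-resolution previous output (handled in practice by, e.g., space-to-depth) is ignored.

```python
import numpy as np

def warp(img, flow):
    """Backward-warp img (H x W) by an integer optical-flow field
    (H x W x 2, in pixels) -- nearest-neighbour, illustration only."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(xs - flow[..., 0].astype(int), 0, w - 1)
    sy = np.clip(ys - flow[..., 1].astype(int), 0, h - 1)
    return img[sy, sx]

def generator_input(lr_frame, prev_output, flow):
    """Stack the current low-resolution frame with the flow-warped
    previous network output, channel-wise."""
    return np.stack([lr_frame, warp(prev_output, flow)], axis=-1)

lr = np.zeros((4, 4))
prev = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))          # zero motion: the warp is the identity
x = generator_input(lr, prev, flow)
```

Feeding the warped previous output back in is what lets the generator reuse textures it has already synthesised instead of hallucinating them anew each frame, which is the source of the improved temporal consistency.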
Content-Preserving Image Stitching with Regular Boundary Constraints
This paper proposes an approach to content-preserving stitching of images
with regular boundary constraints, which stitches multiple images into a
panoramic image with a regular boundary. Existing methods treat image
stitching and rectangling as two separate steps, which may result in suboptimal
results as the stitching process is not aware of the further warping needs for
rectangling. We address these limitations by formulating image stitching with
regular boundaries in a unified optimization. Starting from the initial
stitching results produced by traditional warping-based optimization, we obtain
the irregular boundary from the warped meshes by polygon Boolean operations
which robustly handle arbitrary mesh compositions, and by analyzing the
irregular boundary construct a piecewise rectangular boundary. Based on this,
we further incorporate straight line preserving and regular boundary
constraints into the image stitching framework, and conduct iterative
optimization to obtain an optimal piecewise rectangular boundary. This makes
the panoramic boundary as close as possible to a rectangle while reducing
unwanted distortions. We further extend our method to panoramic videos and
selfie photography, by integrating the temporal coherence and portrait
preservation into the optimization. Experiments show that our method
efficiently produces visually pleasing panoramas with regular boundaries and
unnoticeable distortions.
Comment: 12 figures, 13 pages
Image Resizing by Reconstruction from Deep Features
Traditional image resizing methods usually work in pixel space and use
various saliency measures. The challenge is to adjust the image shape while
trying to preserve important content. In this paper we perform image resizing
in feature space where the deep layers of a neural network contain rich
important semantic information. We directly adjust the image feature maps,
extracted from a pre-trained classification network, and reconstruct the
resized image using a neural-network based optimization. This novel approach
leverages the hierarchical encoding of the network, and in particular, the
high-level discriminative power of its deeper layers, that recognizes semantic
objects and regions and allows maintaining their aspect ratio. Our use of
reconstruction from deep features diminishes the artifacts introduced by
image-space resizing operators. We evaluate our method on benchmarks, compare
to alternative approaches, and demonstrate its strength on challenging images.
Comment: 13 pages, 21 figures
Robust Registration of Gaussian Mixtures for Colour Transfer
We present a flexible approach to colour transfer inspired by techniques
recently proposed for shape registration. Colour distributions of the palette
and target images are modelled with Gaussian Mixture Models (GMMs) that are
robustly registered to infer a non-linear parametric transfer function. We show
experimentally that our approach compares well to current techniques both
quantitatively and qualitatively. Moreover, our technique is computationally
the fastest and can take efficient advantage of parallel processing
architectures for recolouring images and videos. Our transfer function is
parametric and hence can be stored in memory for later usage and also combined
with other computed transfer functions to create interesting visual effects.
Overall, this paper provides a fast, user-friendly approach to recolouring
image and video materials.
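To make the idea of a stored, reusable parametric transfer function concrete, here is a drastically simplified, single-Gaussian, axis-aligned analogue of the GMM registration above: matching the per-channel mean and standard deviation of the target colours to those of the palette. This is our own illustrative reduction, not the paper's method.

```python
import numpy as np

def gaussian_colour_transfer(target, palette):
    """Affine colour transfer (N x 3 arrays of colours): match the
    per-channel mean and standard deviation of `target` to `palette`.
    A single-Gaussian simplification of full GMM registration."""
    mu_t, mu_p = target.mean(axis=0), palette.mean(axis=0)
    std_t, std_p = target.std(axis=0), palette.std(axis=0)
    return (target - mu_t) / np.maximum(std_t, 1e-8) * std_p + mu_p

rng = np.random.default_rng(0)
target = rng.normal(0.3, 0.1, (1000, 3))   # darker source colours
palette = rng.normal(0.6, 0.2, (1000, 3))  # brighter palette colours
out = gaussian_colour_transfer(target, palette)
```

Because the transfer is just a handful of parameters (the two means and standard deviations), it can be stored and later reapplied or composed with other transfers, exactly the property the abstract highlights for the full parametric GMM-based function.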
A Novel Semantics and Feature Preserving Perspective for Content Aware Image Retargeting
There is an increasing requirement for efficient image retargeting techniques
to adapt the content to various forms of digital media. With rapid growth of
mobile communications and dynamic web page layouts, one often needs to resize
the media content to adapt to the desired display sizes. For various layouts of
web pages and the typically small screens of handheld portable devices,
important content of the original image gets obscured when it is resized by
uniform scaling. Thus, there is a need to resize images in a content-aware
manner that automatically discards irrelevant information and gives more
prominence to salient features. Several image retargeting techniques have been
proposed with the content of the input image in mind. However, these techniques
fail to be effective across the variety of images and target sizes encountered
in practice. The major problem is their inability to process images with
minimal visual distortion while also retaining the meaning conveyed by the
image. In this dissertation, we present a novel perspective on content-aware
image retargeting that is implementable in real time. We
introduce a novel method of analysing semantic information within the input
image while also maintaining the important and visually significant features.
We present the various nuances of our algorithm mathematically and logically,
and show that the results surpass state-of-the-art techniques.
Comment: 74 pages, 46 figures, Masters Thesis
Towards Fine-grained Human Pose Transfer with Detail Replenishing Network
Human pose transfer (HPT) is an emerging research topic with huge potential
in fashion design, media production, online advertising and virtual reality.
For these applications, the visual realism of fine-grained appearance details
is crucial for production quality and user engagement. However, existing HPT
methods often suffer from three fundamental issues: detail deficiency, content
ambiguity and style inconsistency, which severely degrade the visual quality
and realism of generated images. Aiming towards real-world applications, we
develop a more challenging yet practical HPT setting, termed as Fine-grained
Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and detail
replenishment. Concretely, we analyze the potential design flaws of existing
methods via an illustrative example, and establish the core FHPT methodology by
combining the ideas of content synthesis and feature transfer in a
mutually-guided fashion. Thereafter, we substantiate the proposed methodology
with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine
model training scheme. Moreover, we build up a complete suite of fine-grained
evaluation protocols to address the challenges of FHPT in a comprehensive
manner, including semantic analysis, structural detection and perceptual
quality assessment. Extensive experiments on the DeepFashion benchmark dataset
have verified the power of the proposed approach against state-of-the-art
works, with a 12\%-14\% gain on top-10 retrieval recall, 5\% higher joint
localization accuracy, and a nearly 40\% gain on face identity preservation.
Moreover, the evaluation results offer further insights into the subject
matter, which could inspire many promising future works along this direction.
Comment: IEEE TIP submission
Dynamic Temporal Alignment of Speech to Lips
Many speech segments in movies are re-recorded in a studio during
postproduction, to compensate for poor sound quality as recorded on location.
Manual alignment of the newly-recorded speech with the original lip movements
is a tedious task. We present an audio-to-video alignment method for automating
speech to lips alignment, stretching and compressing the audio signal to match
the lip movements. This alignment is based on deep audio-visual features,
mapping the lips video and the speech signal to a shared representation. Using
this shared representation we compute the lip-sync error between every short
speech period and every video frame, followed by the determination of the
optimal corresponding frame for each short sound period over the entire video
clip. We demonstrate successful alignment both quantitatively, using a human
perception-inspired metric, as well as qualitatively. The strongest advantage
of our audio-to-video approach is in cases where the original voice is unclear
and where a constant shift of the sound cannot give a perfect alignment. In
these cases state-of-the-art methods will fail.
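The per-period alignment step above amounts to a monotonic matching problem. A reasonable sketch, assuming the deep audio-visual lip-sync errors have already been computed into a (speech periods x video frames) cost matrix, is classic dynamic time warping; the paper's actual feature extraction and matching details are not reproduced here.

```python
import numpy as np

def dtw_align(cost):
    """Dynamic time warping over a lip-sync error matrix: return the
    minimal-cost monotonic path, letting the audio stretch (repeat a
    frame index) or compress (skip ahead) relative to the video."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    path, i, j = [], n, m          # backtrack the optimal matching
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy cost: matching index pairs are cheap, so the diagonal is optimal.
cost = np.ones((4, 4)) - np.eye(4)
path = dtw_align(cost)
```

The recovered path assigns each short sound period its best video frame, which is exactly the correspondence needed to stretch or compress the audio.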
Elastic Functional Coding of Riemannian Trajectories
Visual observations of dynamic phenomena, such as human actions, are often
represented as sequences of smoothly-varying features. In cases where the
feature spaces can be structured as Riemannian manifolds, the corresponding
representations become trajectories on manifolds. Analysis of these
trajectories is challenging due to non-linearity of underlying spaces and
high-dimensionality of trajectories. In vision problems, given the nature of
the physical systems involved, these phenomena are better characterized on a
low-dimensional manifold than in the full space of Riemannian trajectories. For
instance, in data involving human action analysis, if one does not impose the
physical constraints of the human body, the resulting representation space will
have highly redundant features. Learning an effective, low-dimensional
embedding for action representations will have a huge impact in the areas of
search and retrieval, visualization, learning, and recognition. The difficulty
lies in inherent non-linearity of the domain and temporal variability of
actions that can distort any traditional metric between trajectories. To
overcome these issues, we use the framework based on transported square-root
velocity fields (TSRVF); this framework has several desirable properties,
including a rate-invariant metric and vector space representations. We propose
to learn an embedding such that each action trajectory is mapped to a single
point in a low-dimensional Euclidean space, and the trajectories that differ
only in temporal rates map to the same point. We utilize the TSRVF
representation, and accompanying statistical summaries of Riemannian
trajectories, to extend existing coding methods such as PCA, KSVD and Label
Consistent KSVD to Riemannian trajectories or more generally to Riemannian
functions.
Comment: Under major revision at IEEE T-PAMI, 201
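To give a concrete sense of the representation, here is the square-root velocity function in the Euclidean (flat-space) special case, not the full transported version the paper uses on manifolds; the function names are ours. A useful sanity check is that the squared L2 norm of q integrates to the path length, which is what makes the representation insensitive to how fast the curve is traversed.

```python
import numpy as np

def srvf(traj, dt):
    """Square-root velocity function of a discretised trajectory (T x d):
    q(t) = v(t) / sqrt(||v(t)||). Euclidean special case of the TSRVF."""
    v = np.gradient(traj, dt, axis=0)            # finite-difference velocity
    speed = np.linalg.norm(v, axis=1)
    return v / np.sqrt(np.maximum(speed, 1e-12))[:, None]

# A straight line from (0, 0) to (3, 4): path length 5.
t = np.linspace(0.0, 1.0, 2001)
line = np.stack([3.0 * t, 4.0 * t], axis=1)
q = srvf(line, t[1] - t[0])

# ||q||^2 is constant here, so its mean over [0, 1] equals the integral,
# i.e. the path length of the curve.
length = np.sum(q**2, axis=1).mean()
```

Under this representation, the distance between two SRVFs (after optimal re-parametrisation) is rate-invariant, which is the property that lets trajectories differing only in temporal rate map to the same point in the learned embedding.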