Global Motion Estimation and Its Applications
In this chapter, global motion estimation and its applications are presented. First, we give the definitions of global motion and global motion estimation. Second, we provide the parametric representations of global motion models. Third, we cover global motion estimation approaches, including pixel-domain global motion estimation, hierarchical globa
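As a concrete illustration of a parametric global motion model, the sketch below applies the common 6-parameter affine model to pixel coordinates; the function name and parameter values are hypothetical, chosen only to demonstrate the warp.

```python
import numpy as np

# 6-parameter affine global motion model:
#   x' = a0*x + a1*y + a2
#   y' = a3*x + a4*y + a5

def affine_global_motion(points, params):
    """Warp Nx2 pixel coordinates with a 6-parameter affine model."""
    a0, a1, a2, a3, a4, a5 = params
    A = np.array([[a0, a1], [a3, a4]])  # linear part (rotation/scale/shear)
    t = np.array([a2, a5])              # translation part
    return points @ A.T + t

pts = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
# Pure translation by (2, 3): identity linear part plus translation.
warped = affine_global_motion(pts, (1, 0, 2, 0, 1, 3))
```

Fitting the six parameters to observed motion vectors (e.g., by least squares) is what the estimation approaches in this chapter address.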
TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation
Video frame interpolation (VFI) aims to synthesize an intermediate frame
between two consecutive frames. State-of-the-art approaches usually adopt a
two-step solution, which includes 1) generating locally-warped pixels by
flow-based motion estimations, 2) blending the warped pixels to form a full
frame through deep neural synthesis networks. However, due to the inconsistent
warping from the two consecutive frames, the warped features for new frames are
usually not aligned, which leads to distorted and blurred frames, especially
when large and complex motions occur. To solve this issue, in this paper we
propose a novel Trajectory-aware Transformer for Video Frame Interpolation
(TTVFI). In particular, we formulate the warped features with inconsistent
motions as query tokens, and formulate relevant regions in a motion trajectory
from two original consecutive frames into keys and values. Self-attention is
learned on relevant tokens along the trajectory to blend the pristine features
into intermediate frames through end-to-end training. Experimental results
demonstrate that our method outperforms other state-of-the-art methods in four
widely-used VFI benchmarks. Both code and pre-trained models will be released soon.
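The core idea of the query/key/value formulation above can be sketched as attention over tokens sampled along a motion trajectory; this is a minimal NumPy stand-in, not the paper's learned multi-head transformer, and all shapes and names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trajectory_attention(query, traj_tokens):
    """Blend features along a motion trajectory via scaled dot-product attention.

    query:       (d,)   warped feature at one position of the intermediate frame
    traj_tokens: (T, d) features sampled along the trajectory in the two
                 original frames; used as both keys and values here
    """
    d = query.shape[-1]
    scores = traj_tokens @ query / np.sqrt(d)  # similarity to each trajectory token
    weights = softmax(scores)                  # attention weights along the trajectory
    return weights @ traj_tokens               # blended output feature

rng = np.random.default_rng(0)
q = rng.normal(size=8)          # one warped-feature query token
tokens = rng.normal(size=(5, 8))  # five tokens along the trajectory
out = trajectory_attention(q, tokens)
```

In TTVFI proper, the queries, keys, and values are learned projections and the blending is trained end-to-end; this sketch only shows how attention restricted to trajectory tokens combines the pristine features.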
Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution
Existing real-world video super-resolution (VSR) methods focus on designing a
general degradation pipeline for open-domain videos while ignoring data-intrinsic
characteristics, which strongly limits their performance when applied
to specific domains (e.g., animation videos). In this paper, we thoroughly
explore the characteristics of animation videos and leverage the rich priors in
real-world animation data for a more practical animation VSR model. In
particular, we propose a multi-scale Vector-Quantized Degradation model for
animation video Super-Resolution (VQD-SR) to decompose the local details from
global structures and transfer the degradation priors in real-world animation
videos to a learned vector-quantized codebook for degradation modeling. A
rich-content Real Animation Low-quality (RAL) video dataset is collected for
extracting the priors. We further propose a data enhancement strategy for
high-resolution (HR) training videos based on our observation that existing HR
videos are mostly collected from the Web which contains conspicuous compression
artifacts. The proposed strategy effectively raises the upper bound of animation
VSR performance, regardless of the specific VSR model. Experimental results
demonstrate the superiority of the proposed VQD-SR over state-of-the-art
methods, through extensive quantitative and qualitative evaluations on the
latest animation video super-resolution benchmark. The code and pre-trained
models can be downloaded at https://github.com/researchmm/VQD-SR.
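The vector-quantized codebook lookup at the heart of such degradation modeling can be sketched as nearest-neighbor quantization; the toy codebook and features below are made up for illustration, whereas VQD-SR learns its multi-scale codebook from the RAL dataset.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (N, d) continuous feature vectors
    codebook: (K, d) learned code vectors
    Returns the quantized (N, d) features and the chosen code indices.
    """
    # squared L2 distance between every feature and every code: (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # toy learned codes
feats = np.array([[0.1, -0.1], [0.9, 1.2]])
quant, idx = vector_quantize(feats, codebook)
```

Replacing continuous degradation features with their nearest codes restricts the model to degradation patterns actually observed in real animation data.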
Dual Relation Alignment for Composed Image Retrieval
Composed image retrieval, a task involving the search for a target image
using a reference image and a complementary text as the query, has witnessed
significant advancements owing to the progress made in cross-modal modeling.
Unlike the general image-text retrieval problem with only one alignment
relation, i.e., image-text, we argue for the existence of two types of
relations in composed image retrieval. The explicit relation pertains to the
reference image & complementary text-target image, which is commonly exploited
by existing methods. Besides this intuitive relation, the observations during
our practice have uncovered another implicit yet crucial relation, i.e.,
reference image & target image-complementary text, since we found that the
complementary text can be inferred by studying the relation between the target
image and the reference image. Regrettably, existing methods largely focus on
leveraging the explicit relation to learn their networks, while overlooking the
implicit relation. In response to this weakness, we propose a new framework for
composed image retrieval, termed dual relation alignment, which integrates both
explicit and implicit relations to fully exploit the correlations among the
triplets. Specifically, we first design a vision compositor to fuse the reference
and target images; the resulting representation then serves two
roles: (1) a counterpart for semantic alignment with the complementary text and
(2) a compensation for the complementary text to boost the explicit relation
modeling, thereby implanting the implicit relation into the alignment learning.
Our method is evaluated on two popular datasets, CIRR and FashionIQ, through
extensive experiments. The results confirm the effectiveness of our
dual-relation learning in substantially enhancing composed image retrieval
performance.
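The two alignment relations can be made concrete with a toy similarity computation; the embeddings and the simple add/subtract fusions below are stand-ins for the learned encoders and the vision compositor, chosen only to show which triplet elements each relation aligns.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings; in practice these come from learned encoders.
ref_img = np.array([1.0, 0.0, 0.0])
text    = np.array([0.0, 1.0, 0.0])
tgt_img = np.array([0.7, 0.7, 0.0])

# Explicit relation: (reference image + complementary text) -> target image.
explicit_query = ref_img + text          # stand-in for a learned fusion
explicit_score = cosine(explicit_query, tgt_img)

# Implicit relation: (reference image + target image) -> complementary text.
implicit_query = tgt_img - ref_img       # stand-in for the vision compositor
implicit_score = cosine(implicit_query, text)
```

Training on both scores, rather than the explicit one alone, is the essence of the dual relation alignment proposed above.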
Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera
Under-display camera (UDC) systems are the foundation of full-screen display
devices in which the lens mounts under the display. The pixel array of
light-emitting diodes used for display diffracts and attenuates incident light,
causing various degradations as the light intensity changes. Unlike general
video restoration, which recovers video by treating different degradation
factors equally, video restoration for UDC systems is more challenging in that
it concerns removing diverse degradations over time while preserving temporal
consistency. In this paper, we introduce a novel video restoration network,
called DRNet, specifically designed for UDC systems. It employs a set of
Decoupling Attention Modules (DAM) that effectively separate the various video
degradation factors. More specifically, a soft mask generation function is
proposed to formulate each frame into flare and haze based on the diffraction
arising from incident light of different intensities, followed by the proposed
flare and haze removal components that leverage long- and short-term feature
learning to handle the respective degradations. Such a design offers a
targeted and effective solution for eliminating various types of degradation in
UDC systems. We further extend our design to multiple scales to overcome the
scale changes of degradation that often occur in long-range videos. To
demonstrate the superiority of DRNet, we propose a large-scale UDC video
benchmark by gathering HDR videos and generating realistically degraded videos
using the point spread function measured by a commercial UDC system. Extensive
quantitative and qualitative evaluations demonstrate the superiority of
DRNet compared to other state-of-the-art video restoration and UDC image
restoration methods. Code is available at
https://github.com/ChengxuLiu/DDRNet.git
Comment: AAAI 202
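The flare/haze decoupling via a soft mask can be sketched as follows; the actual mask-generation function in DRNet is learned inside the Decoupling Attention Modules, so the sigmoid-on-intensity rule and its constants here are purely an illustrative assumption.

```python
import numpy as np

def decouple_degradations(frame, k=10.0, thresh=0.8):
    """Split a frame into flare- and haze-dominated parts with a soft mask.

    frame: array of luminance values in [0, 1]
    k, thresh: hypothetical steepness/threshold for the soft mask
    """
    mask = 1.0 / (1.0 + np.exp(-k * (frame - thresh)))  # ~1 in bright regions
    flare_part = mask * frame          # bright regions: flare-removal branch
    haze_part = (1.0 - mask) * frame   # darker regions: haze-removal branch
    return flare_part, haze_part, mask

frame = np.array([[0.05, 0.5], [0.95, 1.0]])  # toy luminance values
flare, haze, mask = decouple_degradations(frame)
```

Because the mask is soft, the two components always sum back to the original frame, so each branch can specialize on its degradation without losing information.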