H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions
Capitalizing on the rapid development of neural networks, recent video frame
interpolation (VFI) methods have achieved notable improvements. However, they
still fall short for real-world videos containing large motions. The complex
deformation and/or occlusion caused by large motions make frame interpolation
extremely difficult. In this paper, we propose a simple yet effective solution,
H-VFI, to deal with large motions in video frame interpolation. H-VFI
contributes a hierarchical video interpolation transformer (HVIT) that learns a
deformable kernel in a coarse-to-fine manner across multiple scales. The learnt
deformable kernel is then used to convolve the input frames and predict the
interpolated frame. Starting from the smallest scale, H-VFI successively
updates the deformable kernel with a residual, based on previously predicted
kernels, intermediate interpolation results, and hierarchical features from the
transformer. A transformer block then predicts biases and masks to refine the
final outputs based on the interpolated results. The advantage of such a
progressive approximation is that the large-motion frame interpolation problem
is decomposed into several relatively simpler sub-tasks, which enables very
accurate final predictions. Another noteworthy contribution of our paper is a
large-scale, high-quality dataset, YouTube200K, which contains videos depicting
a great variety of scenarios captured at high resolution and high frame rate.
Extensive experiments on multiple frame interpolation benchmarks validate that
H-VFI outperforms existing state-of-the-art methods, especially for videos
with large motions.
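To make the progressive kernel update concrete, here is a minimal PyTorch sketch of the coarse-to-fine residual scheme: a per-pixel kernel field predicted at the smallest scale is upsampled and refined with a residual at each finer scale, then used to convolve an input frame. All module names and sizes are assumptions, the HVIT transformer is replaced by generic feature maps, and only one input frame is convolved for brevity.

```python
# Hypothetical sketch of a coarse-to-fine residual kernel update (not H-VFI's
# exact architecture: the transformer is replaced by plain conv features).
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 5  # per-pixel kernel size (assumption)

class KernelRefiner(nn.Module):
    """Predicts a residual update to per-pixel blending kernels at one scale."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + K * K, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, K * K, 3, padding=1),
        )

    def forward(self, feat, kernel_prev):
        # Residual update on top of the (upsampled) coarser-scale kernel.
        return kernel_prev + self.net(torch.cat([feat, kernel_prev], dim=1))

def apply_kernel(frame, kernel):
    """Convolve the frame with per-pixel kernels (softmax-normalized)."""
    b, c, h, w = frame.shape
    patches = F.unfold(frame, K, padding=K // 2)      # (b, c*K*K, h*w)
    patches = patches.view(b, c, K * K, h, w)
    weights = F.softmax(kernel, dim=1).unsqueeze(1)   # (b, 1, K*K, h, w)
    return (patches * weights).sum(dim=2)             # (b, c, h, w)

# Coarse-to-fine loop over scales, smallest first.
feats = [torch.randn(1, 32, 64 >> s, 64 >> s) for s in (2, 1, 0)]
frame = torch.randn(1, 3, 64, 64)
kernel = torch.zeros(1, K * K, 16, 16)
refiners = [KernelRefiner(32) for _ in feats]
for refiner, feat in zip(refiners, feats):
    kernel = F.interpolate(kernel, size=feat.shape[-2:], mode="bilinear")
    kernel = refiner(feat, kernel)
out = apply_kernel(frame, kernel)  # full-resolution interpolated prediction
```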
Efficient Convolution and Transformer-Based Network for Video Frame Interpolation
Video frame interpolation is an increasingly important research task with
several key industrial applications in the video coding, broadcast and
production sectors. Recently, transformers have been introduced to the field,
resulting in substantial performance gains. However, this comes at the cost of
greatly increased memory usage and training and inference time. In this paper,
a novel method integrating a transformer encoder and convolutional features is
proposed. This network reduces the memory burden by close to 50% and runs up to
four times faster during inference compared to existing transformer-based
interpolation methods. A dual-encoder architecture is introduced which combines
the strengths of convolutions in modelling local correlations with those of
transformers for long-range dependencies. Quantitative evaluations are
conducted on various benchmarks with complex motion to showcase the robustness
of the proposed method, achieving competitive performance compared to
state-of-the-art interpolation networks.
Comment: Paper accepted at IEEE ICIP 2023: International Conference on Image
Processing 2023.
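A minimal PyTorch sketch of such a dual-encoder block follows, assuming fusion by channel concatenation; channel counts, head counts, and the 1x1 fusion layer are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a dual-encoder block: a conv branch for local correlations and a
# transformer encoder for long-range dependencies (all sizes are assumptions).
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=ch, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):                      # x: (b, ch, h, w)
        local = self.conv_branch(x)            # local correlations
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (b, h*w, c) for attention
        glob = self.transformer(tokens).transpose(1, 2).view(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))

x = torch.randn(1, 64, 32, 32)
y = DualEncoder()(x)  # fused local + global features, same spatial size
```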
JNMR: Joint Non-linear Motion Regression for Video Frame Interpolation
Video frame interpolation (VFI) aims to generate intermediate frames by warping
learnable motions from bidirectional historical references. Most existing works
utilize a spatio-temporal semantic information extractor to realize motion
estimation and interpolation modeling. However, they insufficiently consider
whether the generated intermediate motions are physically plausible. In this
paper, we reformulate VFI as a Joint Non-linear Motion Regression (JNMR)
strategy to model complicated inter-frame motions. Specifically, the motion
trajectory between the target frame and the multiple reference frames is
regressed by a temporal concatenation of multi-stage quadratic models. A
ConvLSTM is adopted to construct this joint distribution of complete motions
along the temporal dimension. Moreover, the feature learning network is
designed to optimize for the joint regression modeling. A coarse-to-fine
synthesis enhancement module is also employed to learn visual dynamics at
different resolutions through repetitive regression and interpolation.
Experimental results on VFI show the effectiveness and significant improvement
of joint motion regression compared with state-of-the-art methods. The code is
available at https://github.com/ruhig6/JNMR.
Comment: Accepted by IEEE Transactions on Image Processing (TIP).
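The quadratic model at the core of such motion regression can be written down directly. Below is a minimal sketch, assuming flows from the center frame to its two neighbours; the multi-stage temporal concatenation and the ConvLSTM joint modeling described in the paper are omitted.

```python
# Minimal sketch of a quadratic motion model for VFI: fit per-pixel velocity
# and acceleration from two flows and evaluate the trajectory at time t.
import torch

def quadratic_flow(flow_0_to_m1, flow_0_to_p1, t):
    """Flow from frame 0 to time t under constant per-pixel acceleration.

    Model: f(t) = v*t + 0.5*a*t^2, constrained by
      f(1) = flow_0_to_p1 and f(-1) = flow_0_to_m1.
    """
    v = 0.5 * (flow_0_to_p1 - flow_0_to_m1)  # per-pixel velocity
    a = flow_0_to_p1 + flow_0_to_m1          # per-pixel acceleration
    return v * t + 0.5 * a * t * t

f_back = torch.randn(1, 2, 32, 32)  # flow frame 0 -> frame -1
f_fwd = torch.randn(1, 2, 32, 32)   # flow frame 0 -> frame +1
f_mid = quadratic_flow(f_back, f_fwd, t=0.5)  # motion to the mid frame
```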
TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation
Video frame interpolation (VFI) aims to synthesize an intermediate frame
between two consecutive frames. State-of-the-art approaches usually adopt a
two-step solution: 1) generate locally warped pixels via flow-based motion
estimation, and 2) blend the warped pixels into a full frame through a deep
neural synthesis network. However, due to inconsistent warping from the two
consecutive frames, the warped features for new frames are usually not
aligned, which leads to distorted and blurred frames, especially when large
and complex motions occur. To solve this issue, we propose a novel
Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). In
particular, we formulate the warped features with inconsistent motions as
query tokens, and formulate relevant regions along the motion trajectory from
the two original consecutive frames as keys and values. Self-attention is
learned on relevant tokens along the trajectory to blend the pristine features
into intermediate frames through end-to-end training. Experimental results
demonstrate that our method outperforms other state-of-the-art methods on four
widely used VFI benchmarks. Both code and pre-trained models will be released
soon.
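A minimal sketch of the query/key/value formulation follows, assuming trajectory sampling has already produced a fixed number of candidate tokens per pixel; the names, shapes, and single-head attention are illustrative assumptions, not TTVFI's released implementation.

```python
# Sketch of trajectory-aware attention: each pixel of the warped feature is a
# query; features sampled along its motion trajectory are the keys/values.
import torch

def trajectory_attention(q_feat, kv_feats):
    """q_feat: (b, c, h, w); kv_feats: (b, n, c, h, w) trajectory samples."""
    b, n, c, h, w = kv_feats.shape
    q = q_feat.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)        # 1 query/pixel
    kv = kv_feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, n, c)  # n tokens/pixel
    attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)
    out = (attn @ kv).reshape(b, h, w, c).permute(0, 3, 1, 2)
    return out

q = torch.randn(1, 16, 8, 8)           # inconsistently warped feature (query)
kv = torch.randn(1, 4, 16, 8, 8)       # 4 samples along the motion trajectory
blended = trajectory_attention(q, kv)  # trajectory-consistent blended feature
```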
Flow Guidance Deformable Compensation Network for Video Frame Interpolation
Motion-based video frame interpolation (VFI) methods have made remarkable
progress with the development of deep convolutional networks over the past
years. However, their performance is often jeopardized by inaccurate flow-map
estimation, especially in cases of large motion and occlusion. In this paper,
we propose a flow guidance deformable compensation network (FGDCN) to overcome
the drawbacks of existing motion-based methods. FGDCN decomposes the frame
sampling process into two steps: a flow step and a deformation step.
Specifically, the flow step utilizes a coarse-to-fine flow estimation network
to directly estimate the intermediate flows and simultaneously synthesize an
anchor frame. To ensure the accuracy of the estimated flow, a distillation
loss and a task-oriented loss are jointly employed in this step. Under the
guidance of the flow priors learned in step one, the deformation step employs
a pyramid deformable compensation network to compensate for the missing
details of the flow step. In addition, a pyramid loss is proposed to supervise
the model in both the image and frequency domains. Experimental results show
that the proposed algorithm achieves excellent performance on various datasets
with fewer parameters.
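The two-step sampling can be sketched as follows, assuming torchvision's DeformConv2d for the deformable step and a plain backward warp for the flow step; the pyramid structure, losses, and module wiring are assumptions, with only the flow-then-deformation structure taken from the abstract.

```python
# Sketch of two-step sampling: (1) backward-warp a source frame with an
# estimated intermediate flow to get an anchor frame, (2) refine it with a
# flow-guided deformable convolution whose offsets come from the flow prior.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

def backward_warp(frame, flow):
    """Sample `frame` at locations displaced by `flow` (in pixels)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()   # (h, w, 2), x first
    grid = grid + flow.permute(0, 2, 3, 1)         # add (dx, dy) per pixel
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1  # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(frame, grid, align_corners=True)

class Compensation(nn.Module):
    """Step 2: deformable compensation guided by the flow prior."""
    def __init__(self, ch=3, k=3):
        super().__init__()
        self.offsets = nn.Conv2d(2, 2 * k * k, 3, padding=1)  # from flow
        self.dconv = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, anchor, flow):
        return anchor + self.dconv(anchor, self.offsets(flow))  # residual detail

frame0 = torch.randn(1, 3, 32, 32)
flow_t0 = torch.randn(1, 2, 32, 32)      # step 1: intermediate flow t -> 0
anchor = backward_warp(frame0, flow_t0)  # anchor frame from the flow step
out = Compensation()(anchor, flow_t0)    # step 2: compensated output
```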
- …