Video frame interpolation (VFI) aims to synthesize an intermediate frame
between two consecutive frames. State-of-the-art approaches usually adopt a
two-step solution: 1) generating locally warped pixels by flow-based motion
estimation, and 2) blending the warped pixels to form a full frame through
deep neural synthesis networks. However, due to the inconsistent
warping from the two consecutive frames, the warped features for new frames are
usually not aligned, which leads to distorted and blurred frames, especially
when large and complex motions occur. To address this issue, we
propose a novel Trajectory-aware Transformer for Video Frame Interpolation
(TTVFI). In particular, we formulate the warped features with inconsistent
motions as query tokens, and the relevant regions along the motion trajectory
in the two original consecutive frames as keys and values. Self-attention is
learned on relevant tokens along the trajectory to blend the pristine features
into intermediate frames through end-to-end training. Experimental results
demonstrate that our method outperforms other state-of-the-art methods on four
widely used VFI benchmarks. Both code and pre-trained models will be released
soon.
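
The query/key/value formulation can be illustrated with a minimal sketch of scaled dot-product attention for a single token. This is not the TTVFI implementation: the function name `trajectory_attention`, the two-dimensional toy features, and the use of plain Python lists are all assumptions made for illustration. The query stands in for a warped (possibly misaligned) feature, while the keys and values stand in for features sampled along a motion trajectory in the two original frames.

```python
import math


def trajectory_attention(query, keys, values):
    """Blend `values` with softmax(query . key / sqrt(d)) weights.

    Hypothetical helper for illustration only:
    query  -- one warped-feature token (list of floats)
    keys   -- tokens from the original frames along the trajectory
    values -- features paired one-to-one with `keys`
    """
    d = len(query)
    # Scaled dot-product similarity between the query and every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Numerically stable softmax over the trajectory tokens.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original-frame features.
    blended = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return blended, weights


# Toy data (assumed): the query resembles the first key more than the second.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

blended, weights = trajectory_attention(query, keys, values)
print(weights, blended)
```

In this toy case the first trajectory token receives the larger weight, so the blended output leans toward its value. A full model would presumably apply learned projections and operate on many tokens at once, which this sketch omits.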