As a critical step in video super-resolution (VSR), inter-frame alignment
significantly impacts overall performance. However, accurate pixel-level
alignment is challenging due to the intricate, interwoven motion in videos.
video. In response to this issue, we introduce a novel paradigm for VSR named
Semantic Lens, predicated on semantic priors drawn from degraded videos.
Specifically, each video is modeled as instances, events, and scenes via a
Semantic Extractor. These semantics assist the Pixel Enhancer in understanding
the recovered content and generating more realistic visual results. The
distilled global semantics embody the scene information of each frame, while
the instance-specific semantics assemble the spatio-temporal contexts related to
each instance. Furthermore, we devise a Semantics-Powered Attention
Cross-Embedding (SPACE) block, composed of a Global Perspective Shifter (GPS)
and an Instance-Specific Semantic Embedding Encoder (ISEE), to bridge
pixel-level features with semantic knowledge. Concretely, the GPS module
generates pairs of affine transformation parameters for pixel-level feature
modulation conditioned on global semantics. After that, the ISEE module
harnesses the attention mechanism to align adjacent frames in the
instance-centric semantic space. In addition, we incorporate a simple yet
effective pre-alignment module to alleviate the difficulty of model training.
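
To make the SPACE block concrete, the snippet below gives a minimal
PyTorch-style sketch of the two components described above: GPS predicting
per-channel affine parameters from global semantics, and ISEE attending from
pixel features to instance-specific semantics gathered over adjacent frames.
The module names, tensor shapes, and layer choices are illustrative assumptions
for exposition, not the authors' released implementation.

# Minimal sketch of the SPACE block, assuming a PyTorch implementation.
# All names, shapes, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class GlobalPerspectiveShifter(nn.Module):
    """GPS: predicts per-channel affine parameters (gamma, beta) from the
    scene-level semantic vector and modulates pixel-level features."""

    def __init__(self, sem_dim: int, channels: int):
        super().__init__()
        self.to_affine = nn.Linear(sem_dim, 2 * channels)

    def forward(self, feat, global_sem):
        # feat: (B, C, H, W) pixel-level features
        # global_sem: (B, sem_dim) global semantics of the current frame
        gamma, beta = self.to_affine(global_sem).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta                  # affine feature modulation

class InstanceSemanticEmbeddingEncoder(nn.Module):
    """ISEE: aligns adjacent-frame information via cross-attention in an
    instance-centric semantic space (simplified single-head variant)."""

    def __init__(self, channels: int, inst_dim: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(inst_dim, channels)
        self.v = nn.Linear(inst_dim, channels)

    def forward(self, cur_feat, inst_sem):
        # cur_feat: (B, H*W, C) current-frame features (queries)
        # inst_sem: (B, N, inst_dim) instance-specific semantics aggregated
        #           over adjacent frames (keys/values)
        q, k, v = self.q(cur_feat), self.k(inst_sem), self.v(inst_sem)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return cur_feat + attn @ v                  # residual semantic alignment

The sketch only illustrates the data flow; the paper's design may differ, for
example through multi-head attention or spatially varying modulation.
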
Extensive experiments demonstrate the superiority of our model over existing
state-of-the-art VSR methods.

Comment: Accepted to AAAI 2024