Video compression has long been an active research area, and many
traditional and deep video compression methods have been proposed. These
methods typically rely on signal prediction theory, improving compression
performance through highly efficient intra- and inter-prediction strategies
while compressing video frames one by one. In this paper, we propose a novel
model-based video compression (MVC) framework that regards scenes as the
fundamental units for video sequences. Our proposed MVC directly models the
intensity variation of the entire video sequence in one scene, seeking
non-redundant representations instead of reducing redundancy through
spatio-temporal predictions. To achieve this, we employ implicit neural
representation as our basic modeling architecture. To improve the efficiency of
video modeling, we first propose context-related spatial positional embedding
and frequency domain supervision for spatial context enhancement. To capture
temporal correlation, we design a scene flow constraint mechanism and a
temporal contrastive loss. Extensive experimental results demonstrate that our
method achieves up to a 20\% bitrate reduction compared to the latest video
coding standard H.266 and is more efficient in decoding than existing video
coding strategies.
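
As a rough illustration of two ingredients named above, positional embedding of
coordinates and supervision in the frequency domain, the following NumPy sketch
may help. It is not the MVC implementation: the function names, the
power-of-two sinusoidal frequencies, and the L1 distance between FFT spectra
are all assumptions made for illustration, and the context-related embedding
proposed in this work is not reproduced here.

```python
import numpy as np


def positional_embedding(coords, num_freqs=4):
    """Generic sinusoidal embedding (an assumption, not the paper's design).

    coords: array of shape (N, D) with values normalized to [0, 1].
    Returns an array of shape (N, D * 2 * num_freqs).
    """
    coords = np.asarray(coords, dtype=np.float64)
    # Frequencies pi, 2*pi, 4*pi, ... (hypothetical choice).
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = coords[..., None] * freqs  # shape (N, D, num_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(coords.shape[0], -1)


def frequency_loss(pred, target):
    """Mean absolute difference between the 2-D FFT spectra of two frames.

    pred, target: arrays of shape (H, W). The L1 form is an assumption.
    """
    pred_f = np.fft.fft2(pred)
    target_f = np.fft.fft2(target)
    return float(np.mean(np.abs(pred_f - target_f)))
```

In an implicit neural representation, an embedding of this kind would typically
be fed to a small network that predicts pixel intensities, and a spectral term
such as `frequency_loss` would be added to the usual pixel-domain
reconstruction loss during fitting.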