For any video codecs, the coding efficiency highly relies on whether the
current signal to be encoded can find the relevant contexts from the previous
reconstructed signals. Traditional codec has verified more contexts bring
substantial coding gain, but in a time-consuming manner. However, for the
emerging neural video codec (NVC), its contexts are still limited, leading to
low compression ratio. To boost NVC, this paper proposes increasing the context
diversity in both temporal and spatial dimensions. First, we guide the model to
learn hierarchical quality patterns across frames, which enriches long-term and
yet high-quality temporal contexts. Furthermore, to tap the potential of
optical flow-based coding framework, we introduce a group-based offset
diversity where the cross-group interaction is proposed for better context
mining. In addition, this paper also adopts a quadtree-based partition to
increase spatial context diversity when encoding the latent representation in
parallel. Experiments show that our codec obtains 23.5% bitrate saving over
previous SOTA NVC. Better yet, our codec has surpassed the under-developing
next generation traditional codec/ECM in both RGB and YUV420 colorspaces, in
terms of PSNR. The codes are at https://github.com/microsoft/DCVC.Comment: Accepted by CVPR 2023. Codes are at https://github.com/microsoft/DCV