We address the problem of efficiently compressing video for conferencing-type
applications. We build on recent approaches based on image animation, which can
achieve good reconstruction quality at very low bitrate by representing face
motions with a compact set of sparse keypoints. However, these methods encode
video in a frame-by-frame fashion, i.e., each frame is reconstructed from a
reference frame, which limits the reconstruction quality when more bandwidth is
available. Instead, we propose a predictive coding scheme which uses image
animation as a predictor, and codes the residual with respect to the actual
target frame. The residuals can in turn be coded in a predictive manner, thus
efficiently removing temporal dependencies. Our experiments indicate a
significant bitrate gain, in excess of 70% compared to the HEVC video standard
and over 30% compared to VVC, on a dataset of talking-head videos.

Comment: Accepted paper: ICIP 202
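
The predictive scheme described in the abstract can be illustrated with a minimal closed-loop sketch. This is not the authors' codec: the `predict` callable is a hypothetical stand-in for the image-animation predictor, frames are flattened lists of pixel values, and a uniform scalar quantizer (with an assumed step size `step`) replaces whatever residual codec the paper actually uses. It only shows the two levels of prediction: the frame is predicted by animation, and the resulting residual is predicted from the previously decoded residual.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # a flattened frame; a simplifying assumption for this sketch


def quantize(values: Sequence[float], step: float) -> List[int]:
    """Uniform scalar quantization to integer symbols."""
    return [round(v / step) for v in values]


def dequantize(symbols: Sequence[int], step: float) -> Frame:
    """Inverse of quantize, up to the quantization error."""
    return [s * step for s in symbols]


def predictive_codec(
    frames: Sequence[Frame],
    predict: Callable[[int], Frame],  # hypothetical animation-based predictor
    step: float = 4.0,                # assumed quantizer step size
) -> Tuple[List[List[int]], List[Frame]]:
    """Return the symbols to transmit and the decoder-side reconstructions."""
    prev_residual = [0.0] * len(frames[0])  # last residual as seen by the decoder
    all_symbols: List[List[int]] = []
    reconstructions: List[Frame] = []

    for t, target in enumerate(frames):
        prediction = predict(t)

        # Residual of the animation-based prediction w.r.t. the actual target frame.
        residual = [x - p for x, p in zip(target, prediction)]

        # Temporal prediction of the residual from the previously decoded residual.
        innovation = [r - pr for r, pr in zip(residual, prev_residual)]
        symbols = quantize(innovation, step)

        # Decoder side: rebuild the residual, then the frame.
        decoded_residual = [
            pr + d for pr, d in zip(prev_residual, dequantize(symbols, step))
        ]
        reconstructions.append(
            [p + r for p, r in zip(prediction, decoded_residual)]
        )

        all_symbols.append(symbols)
        prev_residual = decoded_residual  # closed loop: encoder tracks the decoder

    return all_symbols, reconstructions
```

Because the encoder predicts from the decoded residual rather than the original one, quantization errors do not accumulate over time: in this sketch the per-pixel reconstruction error stays within half a quantizer step.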