1 research outputs found
Mask Propagation for Efficient Video Semantic Segmentation
Video Semantic Segmentation (VSS) involves assigning a semantic label to each
pixel in a video sequence. Prior work in this field has demonstrated promising
results by extending image semantic segmentation models to exploit temporal
relationships across video frames; however, these approaches often incur
significant computational costs. In this paper, we propose an efficient mask
propagation framework for VSS, called MPVSS. Our approach first employs a
strong query-based image segmentor on sparse key frames to generate accurate
binary masks and class predictions. We then design a flow estimation module
utilizing the learned queries to generate a set of segment-aware flow maps,
each associated with a mask prediction from the key frame. Finally, the
mask-flow pairs are warped to serve as the mask predictions for the non-key
frames. By reusing predictions from key frames, we circumvent the need to
process a large volume of video frames individually with resource-intensive
segmentors, alleviating temporal redundancy and significantly reducing
computational costs. Extensive experiments on VSPW and Cityscapes demonstrate
that our mask propagation framework achieves SOTA accuracy and efficiency
trade-offs. For instance, our best model with Swin-L backbone outperforms the
SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW
dataset. Moreover, our framework reduces up to 4x FLOPs compared to the
per-frame Mask2Former baseline with only up to 2% mIoU degradation on the
Cityscapes validation set. Code is available at
https://github.com/ziplab/MPVSS.Comment: NeurIPS 202