Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow
A major challenge for video semantic segmentation is the lack of labeled
data. In most benchmark datasets, only one frame of a video clip is annotated,
which makes most supervised methods fail to utilize information from the rest
of the frames. To exploit the spatio-temporal information in videos, many
previous works use pre-computed optical flows, which encode the temporal
consistency to improve the video segmentation. However, video segmentation
and optical flow estimation are still treated as two separate tasks. In this
paper, we propose a novel framework for joint video semantic segmentation and
optical flow estimation. Semantic segmentation brings semantic information to
handle occlusion for more robust optical flow estimation, while the
non-occluded optical flow provides accurate pixel-level temporal
correspondences to guarantee the temporal consistency of the segmentation.
Moreover, our framework is able to utilize both labeled and unlabeled frames in
the video through joint training, while requiring no additional computation
at inference. Extensive experiments show that the proposed model makes the
video semantic segmentation and optical flow estimation benefit from each other
and outperforms existing methods under the same settings in both tasks.
Comment: Published in AAAI 202
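
The abstract describes two linked ideas: occluded pixels are excluded from the flow correspondences, and the remaining correspondences are used to keep segmentation predictions consistent across frames. The sketch below illustrates that general pattern only and is not the authors' implementation. It assumes integer-valued flow, a forward-backward consistency check as the occlusion test (a common heuristic, not confirmed by the abstract), and a squared-difference penalty. The names `occlusion_mask` and `consistency_loss` are hypothetical.

```python
from typing import List, Tuple

# Assumed toy representations (not from the paper):
Flow = List[List[Tuple[int, int]]]   # per-pixel integer displacement (dy, dx)
Probs = List[List[List[float]]]      # per-pixel class probabilities, H x W x C


def occlusion_mask(fwd: Flow, bwd: Flow, tol: int = 0) -> List[List[bool]]:
    """Return True where a pixel is treated as non-occluded.

    A pixel counts as non-occluded when following the forward flow and then
    the backward flow brings it back to (within `tol` of) its start.
    """
    h, w = len(fwd), len(fwd[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = fwd[y][x]
            ty, tx = y + dy, x + dx
            if not (0 <= ty < h and 0 <= tx < w):
                continue  # target leaves the frame: treat as occluded
            by, bx = bwd[ty][tx]
            if abs(dy + by) + abs(dx + bx) <= tol:
                mask[y][x] = True
    return mask


def consistency_loss(p_t: Probs, p_next: Probs, fwd: Flow,
                     mask: List[List[bool]]) -> float:
    """Mean squared difference between frame-t predictions and the
    flow-matched frame-(t+1) predictions, over non-occluded pixels only."""
    total, count = 0.0, 0
    for y, row in enumerate(mask):
        for x, visible in enumerate(row):
            if not visible:
                continue
            dy, dx = fwd[y][x]
            matched = p_next[y + dy][x + dx]  # in bounds because mask is True
            total += sum((a - b) ** 2 for a, b in zip(p_t[y][x], matched))
            count += 1
    return total / count if count else 0.0


# Toy 1 x 3 frame whose content shifts one pixel to the right.
fwd = [[(0, 1), (0, 1), (0, 1)]]
bwd = [[(0, -1), (0, -1), (0, -1)]]
mask = occlusion_mask(fwd, bwd)

p_t = [[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]]
p_next_consistent = [[[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]]
p_next_flipped = [[[0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]]

loss_consistent = consistency_loss(p_t, p_next_consistent, fwd, mask)
loss_flipped = consistency_loss(p_t, p_next_flipped, fwd, mask)
```

In this toy case the rightmost pixel moves out of the frame, so it is masked out and contributes nothing to the loss. Because the loss needs no labels, a term of this kind can be computed on unannotated frames, which is the sense in which such a scheme can exploit unlabeled data during training.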