Video instance segmentation, also known as multi-object tracking and
segmentation, is an emerging computer vision research area introduced in 2019,
aiming at detecting, segmenting, and tracking instances in videos
simultaneously. By tackling the video instance segmentation tasks through
effective analysis and utilization of visual information in videos, a range of
computer vision-enabled applications (e.g., human action recognition, medical
image processing, autonomous vehicle navigation, surveillance, etc) can be
implemented. As deep-learning techniques take a dominant role in various
computer vision areas, a plethora of deep-learning-based video instance
segmentation schemes have been proposed. This survey offers a multifaceted view
of deep-learning schemes for video instance segmentation, covering various
architectural paradigms, along with comparisons of functional performance,
model complexity, and computational overheads. In addition to the common
architectural designs, auxiliary techniques for improving the performance of
deep-learning models for video instance segmentation are compiled and
discussed. Finally, we discuss a range of major challenges and directions for
further investigations to help advance this promising research field