Most existing transformer-based video instance segmentation methods extract
per-frame features independently, which makes it challenging to handle the
appearance deformation problem. In this paper, we observe that temporal
information is important as well, and we propose TAFormer to aggregate
spatio-temporal features in both the transformer encoder and decoder.
Specifically, in the transformer encoder, we propose a novel spatio-temporal
joint multi-scale deformable attention module that dynamically integrates
spatial and temporal information to obtain enriched spatio-temporal features.
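To illustrate the idea, the sketch below extends deformable sampling across neighboring frames in PyTorch. The class name, tensor shapes, and the simplified bilinear sampling via `grid_sample` are our own assumptions for exposition, not the paper's reference implementation.

```python
# Minimal sketch of spatio-temporal deformable sampling (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalDeformableAttn(nn.Module):
    def __init__(self, dim=256, n_frames=3, n_levels=4, n_points=4):
        super().__init__()
        self.n_frames, self.n_levels, self.n_points = n_frames, n_levels, n_points
        # Offsets and weights are predicted per (frame, level, point), so each
        # query dynamically attends across both space and time.
        self.offsets = nn.Linear(dim, n_frames * n_levels * n_points * 2)
        self.weights = nn.Linear(dim, n_frames * n_levels * n_points)
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, ref_pts, feats):
        # query:   (B, Nq, C) encoder queries of the current frame
        # ref_pts: (B, Nq, 2) normalized reference points in [0, 1]
        # feats:   list over frames of lists over levels of (B, C, H, W) maps
        B, Nq, C = query.shape
        off = self.offsets(query).view(B, Nq, self.n_frames, self.n_levels,
                                       self.n_points, 2)
        w = self.weights(query).view(B, Nq, -1).softmax(-1)
        w = w.view(B, Nq, self.n_frames, self.n_levels, self.n_points)
        out = query.new_zeros(B, Nq, C)
        for t in range(self.n_frames):
            for l in range(self.n_levels):
                # Sampling locations in [-1, 1] for grid_sample.
                loc = (ref_pts[:, :, None, :] + off[:, :, t, l]) * 2 - 1
                sampled = F.grid_sample(feats[t][l], loc,
                                        align_corners=False)  # (B, C, Nq, P)
                out += (sampled * w[:, :, t, l].unsqueeze(1)).sum(-1).transpose(1, 2)
        return self.proj(out)
```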
In the transformer decoder, we introduce a temporal self-attention module that
enhances the frame-level box queries with temporal relations.
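A minimal sketch of such a module follows, assuming box queries are arranged as (batch, frames, queries, channels) and attention runs along the frame axis; the module layout is illustrative, not the paper's exact design.

```python
# Temporal self-attention over frame-level box queries: each query slot
# attends to its counterparts in the other frames of the clip.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, box_queries):
        # box_queries: (B, T, Nq, C) per-frame box queries for T frames.
        B, T, Nq, C = box_queries.shape
        # Fold the query slots into the batch so each of the B*Nq sequences
        # has length T, i.e. attention is computed along the temporal axis.
        q = box_queries.permute(0, 2, 1, 3).reshape(B * Nq, T, C)
        out, _ = self.attn(q, q, q)
        q = self.norm(q + out)  # residual connection
        return q.view(B, Nq, T, C).permute(0, 2, 1, 3)
```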
Moreover, TAFormer adopts an instance-level contrastive loss to increase the
discriminability of instance query embeddings, thereby reducing the tracking
errors caused by visually similar instances.
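A common InfoNCE-style realization of such a loss is sketched below; pairing embeddings of the same instances across two frames and the temperature value are assumptions for illustration, not details taken from the paper.

```python
# Instance-level contrastive loss sketch: embeddings of the same instance in
# two frames form positive pairs; other instances serve as negatives.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(emb_a, emb_b, temperature=0.1):
    # emb_a, emb_b: (N, C) query embeddings of the same N instances taken
    # from two different frames, row i of each tensor being instance i.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature          # (N, N) similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Diagonal entries are same-instance (positive) pairs; off-diagonal
    # entries are different, possibly visually similar, instances.
    return F.cross_entropy(logits, targets)
```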
Experimental results show that TAFormer effectively leverages spatial and
temporal information to obtain context-aware feature representations and
outperforms state-of-the-art methods.