Towards Accurate Pixel-wise Object Tracking by Attention Retrieval
The encoding of the target in object tracking has recently moved from coarse
bounding boxes to fine-grained segmentation maps. Revisiting de facto
real-time approaches that can predict masks during tracking, we
observe that they usually fork a light branch from the backbone network for
segmentation. Although efficient, directly fusing backbone features without
considering the negative influence of background clutter tends to introduce
false-negative predictions, degrading segmentation accuracy. To mitigate this
problem, we propose an attention retrieval network (ARN) to perform soft
spatial constraints on backbone features. We first build a look-up-table (LUT)
with the ground-truth mask in the starting frame, and then query the LUT to
obtain an attention map for spatial constraints. Moreover, we introduce a
multi-resolution multi-stage segmentation network (MMS) to further weaken the
influence of background clutter by reusing the predicted mask to filter
backbone features. Our approach sets a new state of the art on the recent pixel-wise
object tracking benchmark VOT2020 while running at 40 fps. Notably, the
proposed model surpasses SiamMask by 11.7/4.2/5.5 points on VOT2020, DAVIS2016,
and DAVIS2017, respectively. We will release our code at
https://github.com/researchmm/TracKit.
Comment: Some technical errors. We would provide new versions late
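
The abstract's idea of building a look-up table from the first-frame mask and querying it for an attention map can be illustrated with a minimal NumPy sketch. This is only one possible reading and not the authors' implementation: the function names (`build_lut`, `retrieve_attention`, `apply_soft_constraint`), the use of cosine similarity with a softmax, and the `temperature` parameter are all assumptions made for illustration.

```python
import numpy as np


def build_lut(first_feat, first_mask):
    """Build a hypothetical LUT from the first frame.

    first_feat: (C, H, W) backbone features of the starting frame.
    first_mask: (H, W) ground-truth mask with values in {0, 1}.
    Returns L2-normalised per-pixel keys (H*W, C) and mask values (H*W,).
    """
    c = first_feat.shape[0]
    keys = first_feat.reshape(c, -1).T
    keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    values = first_mask.reshape(-1).astype(np.float64)
    return keys, values


def retrieve_attention(lut, feat, temperature=0.05):
    """Query the LUT with current-frame features to get an attention map.

    Assumption: retrieval is a softmax over cosine similarities, so each
    pixel's attention is a weighted average of first-frame mask values.
    """
    keys, values = lut
    c, h, w = feat.shape
    queries = feat.reshape(c, -1).T
    queries = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-8)
    sim = queries @ keys.T / temperature          # (H*W, N) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights @ values).reshape(h, w)


def apply_soft_constraint(feat, attn):
    """Softly suppress background by scaling every channel with the attention."""
    return feat * attn[None, :, :]
```

Under these assumptions, pixels whose features resemble first-frame foreground receive attention near 1 and background-like pixels receive attention near 0, so multiplying the backbone features by the map acts as a soft spatial constraint rather than a hard crop.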