Deformable Siamese Attention Networks for Visual Object Tracking
Siamese-based trackers have achieved excellent performance on visual object
tracking. However, the target template is not updated online, and the features
of the target template and search image are computed independently in a Siamese
architecture. In this paper, we propose Deformable Siamese Attention Networks,
referred to as SiamAttn, by introducing a new Siamese attention mechanism that
computes deformable self-attention and cross-attention. The self-attention
learns strong contextual information via spatial attention, and selectively
emphasizes interdependent channel-wise features with channel attention. The
cross-attention aggregates rich contextual inter-dependencies between the
target template and the search image, providing an implicit mechanism for
adaptively updating the target template. In addition, we design a region
refinement module that computes depth-wise cross-correlations between the
attentional features for more accurate tracking. We conduct experiments on six
benchmarks, where our method achieves new state-of-the-art results,
outperforming the strong baseline, SiamRPN++ [24], improving EAO from 0.464 to
0.537 on VOT2016 and from 0.415 to 0.470 on VOT2018. Our code is available at:
https://github.com/msight-tech/research-siamattn
Comment: CVPR 2020
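For illustration, here is a minimal NumPy sketch of the depth-wise cross-correlation that the region refinement module computes between attentional features: each template channel acts as a separate correlation kernel over the matching channel of the search features. The shapes and the naive sliding-window loops are my own simplification; the actual model applies this to learned deep features with an optimized grouped-convolution op.

```python
import numpy as np

def depthwise_xcorr(template, search):
    """Depth-wise cross-correlation (valid mode, stride 1).

    template: (C, kh, kw) per-channel kernels from the template branch.
    search:   (C, H, W) features from the search branch.
    Returns a (C, H-kh+1, W-kw+1) response map, one channel per kernel.
    Shapes are illustrative, not taken from the paper.
    """
    C, kh, kw = template.shape
    _, H, W = search.shape
    out = np.zeros((C, H - kh + 1, W - kw + 1))
    for c in range(C):  # each channel correlated independently
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[c, i, j] = np.sum(search[c, i:i + kh, j:j + kw] * template[c])
    return out
```

In a deep-learning framework the same operation is typically expressed as a grouped convolution with `groups` equal to the channel count, which keeps the per-channel matching that distinguishes it from a full cross-correlation.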
Learning attentions: residual attentional Siamese Network for high performance online visual tracking
Offline training for object tracking has recently shown
great potential in balancing tracking accuracy and speed.
However, it is still difficult to adapt an offline-trained model
to a target tracked online. This work presents a Residual Attentional
Siamese Network (RASNet) for high-performance
object tracking. The RASNet model reformulates the correlation
filter within a Siamese tracking framework, and introduces
different kinds of attention mechanisms to adapt
the model without online updates. In particular,
by exploiting the offline-trained general attention, the target-adapted
residual attention, and the channel-favored feature
attention, RASNet not only mitigates the over-fitting
problem in deep network training, but also enhances its discriminative
capacity and adaptability thanks to the separation
of representation learning and discriminator learning. The
proposed deep architecture is trained end to end and
takes full advantage of rich spatio-temporal information
to achieve robust visual tracking. Experimental results
on two recent benchmarks, OTB-2015 and VOT2017, show
that the RASNet tracker achieves state-of-the-art tracking accuracy
while running at more than 80 frames per second.
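As a rough sketch of what channel-wise feature attention (mentioned in both abstracts) does, the snippet below pools each channel to a scalar, passes the pooled vector through a small two-layer gate, and rescales the channels by the resulting weights. This is a generic squeeze-and-excitation-style gate of my own construction; the papers' exact attention formulations differ, and `w1`/`w2` are illustrative parameters, not values from either model.

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Rescale feature channels by learned importance weights.

    feat: (C, H, W) feature map; w1, w2: illustrative gate matrices.
    Returns feat with each channel multiplied by a weight in (0, 1).
    """
    C = feat.shape[0]
    pooled = feat.reshape(C, -1).mean(axis=1)       # squeeze: (C,) channel summary
    hidden = np.maximum(0.0, w1 @ pooled)           # small gate, ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid weights in (0, 1)
    return feat * gate[:, None, None]               # excite: rescale channels
```

Because the gate depends on the pooled content of the features themselves, channels that carry target-relevant responses can be emphasized while background-dominated channels are suppressed, without any online weight update.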