End-to-end Flow Correlation Tracking with Spatial-temporal Attention
Discriminative correlation filters (DCF) with deep convolutional features
have achieved favorable performance in recent tracking benchmarks. However,
most existing DCF trackers consider only the appearance features of the
current frame and hardly benefit from motion and inter-frame information. The lack of
temporal information degrades the tracking performance during challenges such
as partial occlusion and deformation. In this work, we focus on making use of
the rich flow information in consecutive frames to improve the feature
representation and the tracking accuracy. First, the individual components,
including optical flow estimation, feature extraction, aggregation, and
correlation filter tracking, are formulated as special layers in a network. To the
best of our knowledge, this is the first work to jointly train flow and
tracking tasks in a deep learning framework. Then the historical feature maps at
predefined intervals are warped and aggregated with the current ones under the
guidance of flow. For adaptive aggregation, we propose a novel spatial-temporal
attention mechanism. Extensive experiments are performed on four challenging
tracking datasets: OTB2013, OTB2015, VOT2015 and VOT2016, and the proposed
method achieves superior results on these benchmarks. Comment: Accepted in CVPR 201
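The flow-guided warping and adaptive aggregation described above can be sketched in a few lines. The following is a minimal NumPy illustration under assumed shapes, not the paper's implementation (which runs inside a trained network with learned attention): a historical feature map is bilinearly warped along the flow toward the current frame, and warped maps are then fused with the current one using softmax weights derived from cosine similarity, a simplified stand-in for the learned spatial-temporal attention.

```python
import numpy as np

def warp_features(feat, flow):
    """Bilinearly warp a feature map along an optical-flow field.

    feat: (C, H, W) historical feature map.
    flow: (2, H, W) per-pixel (dx, dy) pointing to the source location.
    Returns a (C, H, W) map aligned to the current frame.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(xs + flow[0], 0, W - 1)
    src_y = np.clip(ys + flow[1], 0, H - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx, wy = src_x - x0, src_y - y0
    # Blend the four neighboring feature vectors per output pixel.
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy)
            + feat[:, y0, x1] * wx * (1 - wy)
            + feat[:, y1, x0] * (1 - wx) * wy
            + feat[:, y1, x1] * wx * wy)

def aggregate(current, warped_list):
    """Fuse current and warped historical maps with per-pixel softmax weights.

    Weights come from cosine similarity to the current map; the paper instead
    learns these weights with a spatial-temporal attention module.
    """
    stack = np.stack([current] + warped_list)                       # (T, C, H, W)
    cur = current / (np.linalg.norm(current, axis=0, keepdims=True) + 1e-8)
    nrm = stack / (np.linalg.norm(stack, axis=1, keepdims=True) + 1e-8)
    sim = (nrm * cur[None]).sum(axis=1)                             # (T, H, W)
    w = np.exp(sim - sim.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    return (stack * w[:, None]).sum(axis=0)                         # (C, H, W)
```

With zero flow, warping is the identity, and aggregating identical maps returns the current map unchanged, which makes the sketch easy to sanity-check.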
Deformable Siamese Attention Networks for Visual Object Tracking
Siamese-based trackers have achieved excellent performance on visual object
tracking. However, the target template is not updated online, and the features
of the target template and search image are computed independently in a Siamese
architecture. In this paper, we propose Deformable Siamese Attention Networks,
referred to as SiamAttn, by introducing a new Siamese attention mechanism that
computes deformable self-attention and cross-attention. The self-attention
learns strong context information via spatial attention, and selectively
emphasizes interdependent channel-wise features with channel attention. The
cross-attention is capable of aggregating rich contextual inter-dependencies
between the target template and the search image, providing an implicit manner
to adaptively update the target template. In addition, we design a region
refinement module that computes depth-wise cross correlations between the
attentional features for more accurate tracking. We conduct experiments on six
benchmarks, where our method achieves new state-of-the-art results,
outperforming the strong baseline, SiamRPN++ [24], improving EAO from 0.464 to
0.537 on VOT2016 and from 0.415 to 0.470 on VOT2018. Our code is available at:
https://github.com/msight-tech/research-siamattn. Comment: Accepted in CVPR 2020.
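The depth-wise cross correlation used by the region refinement module can be sketched as follows. This is a hedged NumPy illustration of the general operation (treating the template feature map as a per-channel kernel slid over the search feature map, as popularized by SiamRPN++), not the authors' code; shapes and names are assumptions.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Depth-wise (per-channel) cross correlation.

    search:   (C, Hs, Ws) feature map of the search image.
    template: (C, Ht, Wt) feature map of the target template, used as a
              separate correlation kernel for each channel.
    Returns a (C, Hs-Ht+1, Ws-Wt+1) response map (valid correlation).
    """
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    Ho, Wo = Hs - Ht + 1, Ws - Wt + 1
    out = np.empty((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = search[:, i:i + Ht, j:j + Wt]
            # Channel-wise inner product: no mixing across channels.
            out[:, i, j] = (patch * template).sum(axis=(1, 2))
    return out
```

In a deep-learning framework this is usually expressed as a grouped convolution with `groups = C`; the loop form above only makes the per-channel structure explicit.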