Learning attentions: residual attentional Siamese Network for high performance online visual tracking
Offline training for object tracking has recently shown
great potential in balancing tracking accuracy and speed.
However, it is still difficult to adapt an offline trained model
to a target tracked online. This work presents a Residual Attentional
Siamese Network (RASNet) for high performance
object tracking. The RASNet model reformulates the correlation
filter within a Siamese tracking framework, and introduces
different kinds of attention mechanisms to adapt
the model without updating it online. In particular,
by exploiting the offline trained general attention, the target
adapted residual attention, and the channel favored feature
attention, the RASNet not only mitigates the over-fitting
problem in deep network training, but also enhances its discriminative
capacity and adaptability due to the separation
of representation learning and discriminator learning. The
proposed deep architecture is trained from end to end and
takes full advantage of the rich spatial temporal information
to achieve robust visual tracking. Experimental results
on two recent benchmarks, OTB-2015 and VOT2017, show
that the RASNet tracker achieves state-of-the-art tracking accuracy
while running at more than 80 frames per second.
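As a rough illustration of how the three attentions named in this abstract could enter a Siamese matching score, the sketch below computes one entry of an attention-weighted correlation response. It is not the RASNet implementation: the function name `weighted_correlation`, the additive combination of the general and residual attention, and the plain nested-list feature layout (channels x height x width) are all assumptions made for the example.

```python
def weighted_correlation(template, window, general, residual, channel):
    """Attention-weighted inner product of two C x H x W feature maps.

    template, window : nested lists [C][H][W] of floats (hypothetical features)
    general, residual: nested lists [H][W], spatial attention terms
    channel          : list [C], per-channel attention weights
    """
    score = 0.0
    for c, (t_map, w_map) in enumerate(zip(template, window)):
        for i, (t_row, w_row) in enumerate(zip(t_map, w_map)):
            for j, (t, w) in enumerate(zip(t_row, w_row)):
                # Assumption: the target-adapted residual term is added
                # to the offline-trained general attention.
                spatial = general[i][j] + residual[i][j]
                score += channel[c] * spatial * t * w
    return score


# One channel, 2 x 2 features, uniform general attention, no residual.
features = [[[1.0, 2.0], [3.0, 4.0]]]
ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(weighted_correlation(features, features, ones, zeros, [1.0]))
```

With uniform spatial attention and unit channel weights the score reduces to an ordinary inner product of the two feature maps, which is the unweighted cross-correlation used in a plain Siamese tracker; the attention terms only rescale individual spatial positions and channels.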
Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism
In this paper, we propose a CNN-based framework for online MOT. This
framework exploits the strengths of single object trackers in adapting appearance
models and searching for the target in the next frame. Simply applying a single
object tracker to MOT runs into problems of computational inefficiency
and drift caused by occlusion. Our framework achieves computational
efficiency by sharing features and using ROI-Pooling to obtain individual
features for each target. Some online learned target-specific CNN layers are
used for adapting the appearance model for each target. In the framework, we
introduce a spatial-temporal attention mechanism (STAM) to handle the drift
caused by occlusion and interaction among targets. The visibility map of the
target is learned and used for inferring the spatial attention map. The spatial
attention map is then applied to weight the features. Besides, the occlusion
status can be estimated from the visibility map, which controls the online
updating process via weighted loss on training samples with different occlusion
statuses in different frames. This can be regarded as a temporal attention
mechanism. The proposed algorithm achieves 34.3% and 46.0% MOTA on the
challenging MOT15 and MOT16 benchmark datasets, respectively.
Comment: Accepted at International Conference on Computer Vision (ICCV) 201
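The two attention steps described in this abstract can be sketched in a few lines, under strong simplifying assumptions. Here the visibility map is used directly as the spatial attention map (the abstract says the attention map is inferred from it, without giving the mapping), the occlusion status is taken to be the mean visibility, and the per-sample loss weights are those statuses. The function names `apply_spatial_attention`, `occlusion_status`, and `temporally_weighted_loss` are hypothetical.

```python
def apply_spatial_attention(features, attention):
    """Weight every channel of a C x H x W feature map by an H x W attention map."""
    return [
        [[f * a for f, a in zip(f_row, a_row)]
         for f_row, a_row in zip(f_map, attention)]
        for f_map in features
    ]


def occlusion_status(visibility):
    """Assumption: summarise an H x W visibility map by its mean value."""
    cells = [v for row in visibility for v in row]
    return sum(cells) / len(cells)


def temporally_weighted_loss(losses, statuses):
    """Weighted average of per-sample losses; occluded samples count less."""
    total = sum(statuses)
    if total == 0:
        return 0.0  # every sample fully occluded: nothing to learn from
    return sum(l * s for l, s in zip(losses, statuses)) / total


visibility = [[1.0, 0.5], [0.0, 0.5]]
features = [[[2.0, 4.0], [6.0, 8.0]]]
print(apply_spatial_attention(features, visibility))
print(occlusion_status(visibility))
print(temporally_weighted_loss([1.0, 3.0], [1.0, 0.0]))
```

In this toy setup a training sample from a frame where the target is fully occluded (status 0) contributes nothing to the online update, which is the intuition behind treating the weighting as attention over time.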