Learning Cascaded Siamese Networks for High Performance Visual Tracking
Visual tracking is one of the most challenging computer vision problems. To
achieve high-performance visual tracking in various adverse scenarios, a novel
cascaded Siamese network is proposed and developed based on
two different deep learning networks: a matching subnetwork and a
classification subnetwork. The matching subnetwork is a fully convolutional
Siamese network. According to the similarity score between the exemplar image
and the candidate image, it aims to search possible object positions and crop
scaled candidate patches. The classification subnetwork is designed to further
evaluate the cropped candidate patches and determine the optimal tracking
results based on the classification score. The matching subnetwork is trained
offline and fixed online, while the classification subnetwork performs
stochastic gradient descent online to learn more target-specific information.
To improve the tracking performance further, an effective classification
subnetwork update method based on both similarity and classification scores is
utilized for updating the classification subnetwork. Extensive experimental
results demonstrate that our proposed approach achieves state-of-the-art
performance in recent benchmarks. Comment: Accepted for the IEEE 26th International Conference on Image Processing
(ICIP 2019)
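The two-stage cascade described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cross_correlate`, `top_k_positions`, `cascade_track`) are hypothetical, single-channel lists of numbers stand in for learned feature maps, and `classify` is a placeholder for the online-trained classification subnetwork.

```python
def cross_correlate(exemplar, search):
    """Slide a 2-D exemplar over a 2-D search map and return the score map."""
    h, w = len(exemplar), len(exemplar[0])
    H, W = len(search), len(search[0])
    return [
        [
            sum(exemplar[a][b] * search[i + a][j + b]
                for a in range(h) for b in range(w))
            for j in range(W - w + 1)
        ]
        for i in range(H - h + 1)
    ]


def top_k_positions(scores, k):
    """Return the k (row, col) positions with the highest matching score."""
    cells = [(s, (i, j)) for i, row in enumerate(scores) for j, s in enumerate(row)]
    cells.sort(key=lambda c: c[0], reverse=True)
    return [pos for _, pos in cells[:k]]


def cascade_track(exemplar, search, classify, k=3):
    """Stage 1 proposes k candidate positions by similarity;
    stage 2 keeps the candidate with the highest classification score.
    `classify` is an assumed callable mapping a (row, col) position to a score."""
    scores = cross_correlate(exemplar, search)
    candidates = top_k_positions(scores, k)
    return max(candidates, key=classify)
```

In this sketch the second stage can overrule the similarity peak, which is the point of cascading: the matching stage only narrows the search, and the classifier makes the final decision among the surviving candidates.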
Siamese Attentional Keypoint Network for High Performance Visual Tracking
In this paper, we investigate the impacts of three main aspects of visual
tracking, i.e., the backbone network, the attentional mechanism, and the
detection component, and propose a Siamese Attentional Keypoint Network, dubbed
SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese
lightweight hourglass network is specially designed for visual tracking. It
takes advantage of the benefits of the repeated bottom-up and top-down
inference to capture more global and local contextual information at multiple
scales. Secondly, a novel cross-attentional module is utilized to leverage both
channel-wise and spatial intermediate attentional information, which can
enhance both discriminative and localization capabilities of feature maps.
Thirdly, a keypoint detection approach is devised to track any target object
by detecting the top-left corner point, the centroid point, and the
bottom-right corner point of its bounding box. Therefore, our SATIN tracker not
only has a strong capability to learn more effective object representations,
but is also efficient in computation and memory storage during both the
training and testing stages. To the best of our knowledge, we are the first to
propose this approach. Without bells and whistles, experimental results
demonstrate that our approach achieves state-of-the-art performance on several
recent benchmark datasets, at a speed far exceeding 27 frames per second. Comment: Accepted by Knowledge-Based Systems
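The keypoint-based localization mentioned in this abstract can be sketched as follows. This is an illustration under assumptions, not SATIN's actual detection head: it supposes the network outputs one heatmap per keypoint, takes the peak of each, and uses the centroid only as a consistency check against the midpoint implied by the two corners. The names `peak` and `box_from_keypoints` are hypothetical.

```python
import math


def peak(heatmap):
    """Return (x, y) of the largest value in a 2-D heatmap (list of rows)."""
    best = max(
        ((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]


def box_from_keypoints(tl_map, center_map, br_map):
    """Build a box (x1, y1, x2, y2) from the top-left and bottom-right peaks.

    Also returns the distance between the detected centroid and the midpoint
    of the two corners; a large value suggests an inconsistent detection.
    """
    x1, y1 = peak(tl_map)
    x2, y2 = peak(br_map)
    cx, cy = peak(center_map)
    offset = math.hypot(cx - (x1 + x2) / 2, cy - (y1 + y2) / 2)
    return (x1, y1, x2, y2), offset
```

A caller could, for example, reject or down-weight a box whose offset exceeds some threshold; how the three keypoints are actually combined in the paper is not stated in the abstract.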
Learning Reinforced Attentional Representation for End-to-End Visual Tracking
Although numerous recent tracking approaches have made tremendous advances in
the last decade, achieving high-performance visual tracking remains a
challenge. In this paper, we propose an end-to-end network model to learn
reinforced attentional representation for accurate target object discrimination
and localization. We utilize a novel hierarchical attentional module with long
short-term memory and multi-layer perceptrons to leverage both inter- and
intra-frame attention to effectively facilitate visual pattern emphasis.
Moreover, we incorporate a contextual attentional correlation filter into the
backbone network to make our model trainable in an end-to-end fashion. Our
proposed approach not only takes full advantage of informative geometries and
semantics but also updates correlation filters online without fine-tuning the
backbone network, so that it can adapt to variations in the target object's
appearance. Extensive experiments conducted on several popular benchmark
datasets demonstrate that our proposed approach is effective and
computationally efficient. Comment: Accepted by Information Sciences
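The idea of updating a correlation filter online while the feature extractor stays fixed can be illustrated with a generic discriminative correlation filter. This sketch is an assumption-laden stand-in, not the paper's contextual attentional correlation filter: it uses a 1-D signal, a naive DFT from the standard library, a closed-form ridge-regression solution in the Fourier domain, and a running-average update with a hypothetical learning rate `lr`.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse divides by n)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [
        sum(signal[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def gaussian_label(n, sigma=2.0):
    """Desired response: a Gaussian peak centred at index n // 2."""
    return [math.exp(-((i - n // 2) ** 2) / (2 * sigma ** 2)) for i in range(n)]


class CorrelationFilter:
    """Single-channel filter; features are assumed to come from a fixed backbone."""

    def __init__(self, lam=1e-2, lr=0.1):
        self.lam, self.lr = lam, lr  # regularisation weight and update rate
        self.num = None
        self.den = None

    def update(self, feat, label):
        """Blend the closed-form solution for this frame into the running filter."""
        X, Y = dft(feat), dft(label)
        num = [y * x.conjugate() for x, y in zip(X, Y)]
        den = [abs(x) ** 2 + self.lam for x in X]
        if self.num is None:
            self.num, self.den = num, den
        else:
            self.num = [(1 - self.lr) * a + self.lr * b for a, b in zip(self.num, num)]
            self.den = [(1 - self.lr) * a + self.lr * b for a, b in zip(self.den, den)]

    def respond(self, feat):
        """Return the real-valued response; its peak indicates the target position."""
        X = dft(feat)
        Z = [a / b * x for a, b, x in zip(self.num, self.den, X)]
        return [v.real for v in dft(Z, inverse=True)]
```

Because each update only blends a few element-wise quantities, the filter can follow appearance changes frame by frame without any gradient step through the backbone, which is the property the abstract highlights.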