Search CORE

3 research outputs found

Learning Cascaded Siamese Networks for High Performance Visual Tracking

Author: Gao Peng
Ma Yipeng
Wang Fei
Xiao Liyi
Yuan Ruyue
Publication venue
Publication date: 07/05/2019
Field of study

Visual tracking is one of the most challenging computer vision problems. In order to achieve high performance visual tracking in various negative scenarios, a novel cascaded Siamese network is proposed and developed based on two different deep learning networks: a matching subnetwork and a classification subnetwork. The matching subnetwork is a fully convolutional Siamese network. According to the similarity score between the exemplar image and the candidate image, it aims to search possible object positions and crop scaled candidate patches. The classification subnetwork is designed to further evaluate the cropped candidate patches and determine the optimal tracking results based on the classification score. The matching subnetwork is trained offline and fixed online, while the classification subnetwork performs stochastic gradient descent online to learn more target-specific information. To improve the tracking performance further, an effective classification subnetwork update method based on both similarity and classification scores is utilized for updating the classification subnetwork. Extensive experimental results demonstrate that our proposed approach achieves state-of-the-art performance in recent benchmarks.Comment: Accepted for IEEE 26th International Conference on Image Processing (ICIP 2019

arXiv.org e-Print Archive

Siamese Attentional Keypoint Network for High Performance Visual Tracking

Author: Fujita Hamido
Gao Peng
Wang Fei
Xiao Liyi
Yuan Ruyue
Zhang Yan
Publication venue: 'Elsevier BV'
Publication date: 28/12/2019
Field of study

In this paper, we investigate the impacts of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese lightweight hourglass network is specially designed for visual tracking. It takes advantage of the benefits of the repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Secondly, a novel cross-attentional module is utilized to leverage both channel-wise and spatial intermediate attentional information, which can enhance both discriminative and localization capabilities of feature maps. Thirdly, a keypoints detection approach is invented to trace any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. Therefore, our SATIN tracker not only has a strong capability to learn more effective object representations, but also is computational and memory storage efficiency, either during the training or testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.Comment: Accepted by Knowledge-Based SYSTEM

arXiv.org e-Print Archive

Learning Reinforced Attentional Representation for End-to-End Visual Tracking

Author: Fujita Hamido
Gao Peng
Wang Fei
Xiao Liyi
Zhang Qiquan
Zhang Yan
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020
Field of study

Although numerous recent tracking approaches have made tremendous advances in the last decade, achieving high-performance visual tracking remains a challenge. In this paper, we propose an end-to-end network model to learn reinforced attentional representation for accurate target object discrimination and localization. We utilize a novel hierarchical attentional module with long short-term memory and multi-layer perceptrons to leverage both inter- and intra-frame attention to effectively facilitate visual pattern emphasis. Moreover, we incorporate a contextual attentional correlation filter into the backbone network to make our model trainable in an end-to-end fashion. Our proposed approach not only takes full advantage of informative geometries and semantics but also updates correlation filters online without fine-tuning the backbone network to enable the adaptation of variations in the target object's appearance. Extensive experiments conducted on several popular benchmark datasets demonstrate that our proposed approach is effective and computationally efficient.Comment: Accepted by Information Science

arXiv.org e-Print Archive