10 research outputs found

    RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

    Existing Transformer-based RGBT tracking methods either use cross-attention to fuse the two modalities, or use self-attention and cross-attention to model both modality-specific and modality-sharing information. However, the significant appearance gap between modalities limits the feature representation ability of certain modalities during the fusion process. To address this problem, we propose a novel Progressive Fusion Transformer called ProFormer, which progressively integrates single-modality information into the multimodal representation for robust RGBT tracking. In particular, ProFormer first uses a self-attention module to collaboratively extract the multimodal representation, and then uses two cross-attention modules to let it interact with the features of each of the two modalities. In this way, the modality-specific information can be well activated in the multimodal representation. Finally, a feed-forward network is used to fuse the two interacted multimodal representations, further enhancing the final multimodal representation. In addition, existing learning methods of RGBT trackers either fuse multimodal features into one for final classification, or exploit the relationship between the unimodal branches and the fused branch through a competitive learning strategy. However, they either ignore the learning of single-modality branches or result in one branch failing to be well optimized. To solve these problems, we propose a dynamically guided learning algorithm that adaptively uses well-performing branches to guide the learning of the other branches, enhancing the representation ability of each branch. Extensive experiments demonstrate that our proposed ProFormer sets a new state-of-the-art performance on the RGBT210, RGBT234, LasHeR, and VTUAV datasets. Comment: 13 pages, 9 figures

    The eighth visual object tracking VOT2020 challenge results

    The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) the VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2020 challenge focused on “real-time” short-term tracking in RGB, (iii) the VOT-LT2020 challenge focused on long-term tracking, namely coping with target disappearance and reappearance, (iv) the VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery, and (v) the VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 datasets were refreshed. A significant novelty is the introduction of a new VOT short-term tracking evaluation methodology, and the introduction of segmentation ground truth in the VOT-ST2020 challenge; bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelties was introduced. The performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.

    RGB-T Tracking Based on Mixed Attention

    RGB-T tracking involves the use of images from both visible and thermal modalities. The primary objective is to adaptively leverage the relatively dominant modality in varying conditions to achieve more robust tracking compared to single-modality tracking. This paper proposes an RGB-T tracker based on a mixed attention mechanism to achieve complementary fusion of modalities (referred to as MACFT). In the feature extraction stage, we utilize different transformer backbone branches to extract specific and shared information from different modalities. By performing mixed attention operations in the backbone to enable information interaction and self-enhancement between the template and search images, the tracker constructs a robust feature representation that better understands the high-level semantic features of the target. Then, in the feature fusion stage, modality-adaptive fusion is achieved through a mixed attention-based modality fusion network, which suppresses the noise of the low-quality modality while enhancing the information of the dominant modality. Evaluation on multiple public RGB-T datasets demonstrates that our proposed tracker outperforms other RGB-T trackers on general evaluation metrics while also being able to adapt to long-term tracking scenarios. Comment: 14 pages, 10 figures

    Generative-based Fusion Mechanism for Multi-Modal Tracking

    Generative models (GMs) have received increasing research interest for their remarkable capacity to achieve comprehensive understanding. However, their potential application in the domain of multi-modal tracking has remained relatively unexplored. In this context, we seek to uncover the potential of harnessing generative techniques to address the critical challenge in multi-modal tracking, namely information fusion. In this paper, we delve into two prominent GM techniques: Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs). Different from the standard fusion process, where the features from each modality are directly fed into the fusion block, we condition these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances. This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance. To quantitatively gauge the effectiveness of our approach, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and three challenging benchmarks. The experimental results demonstrate that the proposed generative-based fusion mechanism achieves state-of-the-art performance, setting new records on LasHeR and RGBD1K.

    Methods to Robust Ranking of Object Trackers and to Tracker Drift Correction

    This thesis explores two topics in video object tracking: (1) performance evaluation of tracking techniques, and (2) tracker drift detection and correction. Tracking performance evaluation consists of comparing the performance measures of a set of trackers and ranking these trackers based on those measures. This is often done by computing performance averages over a video sequence and then over the entire test video dataset, which results in a significant loss of statistical information about how performance varies between frames of a video sequence and between the video sequences themselves. This work proposes two methods to evaluate trackers with respect to each other. The first method applies the median absolute deviation (MAD) to analyze the similarities between trackers and iteratively ranks them into groups of similar performance. The second method draws inspiration from the use of robust error norms in anisotropic diffusion for image denoising to perform grouping and ranking of trackers. A total of 20 trackers are scored and ranked across four different benchmarks, and experimental results show that our scoring evaluation is more robust than using the average over averages. In the second topic, we explore methods for the detection and correction of tracker drift. Drift detection refers to methods that detect whether a tracker is about to drift or has drifted away while following a target object. Drift detection triggers a drift correction mechanism, which updates the tracker's rectangular output bounding box. Most drift detection and correction algorithms are called while the target model is updating and are thus tracker-dependent. This work proposes a tracker-independent drift detection and correction method. For drift detection, we use a combination of saliency and objectness features to evaluate the likelihood that an object exists inside a tracker's output. Once drift is detected, we run a region proposal network to reinitialize the bounding box output around the target object. Our implementation, applied to two state-of-the-art trackers, shows that our method improves overall tracker performance measures when tested on three benchmarks.
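
The MAD-based grouping mentioned in the abstract above can be illustrated with a minimal sketch. Only the use of the median absolute deviation and the idea of iteratively ranking trackers into groups come from the abstract. The grouping rule (a tracker joins the current group if its score lies within `k` MADs of the best remaining score), the function names, and the parameter `k` are assumptions made for illustration, and are not the thesis's actual procedure.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    values = list(values)
    center = median(values)
    return median([abs(v - center) for v in values])


def rank_into_groups(scores, k=1.0):
    """Iteratively split trackers into ranked groups of similar scores.

    `scores` maps a tracker name to a single score (higher is better).
    Hypothetical rule: in each round, every tracker whose score is within
    k * MAD of the best remaining score joins the current group.
    """
    remaining = dict(scores)
    groups = []
    while remaining:
        spread = mad(remaining.values())
        best = max(remaining.values())
        # The best tracker always qualifies (difference 0), so the loop ends.
        group = sorted(n for n, s in remaining.items() if best - s <= k * spread)
        groups.append(group)
        for name in group:
            del remaining[name]
    return groups


if __name__ == "__main__":
    # Made-up scores, purely for demonstration.
    demo = {"A": 70, "B": 69, "C": 55, "D": 54, "E": 30}
    print(rank_into_groups(demo))
```

Because the median and the MAD are insensitive to a few extreme values, a single very poor tracker (such as `E` in the made-up data) does not inflate the spread used to decide which of the better trackers count as similar.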