RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning
Existing Transformer-based RGBT tracking methods either use cross-attention
to fuse the two modalities, or use self-attention and cross-attention to model
both modality-specific and modality-sharing information. However, the
significant appearance gap between modalities limits the feature representation
ability of certain modalities during the fusion process. To address this
problem, we propose a novel Progressive Fusion Transformer called ProFormer,
which progressively integrates single-modality information into the multimodal
representation for robust RGBT tracking. In particular, ProFormer first uses a
self-attention module to collaboratively extract the multimodal representation,
and then uses two cross-attention modules to interact it with the features of
the two modalities respectively. In this way, the modality-specific
information can be well activated in the multimodal representation. Finally, a
feed-forward network is used to fuse two interacted multimodal representations
for the further enhancement of the final multimodal representation. In
addition, existing learning methods of RGBT trackers either fuse multimodal
features into one for final classification, or exploit the relationship between
unimodal branches and fused branch through a competitive learning strategy.
However, they either ignore the learning of single-modality branches or result
in one branch failing to be well optimized. To solve these problems, we propose
a dynamically guided learning algorithm that adaptively uses well-performing
branches to guide the learning of other branches, for enhancing the
representation ability of each branch. Extensive experiments demonstrate that
our proposed ProFormer achieves new state-of-the-art performance on the RGBT210,
RGBT234, LasHeR, and VTUAV datasets.
Comment: 13 pages, 9 figures
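The progressive fusion described in the abstract (self-attention over the concatenated modalities, two cross-attention interactions with the single-modality features, then a feed-forward fusion) can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation; the averaging in the last step is a stand-in for the paper's learned feed-forward network:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def progressive_fusion(rgb, tir):
    # Step 1: self-attention over the concatenated tokens yields an
    # initial multimodal representation.
    m = np.concatenate([rgb, tir], axis=0)
    m = attention(m, m, m)
    # Step 2: two cross-attention modules let the multimodal
    # representation interact with each modality's features.
    m_rgb = attention(m, rgb, rgb)
    m_tir = attention(m, tir, tir)
    # Step 3: fuse the two interacted representations (a simple average
    # stands in for the paper's feed-forward fusion network).
    return (m_rgb + m_tir) / 2

rgb = np.random.default_rng(0).standard_normal((4, 8))
tir = np.random.default_rng(1).standard_normal((4, 8))
fused = progressive_fusion(rgb, tir)
print(fused.shape)  # (8, 8): 8 multimodal tokens, 8 channels
```

The key design point is that the multimodal representation is built first and then progressively refined against each modality, rather than fusing raw modality features in one shot.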
The eighth visual object tracking VOT2020 challenge results
The Visual Object Tracking challenge VOT2020 is the eighth
annual tracker benchmarking activity organized by the VOT initiative.
Results of 58 trackers are presented; many are state-of-the-art trackers
published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges
focusing on different tracking domains: (i) the VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2020 challenge focused on "real-time" short-term tracking in RGB, (iii) the VOT-LT2020 challenge focused on long-term tracking, namely coping with target disappearance
and reappearance, (iv) VOT-RGBT2020 challenge focused on short-term
tracking in RGB and thermal imagery and (v) VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only
the VOT-ST2020 datasets were refreshed. A significant novelty is the introduction of a new VOT short-term tracking evaluation methodology and the
introduction of segmentation ground truth in the VOT-ST2020 challenge
– bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelties was introduced. Performance of the tested trackers typically far exceeds
standard baselines. The source code for most of the trackers is publicly
available from the VOT page. The dataset, the evaluation kit and the
results are publicly available at the challenge website.
RGB-T Tracking Based on Mixed Attention
RGB-T tracking involves the use of images from both visible and thermal
modalities. The primary objective is to adaptively leverage the relatively
dominant modality in varying conditions to achieve more robust tracking
compared to single-modality tracking. In this paper, we propose an RGB-T
tracker based on a mixed attention mechanism that achieves complementary fusion
of the modalities, referred to as MACFT. In the feature extraction stage, we utilize
different transformer backbone branches to extract specific and shared
information from different modalities. Mixed attention operations in the
backbone enable information interaction and self-enhancement between the
template and search images, constructing a robust feature representation
that captures the high-level semantic features of the target. Then,
in the feature fusion stage, a modality-adaptive fusion is achieved through a
mixed attention-based modality fusion network, which suppresses the low-quality
modality noise while enhancing the information of the dominant modality.
Evaluation on multiple public RGB-T datasets demonstrates that our proposed
tracker outperforms other RGB-T trackers on general evaluation metrics while
also being able to adapt to long-term tracking scenarios.
Comment: 14 pages, 10 figures
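The modality-adaptive fusion stage can be illustrated with a toy weighting scheme. Note that the quality scores below (plain feature energy) are a hypothetical stand-in for MACFT's learned mixed-attention fusion weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_adaptive_fusion(rgb_feat, tir_feat):
    # Hypothetical per-modality quality scores (feature energy here;
    # MACFT itself learns these via its mixed-attention fusion network).
    scores = np.array([np.linalg.norm(rgb_feat), np.linalg.norm(tir_feat)])
    w = softmax(scores)
    # The dominant modality receives the larger weight, so noise from the
    # low-quality modality is suppressed in the fused representation.
    return w[0] * rgb_feat + w[1] * tir_feat, w

rgb_feat = np.ones((4, 8))        # strong, clean RGB features
tir_feat = 0.1 * np.ones((4, 8))  # weak thermal features
fused, w = modality_adaptive_fusion(rgb_feat, tir_feat)
print(w[0] > w[1])  # True: RGB dominates the fusion here
```

The point of the design is that the weighting adapts per sample, so at night the thermal branch would instead dominate.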
Generative-based Fusion Mechanism for Multi-Modal Tracking
Generative models (GMs) have received increasing research interest for their
remarkable capacity to achieve comprehensive understanding. However, their
potential application in the domain of multi-modal tracking has remained
relatively unexplored. In this context, we seek to uncover the potential of
harnessing generative techniques to address the critical challenge of
information fusion in multi-modal tracking. In this paper, we delve into two prominent GM
techniques, namely, Conditional Generative Adversarial Networks (CGANs) and
Diffusion Models (DMs). Different from the standard fusion process where the
features from each modality are directly fed into the fusion block, we
condition these multi-modal features with random noise in the GM framework,
effectively transforming the original training samples into harder instances.
This design excels at extracting discriminative clues from the features,
enhancing the ultimate tracking performance. To quantitatively gauge the
effectiveness of our approach, we conduct extensive experiments across two
multi-modal tracking tasks, three baseline methods, and three challenging
benchmarks. The experimental results demonstrate that the proposed
generative-based fusion mechanism achieves state-of-the-art performance,
setting new records on LasHeR and RGBD1K.
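The idea of conditioning multimodal features with random noise before fusion can be sketched as follows. The square-root mixing schedule mimics a simplified diffusion-style forward step, and the averaging is a placeholder for the baseline tracker's learned fusion block; neither is taken from the paper itself:

```python
import numpy as np

rng = np.random.default_rng(42)

def noise_conditioned_fusion(rgb_feat, tir_feat, t=0.3):
    # Corrupt each modality's features with Gaussian noise before fusion,
    # turning the original training samples into harder instances
    # (simplified diffusion-style forward step; t in [0, 1]).
    def corrupt(x):
        eps = rng.standard_normal(x.shape)
        return np.sqrt(1.0 - t) * x + np.sqrt(t) * eps
    # A simple average stands in for the learned fusion block, which must
    # now recover discriminative clues from the noisy features.
    return (corrupt(rgb_feat) + corrupt(tir_feat)) / 2

feats = noise_conditioned_fusion(np.ones((4, 8)), np.zeros((4, 8)))
print(feats.shape)  # (4, 8)
```

At t = 0 this reduces to standard feature averaging; larger t produces harder training instances, which is the mechanism the abstract credits for the improved discriminative power.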
Methods to Robust Ranking of Object Trackers and to Tracker Drift Correction
This thesis explores two topics in video object tracking: (1) performance evaluation of tracking techniques, and (2) tracker drift detection and correction. Tracking performance evaluation consists of comparing a set of trackers' performance measures and ranking these trackers based on those measures. This is often done by computing performance averages over a video sequence and then over the entire test video dataset, resulting in a significant loss of statistical information about performance between frames of a video sequence and between the video sequences themselves. This work proposes two methods to evaluate trackers with respect to each other. The first method applies the median absolute deviation (MAD) to analyze the similarities between trackers and iteratively rank them into groups of similar performance. The second method draws inspiration from the use of robust error norms in anisotropic diffusion for image denoising to perform grouping and ranking of trackers. A total of 20 trackers are scored and ranked across four different benchmarks, and experimental results show that our scoring evaluation is more robust than using the average of averages.
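A minimal sketch of MAD-based grouping and ranking, assuming each tracker is summarized by a single score. The grouping rule used here (within one MAD of the best remaining score) is an illustrative simplification of the thesis's iterative procedure:

```python
import numpy as np

def mad_rank(scores):
    # Iteratively peel off groups of trackers whose scores lie within one
    # median absolute deviation (MAD) of the best remaining score.
    remaining = dict(scores)
    ranking, rank = [], 1
    while remaining:
        vals = np.array(list(remaining.values()))
        mad = np.median(np.abs(vals - np.median(vals)))
        best = vals.max()
        # The best remaining tracker always qualifies (best - best = 0),
        # so each pass removes at least one tracker and the loop ends.
        group = sorted(n for n, s in remaining.items() if best - s <= mad)
        for name in group:
            del remaining[name]
        ranking.append((rank, group))
        rank += 1
    return ranking

print(mad_rank({"A": 0.90, "B": 0.88, "C": 0.50}))
# [(1, ['A', 'B']), (2, ['C'])]
```

Because the grouping threshold is a robust statistic rather than a mean, a single outlier sequence does not collapse two genuinely similar trackers into different ranks.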
In the second topic, we explore methods for the detection and correction of tracker drift. Drift detection refers to methods that detect whether a tracker is about to drift or has drifted away while following a target object. Drift detection triggers a drift correction mechanism, which updates the tracker's rectangular output bounding box. Most drift detection and correction algorithms are called while the target model is updating and are thus tracker-dependent. This work proposes a tracker-independent drift detection and correction method. For drift detection, we use a combination of saliency and objectness features to evaluate the likelihood that an object exists inside a tracker's output. Once drift is detected, we run a region proposal network to reinitialize the bounding box output around the target object. Our implementation applied to two state-of-the-art trackers shows that our method improves overall tracker performance measures when tested on three benchmarks.
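The tracker-independent drift test can be sketched as a thresholded combination of saliency and objectness scores. The equal weights and the 0.4 threshold below are illustrative choices, not values from the thesis:

```python
def object_presence(saliency, objectness, w_sal=0.5, w_obj=0.5):
    # Likelihood that a target object exists inside the tracker's output box,
    # combining saliency and objectness cues (scores assumed in [0, 1]).
    return w_sal * saliency + w_obj * objectness

def detect_drift(saliency, objectness, threshold=0.4):
    # Low combined presence -> the tracker has likely drifted; a correction
    # step (e.g. a region proposal network) would then re-initialize the box.
    return object_presence(saliency, objectness) < threshold

print(detect_drift(0.10, 0.20))  # True: drift detected, trigger correction
print(detect_drift(0.85, 0.90))  # False: target still inside the box
```

Because the test looks only at the tracker's output box, it requires no access to the tracker's internal target model, which is what makes the method tracker-independent.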