Robust Visual Tracking Revisited: From Correlation Filter to Template Matching
In this paper, we propose a novel matching-based tracker by investigating the
relationship between template matching and the recently popular correlation
filter based trackers (CFTs). In place of the correlation operation in CFTs, a
sophisticated similarity metric termed "mutual buddies similarity" (MBS) is
proposed to exploit the relationship of multiple reciprocal nearest neighbors
for target matching. By doing so, our tracker gains strong discriminative
ability in distinguishing target from background, as demonstrated by both
empirical and theoretical analyses. Moreover, instead of using a single
template with the improper updating scheme of CFTs, we design a novel online
template updating strategy named "memory filtering" (MF), which selects
a certain number of representative and reliable past tracking results to
construct the current stable and expressive template set. This scheme
helps the proposed tracker comprehensively "understand" the target
appearance variations and "recall" some stable results. Both qualitative and
quantitative evaluations on two benchmarks suggest that the proposed tracking
method performs favorably against some recently developed CFTs and other
competitive trackers.
Comment: has been published on IEEE TI
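The idea of matching through reciprocal nearest neighbors can be illustrated with a small sketch. This is not the MBS metric of the paper, whose exact definition is not given in the abstract; it is a simplified stand-in that counts point pairs which are each other's single nearest neighbor. The function names `nearest` and `mutual_nn_similarity` and the normalization by the smaller set size are assumptions made for illustration.

```python
def nearest(p, pts):
    """Index of the point in pts closest to p (squared Euclidean distance)."""
    return min(range(len(pts)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, pts[j])))


def mutual_nn_similarity(P, Q):
    """Fraction of reciprocal nearest-neighbor pairs between two point sets.

    Simplified illustration only: a pair (p_i, q_j) counts when q_j is the
    nearest neighbor of p_i in Q and p_i is the nearest neighbor of q_j in P.
    """
    p_to_q = [nearest(p, Q) for p in P]
    q_to_p = [nearest(q, P) for q in Q]
    mutual = sum(1 for i, j in enumerate(p_to_q) if q_to_p[j] == i)
    return mutual / min(len(P), len(Q))


template = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
candidate = [(0.1, 0.0), (1.1, 0.0), (5.0, 5.2)]   # slightly perturbed copy
background = [(10.0, 10.0), (11.0, 10.0), (12.0, 12.0)]
print(mutual_nn_similarity(template, candidate))
print(mutual_nn_similarity(template, background))
```

A candidate patch that resembles the template yields many reciprocal pairs, whereas an unrelated patch yields few, which is the intuition behind using reciprocity as a discriminative cue.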
End-to-end representation learning for Correlation Filter based tracking
The Correlation Filter is an algorithm that trains a linear template to
discriminate between images and their translations. It is well suited to object
tracking because its formulation in the Fourier domain provides a fast
solution, enabling the detector to be re-trained once per frame. Previous works
that use the Correlation Filter, however, have adopted features that were
either manually designed or trained for a different task. This work is the
first to overcome this limitation by interpreting the Correlation Filter
learner, which has a closed-form solution, as a differentiable layer in a deep
neural network. This enables learning deep features that are tightly coupled to
the Correlation Filter. Experiments illustrate that our method has the
important practical benefit of allowing lightweight architectures to achieve
state-of-the-art performance at high framerates.
Comment: To appear at CVPR 201
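The closed-form Fourier-domain solution mentioned above can be sketched in one dimension. The following is a minimal ridge-regression correlation filter in the MOSSE style, not the differentiable layer of the paper; the regularizer `lam`, the Gaussian target shape, the naive `dft` helper, and all function names are assumptions chosen to keep the example dependency-free.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative, not fast)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def gaussian_target(n, sigma=1.0):
    """Desired response: a circular Gaussian peaked at position 0."""
    return [math.exp(-0.5 * (min(t, n - t) / sigma) ** 2) for t in range(n)]


def train_filter(x, y, lam=1e-3):
    """Closed-form filter, element-wise per frequency: Y conj(X) / (|X|^2 + lam)."""
    X, Y = dft(x), dft(y)
    return [Yk * Xk.conjugate() / (Xk * Xk.conjugate() + lam)
            for Xk, Yk in zip(X, Y)]


def locate(h, z):
    """Apply the filter to signal z and return the index of the response peak."""
    Z = dft(z)
    response = [v.real for v in dft([Hk * Zk for Hk, Zk in zip(h, Z)],
                                    inverse=True)]
    return max(range(len(response)), key=response.__getitem__)


x = [0, 0, 1, 3, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]
h = train_filter(x, gaussian_target(len(x)))
shifted = x[-5:] + x[:-5]          # circular shift of x by 5 samples
print(locate(h, x), locate(h, shifted))
```

Because training reduces to an element-wise division per frequency, the filter can be re-estimated cheaply at every frame, and the peak of the response tracks the translation of the input.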
Feature Distilled Tracking
Feature extraction and representation are among the most important components of fast, accurate, and robust visual tracking. Very deep convolutional neural networks (CNNs) provide effective tools for feature extraction with good generalization ability. However, extracting features with very deep CNN models requires high-performance hardware because of their high computational complexity, which limits their use in real-time applications. To alleviate this problem, we aim to obtain small, fast-to-execute shallow models for visual tracking through model compression. Specifically, we propose a small feature distilled network (FDN) for tracking that imitates the intermediate representations of a much deeper network. The FDN extracts rich visual features at higher speed than the original deeper network. To speed it up further, we introduce a shift-and-stitch method that reduces the number of arithmetic operations while preserving the spatial resolution of the distilled feature maps. Finally, a scale-adaptive discriminative correlation filter is learned on the distilled features to handle scale variation of the target. Comprehensive experimental results on object tracking benchmark datasets show that the proposed approach achieves a 5x speed-up with performance competitive with state-of-the-art deep trackers.
Contrastive Transformation for Self-supervised Correspondence Learning
In this paper, we focus on the self-supervised learning of visual
correspondence using unlabeled videos in the wild. Our method simultaneously
considers intra- and inter-video representation associations for reliable
correspondence estimation. The intra-video learning transforms the image
contents across frames within a single video via the frame pair-wise affinity.
To obtain the discriminative representation for instance-level separation, we
go beyond the intra-video analysis and construct the inter-video affinity to
facilitate the contrastive transformation across different videos. By forcing
the transformation consistency between intra- and inter-video levels, the
fine-grained correspondence associations are well preserved and the
instance-level feature discrimination is effectively reinforced. Our simple
framework outperforms the recent self-supervised correspondence methods on a
range of visual tasks including video object tracking (VOT), video object
segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that
our method also surpasses the fully-supervised affinity representation (e.g.,
ResNet) and performs competitively against the recent fully-supervised
algorithms designed for the specific tasks (e.g., VOT and VOS).
Comment: To appear in AAAI 202
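The frame pair-wise affinity used to transform content between frames can be sketched as follows. This is a generic softmax-affinity formulation that is common in self-supervised correspondence work, not necessarily the exact one in the paper; the `temperature` value, the function names, and the toy features are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def affinity(target_feats, ref_feats, temperature=0.1):
    """Row i holds the soft assignment of target location i to reference locations."""
    return [softmax([sum(a * b for a, b in zip(t, r)) / temperature
                     for r in ref_feats])
            for t in target_feats]


def transform(aff, ref_content):
    """Reconstruct target content as an affinity-weighted sum of reference content."""
    dim = len(ref_content[0])
    return [[sum(w * c[d] for w, c in zip(row, ref_content)) for d in range(dim)]
            for row in aff]


ref_feats = [(1.0, 0.0), (0.0, 1.0)]
target_feats = [(0.0, 1.0), (1.0, 0.0)]   # same two locations, swapped
ref_content = [[1.0], [0.0]]              # e.g. a label or color per location
A = affinity(target_feats, ref_feats)
warped = transform(A, ref_content)
print(warped)
```

In this toy case the two locations swap places between frames, so the affinity carries each reference value to its new position; comparing the transformed content with the true target content is what supplies a training signal without labels.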
Visual tracking with online assessment and improved sampling strategy
The kernelized correlation filter (KCF) is one of the most successful trackers in computer vision today. However, its performance may degrade significantly under a wide range of challenging conditions, such as occlusion and the target moving out of view. For many applications, particularly safety-critical ones (e.g., autonomous driving), consistent and reliable performance under all operating conditions is essential. This paper addresses this weakness of KCF-based trackers by introducing two novel modules: online assessment of the response map, and a strategy that combines cyclically shifted sampling with random sampling in deep feature space. The online assessment evaluates tracking performance by constructing a 2-D Gaussian estimation model of the response map. When the assessment judges tracking to be unreliable, the combined sampling strategy is applied to improve tracking performance. The online assessment module can therefore be regarded as the trigger for the second module. Experiments on the OTB-2013 and OTB-2015 datasets verify that tracking performance improves significantly, particularly in challenging conditions, as demonstrated by both quantitative and qualitative comparisons with state-of-the-art tracking algorithms.
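One way to picture an online assessment of the response map is to score how closely the map resembles a single 2-D Gaussian centered at its peak. The abstract does not specify the estimation model, so the sketch below is an assumed stand-in: it uses the Pearson correlation between the response and an ideal Gaussian, with a hypothetical `sigma` and hypothetical function names.

```python
import math


def gaussian_map(h, w, cy, cx, sigma=1.0, height=1.0):
    """An h x w map holding a 2-D Gaussian bump centered at (cy, cx)."""
    return [[height * math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]


def gaussian_fit_score(resp, sigma=1.0):
    """Correlation between resp and an ideal Gaussian placed at its peak.

    Values near 1 suggest a clean single peak; lower values suggest a
    multi-modal or noisy response (assumed reliability criterion).
    """
    h, w = len(resp), len(resp[0])
    py, px = max(((y, x) for y in range(h) for x in range(w)),
                 key=lambda p: resp[p[0]][p[1]])
    ideal = gaussian_map(h, w, py, px, sigma)
    a = [v for row in resp for v in row]
    b = [v for row in ideal for v in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den if den else 0.0


clean = gaussian_map(7, 9, 3, 4)
second = gaussian_map(7, 9, 5, 7, height=0.9)
ambiguous = [[p + q for p, q in zip(r1, r2)]
             for r1, r2 in zip(gaussian_map(7, 9, 1, 1), second)]
print(gaussian_fit_score(clean), gaussian_fit_score(ambiguous))
```

A tracker could compare such a score against a threshold and switch to a more expensive re-detection or sampling step only when the score drops, which matches the trigger role described above.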
SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning
A steady momentum of innovations and breakthroughs has convincingly pushed
the limits of unsupervised image representation learning. Compared to static 2D
images, video has one more dimension (time). The inherent supervision existing
in such sequential structure offers a fertile ground for building unsupervised
learning models. In this paper, we compose a trilogy of exploring the basic and
generic supervision in the sequence from spatial, spatiotemporal and sequential
perspectives. We materialize the supervisory signals through determining
whether a pair of samples is from one frame or from one video, and whether a
triplet of samples is in the correct temporal order. We uniquely regard the
signals as the foundation in contrastive learning and derive a particular form
named Sequence Contrastive Learning (SeCo). SeCo shows superior results under
the linear protocol on action recognition (Kinetics), untrimmed activity
recognition (ActivityNet) and object tracking (OTB-100). More remarkably, SeCo
demonstrates considerable improvements over recent unsupervised pre-training
techniques, and leads the accuracy by 2.96% and 6.47% against fully-supervised
ImageNet pre-training in action recognition task on UCF101 and HMDB51,
respectively. Source code is available at
https://github.com/YihengZhang-CV/SeCo-Sequence-Contrastive-Learning.
Comment: AAAI 2021; code is publicly available at the URL above.