Rotation Adaptive Visual Object Tracking with Motion Consistency
Visual object tracking research has undergone significant improvement in the past few years. The emergence of the tracking-by-detection approach has been quite successful in many ways. Recently, deep convolutional neural networks have been used extensively in the most successful trackers. Yet the standard approach has been based on correlation or feature selection, with minimal consideration given to motion consistency. There is thus still a need to capture various physical constraints through motion consistency, which improves accuracy, robustness and, more importantly, rotation adaptiveness. One of the major aims of this paper is therefore to investigate the effect of rotation adaptiveness in visual object tracking. Among other key contributions, the paper also introduces various consistency terms that prove far more effective on numerous challenging sequences than the current state-of-the-art.
Comment: Accepted conference paper, WACV 2018
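The abstract leaves the consistency terms unspecified; purely as an illustration of how a motion-consistency constraint can enter a tracking-by-detection pipeline, the sketch below down-weights response-map peaks that violate a constant-velocity prior. All names and the Gaussian prior are our assumptions, not the paper's method.

```python
import numpy as np

def motion_consistent_peak(response, prev_pos, velocity, sigma=10.0):
    """Pick the detection peak after weighting the response map with a
    constant-velocity motion prior (illustrative, not the paper's model)."""
    h, w = response.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Position predicted by a constant-velocity motion model.
    pred_y = prev_pos[0] + velocity[0]
    pred_x = prev_pos[1] + velocity[1]
    # Gaussian prior: candidates far from the prediction are penalized.
    prior = np.exp(-((ys - pred_y) ** 2 + (xs - pred_x) ** 2) / (2 * sigma**2))
    return np.unravel_index(np.argmax(response * prior), response.shape)
```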
RANet: Ranking Attention Network for Fast Video Object Segmentation
Although online learning (OL) techniques have boosted the performance of semi-supervised video object segmentation (VOS) methods, the huge time cost of OL greatly restricts their practicality. Matching-based and propagation-based methods run at a faster speed by avoiding OL techniques; however, they are limited by sub-optimal accuracy due to mismatching and drifting problems. In
this paper, we develop a real-time yet very accurate Ranking Attention Network
(RANet) for VOS. Specifically, to integrate the insights of matching based and
propagation based methods, we employ an encoder-decoder framework to learn
pixel-level similarity and segmentation in an end-to-end manner. To better
utilize the similarity maps, we propose a novel ranking attention module, which
automatically ranks and selects these maps for fine-grained VOS performance.
Experiments on DAVIS-16 and DAVIS-17 datasets show that our RANet achieves the
best speed-accuracy trade-off, e.g., with 33 milliseconds per frame and
J&F=85.5% on DAVIS-16. With OL, our RANet reaches J&F=87.1% on DAVIS-16,
exceeding state-of-the-art VOS methods. The code can be found at
https://github.com/Storife/RANet.
Comment: Accepted by ICCV 2019. 10 pages, 7 figures, 6 tables. The supplementary file can be found at https://csjunxu.github.io/paper/2019ICCV/RANet_supp.pdf; code is available at https://github.com/Storife/RANet
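The exact ranking attention module is defined in the paper; the core rank-and-select idea over channel-wise similarity maps can be sketched roughly as below (PyTorch). The pooling-based scoring head, the shapes, and `keep` are our assumptions, and the hard sort stands in for the paper's trainable ranking.

```python
import torch
import torch.nn as nn

class RankSelect(nn.Module):
    """Rank-and-select over channel-wise similarity maps, loosely following
    RANet's ranking attention idea (details differ from the paper)."""
    def __init__(self, in_maps, keep=256):
        super().__init__()
        self.keep = keep
        # Small head predicting one ranking score per similarity map.
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_maps, in_maps)
        )

    def forward(self, sim):              # sim: (B, C, H, W) similarity maps
        scores = self.score(sim)         # (B, C), one score per map
        # Hard sort; the paper makes ranking trainable, this sketch does not.
        order = scores.argsort(dim=1, descending=True)[:, :self.keep]
        idx = order[..., None, None].expand(-1, -1, *sim.shape[2:])
        return sim.gather(1, idx)        # top-K maps, in ranked order
```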
LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking
In this paper, we propose a novel, effective, lightweight framework, called
LightTrack, for online human pose tracking. The proposed framework is designed
to be generic for top-down pose tracking and is faster than existing online and
offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking
(VOT) are incorporated into one unified functioning entity, easily implemented
by a replaceable single-person pose estimation module. Our framework unifies
single-person pose tracking with multi-person identity association and sheds
first light upon bridging keypoint tracking with object tracking. We also
propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a
Re-ID module in our pose tracking system. In contrast to other Re-ID modules, we use a graphical representation of human joints for matching. The skeleton-based representation effectively captures human pose similarity, is computationally inexpensive, and is robust to the sudden camera shifts that introduce human drifting. To the best of our knowledge, this is the first
paper to propose an online human pose tracking framework in a top-down fashion.
The proposed framework is general enough to fit other pose estimators and
candidate matching mechanisms. Our method outperforms other online methods
while maintaining a much higher frame rate, and is very competitive with the offline state-of-the-art. We make the code publicly available at https://github.com/Guanghan/lighttrack.
Comment: 9 pages, 6 figures, 6 tables
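LightTrack's SGCN details are in the paper; the sketch below only illustrates the general idea of embedding skeleton keypoints with a graph convolution and comparing two poses Siamese-style. The two-layer architecture, dimensions, and cosine similarity are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGCN(nn.Module):
    """Two-layer graph-convolution embedder over skeleton joints, used
    Siamese-style for pose matching (illustration, not LightTrack's SGCN)."""
    def __init__(self, adj, in_dim=2, hid=64, out=128):
        super().__init__()
        # Row-normalized skeleton adjacency (J x J, self-loops included).
        self.register_buffer("adj", adj / adj.sum(1, keepdim=True))
        self.fc1 = nn.Linear(in_dim, hid)
        self.fc2 = nn.Linear(hid, out)

    def forward(self, joints):                   # joints: (B, J, 2) keypoints
        x = F.relu(self.adj @ self.fc1(joints))  # aggregate over neighbors
        x = self.adj @ self.fc2(x)
        return F.normalize(x.mean(dim=1), dim=-1)  # one unit vector per pose

def pose_similarity(net, pose_a, pose_b):
    # Cosine similarity between the two pose embeddings.
    return (net(pose_a) * net(pose_b)).sum(-1)
```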
Target-Aware Deep Tracking
Existing deep trackers mainly use convolutional neural networks pre-trained on the generic object recognition task for representation. Despite demonstrated successes in numerous vision tasks, the contribution of pre-trained deep features to visual tracking is not as significant as it is to object recognition. The key issue is that in visual tracking the targets of interest can belong to arbitrary object classes with arbitrary forms. As such, pre-trained deep
features are less effective in modeling these targets of arbitrary forms for
distinguishing them from the background. In this paper, we propose a novel
scheme to learn target-aware features, which recognize targets undergoing significant appearance variations better than pre-trained deep features do. To
this end, we develop a regression loss and a ranking loss to guide the
generation of target-active and scale-sensitive features. We identify the
importance of each convolutional filter according to the back-propagated
gradients and select the target-aware features based on activations for
representing the targets. The target-aware features are integrated with a
Siamese matching network for visual tracking. Extensive experimental results
show that the proposed algorithm performs favorably against the
state-of-the-art methods in terms of accuracy and speed.
Comment: To appear in CVPR 2019
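As a rough sketch of the gradient-based selection described above, one can back-propagate the first-frame loss to the pre-trained feature map, average gradient magnitudes per channel, and keep the highest-ranked channels. The helper below is illustrative; the losses themselves and `keep` are assumptions.

```python
import torch

def select_target_aware_channels(feats, loss, keep=128):
    """Rank feature channels by back-propagated gradient magnitude and keep
    the most target-active ones (simplified from the paper's scheme).
    feats: (1, C, H, W) pre-trained features inside the autograd graph;
    loss:  scalar first-frame regression/ranking loss."""
    grads, = torch.autograd.grad(loss, feats, retain_graph=True)
    importance = grads.abs().mean(dim=(0, 2, 3))       # (C,) channel weights
    top = importance.argsort(descending=True)[:keep]
    return feats[:, top], top      # target-aware features and kept indices
```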
Recurrent Filter Learning for Visual Tracking
Recently, convolutional neural networks (CNNs) have gained popularity in visual tracking due to their robust feature representations of images. Recent methods perform online tracking by fine-tuning a pre-trained CNN model to the specific target object using stochastic gradient descent (SGD) back-propagation, which is usually time-consuming. In this paper, we propose a recurrent filter generation method for visual tracking. We directly feed the
target's image patch to a recurrent neural network (RNN) to estimate an
object-specific filter for tracking. Since a video sequence is spatiotemporal data, we extend the matrix multiplications of the fully-connected layers of the
RNN to a convolution operation on feature maps, which preserves the target's
spatial structure and also is memory-efficient. The tracked object in the
subsequent frames will be fed into the RNN to adapt the generated filters to
appearance variations of the target. Note that once the off-line training
process of our network is finished, there is no need to fine-tune the network
for specific objects, which makes our approach more efficient than methods that learn the target online via iterative fine-tuning. Extensive experiments conducted on the widely used OTB and VOT benchmarks demonstrate encouraging results compared to other recent methods.
Comment: ICCV 2017 Workshop on VOT
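A minimal sketch of the convolutional recurrent filter-generation idea, assuming a single target (batch size 1), a simple tanh update in place of the paper's RNN cell, and odd filter sizes; this is an illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvFilterRNN(nn.Module):
    """Convolutional recurrent filter generator: each target patch updates a
    hidden state from which an object-specific filter is read out."""
    def __init__(self, c=64):
        super().__init__()
        # Convolutions replace the RNN's fully-connected matrix products,
        # preserving spatial structure and saving memory.
        self.update = nn.Conv2d(2 * c, c, 3, padding=1)
        self.readout = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, patch_feat, hidden):   # both: (1, C, h, w)
        hidden = torch.tanh(self.update(torch.cat([patch_feat, hidden], 1)))
        return self.readout(hidden), hidden  # generated filter + new state

def correlate(search_feat, filt):
    # Apply the generated filter to search-region features (1, C, H, W);
    # filt (1, C, h, w) acts as the weight of a single-output convolution.
    return F.conv2d(search_feat, filt, padding="same")
```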
An In-Depth Analysis of Visual Tracking with Siamese Neural Networks
This survey presents a deep analysis of the learning and inference capabilities of nine popular trackers. It is neither intended to study the
whole literature nor is it an attempt to review all kinds of neural networks
proposed for visual tracking. We focus instead on Siamese neural networks which
are a promising starting point for studying the challenging problem of
tracking. These networks efficiently integrate feature learning and temporal matching, and have so far shown state-of-the-art performance. In particular, the branches of Siamese networks, the layers connecting these branches, specific aspects of training, and the embedding of these networks into
the tracker are highlighted. Quantitative results from existing papers are
compared with the conclusion that the current evaluation methodology shows
problems with the reproducibility and the comparability of results. The paper
proposes a novel Lisp-like formalism for a better comparison of trackers, which assumes a certain functional design and decomposition of trackers. The paper tries to lay a foundation for tracker design by formulating the problem in terms of the theory of machine learning and by interpreting a tracker as a decision function. The work concludes with promising lines of research and suggestions for future work.
Comment: submitted to IEEE TPAMI
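The survey's Lisp-like formalism is its own; the toy composition below merely echoes the idea of a tracker as a decision function built from smaller functions (feature extraction, matching, decision). All names here are ours, not the survey's.

```python
from typing import Callable, Tuple
import numpy as np

# Types for the three stages of the decomposition (names are ours).
Features = Callable[[np.ndarray], np.ndarray]
Match = Callable[[np.ndarray, np.ndarray], np.ndarray]
Decide = Callable[[np.ndarray], Tuple[int, int]]

def make_tracker(phi: Features, match: Match, decide: Decide):
    """Compose a tracker from feature extraction, matching and decision."""
    def track(template_img: np.ndarray, search_img: np.ndarray):
        response = match(phi(template_img), phi(search_img))
        return decide(response)  # e.g. argmax of the response map
    return track
```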
CRACT: Cascaded Regression-Align-Classification for Robust Visual Tracking
High quality object proposals are crucial in visual tracking algorithms that
utilize region proposal network (RPN). Refinement of these proposals, typically
by box regression and classification in parallel, has been popularly adopted to
boost tracking performance. However, it still struggles with complex and dynamic backgrounds. Thus motivated, in this paper we introduce an
improved proposal refinement module, Cascaded Regression-Align-Classification
(CRAC), which yields new state-of-the-art performance on many benchmarks.
First, having observed that the offsets from box regression can serve as
guidance for proposal feature refinement, we design CRAC as a cascade of box
regression, feature alignment and box classification. The key is to bridge box
regression and classification via an alignment step, which leads to more
accurate features for proposal classification with improved robustness. To
address the variation in object appearance, we introduce an
identification-discrimination component for box classification, which leverages a reliable fine-grained offline template and rich online background information to distinguish the target from the background. Moreover, we present a pyramid
RoIAlign that benefits CRAC by exploiting both the local and global cues of
proposals. During inference, tracking proceeds by ranking all refined proposals
and selecting the best one. In experiments on seven benchmarks including
OTB-2015, UAV123, NfS, VOT-2018, TrackingNet, GOT-10k and LaSOT, our CRACT
exhibits very promising results in comparison with state-of-the-art competitors
and runs in real-time.
Comment: tech. report
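A simplified rendering of the regression-align-classification cascade: regress offsets from proposal features, re-pool features aligned to the refined boxes, then classify. It uses torchvision's `roi_align`; the `regress`/`classify` heads, coordinate conventions, and the omission of the pyramid RoIAlign are our simplifications.

```python
import torch
from torchvision.ops import roi_align

def crac_refine(feat, proposals, regress, classify, out_size=7):
    """Regress first, re-pool features aligned to the refined boxes, then
    classify. feat: (1, C, H, W); proposals: (N, 4) boxes in feature coords;
    regress/classify: assumed heads returning (N, 4) offsets / (N,) scores."""
    pooled = roi_align(feat, [proposals], output_size=out_size)
    refined = proposals + regress(pooled)          # regression stage
    aligned = roi_align(feat, [refined], output_size=out_size)  # align stage
    scores = classify(aligned)                     # classification stage
    return refined[scores.argmax()], scores        # best refined proposal
```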
Visual Object Tracking based on Adaptive Siamese and Motion Estimation Network
Recently, convolutional neural networks (CNNs) have attracted much attention in different areas of computer vision due to their powerful abstract feature representations. Visual object tracking is one of the interesting and important areas in computer vision that has achieved remarkable improvements in recent years. In this work, we aim to improve both the motion and observation models in visual object tracking by leveraging the representation power of CNNs. To this end, a motion estimation network (named MEN) is utilized to seek the most likely locations of the target and provide a further cue in addition to the previous target position. Motion estimation is thus enhanced by generating a
small number of candidates near two plausible positions. The generated
candidates are then fed into a trained Siamese network to detect the most
probable candidate. Each candidate is compared to an adaptable buffer, which is
updated under a predefined condition. To take into account the target
appearance changes, a weighting CNN (called WCNN) adaptively assigns weights to
the final similarity scores of the Siamese network using sequence-specific
information. Evaluation results on well-known benchmark datasets (OTB100, OTB50
and OTB2013) prove that the proposed tracker outperforms the state-of-the-art
competitors.
Comment: 28 pages, 1 algorithm, 7 figures, 2 tables. Submitted to Elsevier, Image and Vision Computing
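As a sketch of the two-position candidate scheme described above: sample candidates around both the previous target position and the MEN's estimate, then score them with the Siamese network re-weighted by the WCNN. The samplers and callables below are assumptions, not the paper's implementation.

```python
import numpy as np

def generate_candidates(prev_pos, men_pos, n=32, sigma=8.0, rng=None):
    """Sample candidate locations around the two plausible positions: the
    previous target position and the MEN's estimate."""
    if rng is None:
        rng = np.random.default_rng()
    centers = np.repeat([prev_pos, men_pos], n // 2, axis=0)   # (n, 2)
    return centers + rng.normal(0.0, sigma, size=(n, 2))

def best_candidate(candidates, siamese_score, weight_fn):
    # Siamese similarity per candidate, re-weighted by the (assumed) WCNN.
    scores = [weight_fn(c) * siamese_score(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```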
Unsupervised Deep Tracking
We propose an unsupervised visual tracking method in this paper. Different
from existing approaches using extensive annotated data for supervised
learning, our CNN model is trained on large-scale unlabeled videos in an
unsupervised manner. Our motivation is that a robust tracker should be
effective in both the forward and backward predictions (i.e., the tracker can
forward localize the target object in successive frames and backtrace to its
initial position in the first frame). We build our framework on a Siamese
correlation filter network, which is trained using unlabeled raw videos.
Meanwhile, we propose a multiple-frame validation method and a cost-sensitive
loss to facilitate unsupervised learning. Without bells and whistles, the
proposed unsupervised tracker achieves the baseline accuracy of fully
supervised trackers, which require complete and accurate labels during
training. Furthermore, the unsupervised framework exhibits potential for leveraging unlabeled or weakly labeled data to further improve the tracking accuracy.
Comment: To appear in CVPR 2019
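The round-trip idea in this abstract can be written down compactly: track forward through unlabeled frames, track backward, and penalize the distance from the recovered first-frame position to the initial one. A minimal sketch, assuming a differentiable `tracker(frame, box) -> box` interface (the paper itself trains a Siamese correlation filter network).

```python
import torch.nn.functional as F

def forward_backward_loss(tracker, frames, init_box):
    """Track forward through unlabeled frames, then backtrace, and penalize
    the distance from the recovered position to the initial one."""
    box = init_box
    for f in frames[1:]:                  # forward pass through the clip
        box = tracker(f, box)
    for f in reversed(frames[:-1]):       # backward pass to the first frame
        box = tracker(f, box)
    return F.mse_loss(box, init_box)      # round-trip consistency penalty
```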
Prediction-Tracking-Segmentation
We introduce a prediction driven method for visual tracking and segmentation
in videos. Instead of solely relying on matching with appearance cues for
tracking, we build a predictive model that efficiently guides the search for more accurate tracking regions. With the proposed prediction mechanism, we
improve the model robustness against distractions and occlusions during
tracking. We demonstrate significant improvements over state-of-the-art methods
not only on visual tracking tasks (VOT 2016 and VOT 2018) but also on video
segmentation datasets (DAVIS 2016 and DAVIS 2017).
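The paper's predictive model is learned; as a toy stand-in, even a constant-velocity prediction shows how a prediction-driven search region differs from simply re-centering on the last detection. Everything below is an assumption for illustration.

```python
import numpy as np

def predicted_search_region(history, scale=2.5):
    """Center the next search window on a motion prediction rather than the
    last detection alone. history: list of (x, y, w, h) target boxes."""
    (x0, y0, _, _), (x1, y1, w, h) = history[-2], history[-1]
    pred = np.array([2 * x1 - x0, 2 * y1 - y0])     # constant-velocity step
    size = np.array([w, h]) * scale                 # enlarged window
    return np.concatenate([pred - size / 2, size])  # (x, y, w, h) window
```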