Robust Visual Object Tracking with Two-Stream Residual Convolutional Networks
Current deep-learning-based visual tracking approaches have been very
successful, learning the target classification and/or estimation model offline
from a large amount of supervised training data. However, most of them can
still fail under challenging conditions such as dense distractor objects,
confusing backgrounds, and motion blur.
Inspired by the human "visual tracking" capability which leverages motion cues
to distinguish the target from the background, we propose a Two-Stream Residual
Convolutional Network (TS-RCN) for visual tracking, which successfully exploits
both appearance and motion features for model update. Our TS-RCN can be
integrated with existing deep learning based visual trackers. To further
improve the tracking performance, we adopt a "wider" residual network ResNeXt
as its feature extraction backbone. To the best of our knowledge, TS-RCN is the
first end-to-end trainable two-stream visual tracking system, which makes full
use of both the appearance and motion features of the target. We have
extensively evaluated TS-RCN on the most widely used benchmark datasets,
including VOT2018, VOT2019, and GOT-10K. The experimental results demonstrate
that our two-stream model greatly outperforms the appearance-based tracker and
achieves state-of-the-art performance. The tracking system can run at up to
38.1 FPS.
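
The idea of using motion cues to separate the target from look-alike
distractors can be illustrated with a very small sketch. This is not the TS-RCN
architecture: it assumes a hand-made appearance score map, uses plain frame
differencing as the motion cue, and fuses the two with a fixed weight. The
function names (motion_map, fuse_scores, locate) and the weight parameter are
hypothetical and chosen only for this illustration.

```python
import numpy as np

def motion_map(prev_frame, curr_frame):
    """Absolute frame difference, scaled to [0, 1] (assumed motion cue)."""
    prev = np.asarray(prev_frame, dtype=np.float64)
    curr = np.asarray(curr_frame, dtype=np.float64)
    diff = np.abs(curr - prev)
    peak = diff.max()
    # Avoid division by zero when the two frames are identical.
    return diff / peak if peak > 0 else diff

def fuse_scores(appearance_score, motion, weight=0.5):
    """Weighted sum of the appearance and motion maps (assumed late fusion)."""
    return (1.0 - weight) * np.asarray(appearance_score) + weight * motion

def locate(score):
    """Row and column of the highest score."""
    return np.unravel_index(np.argmax(score), score.shape)

# Two equally strong appearance peaks: a static distractor and the moving target.
appearance = np.zeros((4, 4))
appearance[0, 0] = 1.0   # distractor, does not move
appearance[2, 3] = 1.0   # target, moves between frames

prev = np.zeros((4, 4))
curr = np.zeros((4, 4))
curr[2, 3] = 10.0        # intensity change only at the target position

fused = fuse_scores(appearance, motion_map(prev, curr))
row, col = locate(fused)
```

In this toy case the appearance map alone cannot distinguish the two peaks,
while the fused map favours the position where motion was observed.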
TDIOT: Target-driven Inference for Deep Video Object Tracking
Recent tracking-by-detection approaches use deep object detectors as the target
detection baseline because of their high performance on still images. For
effective video object tracking, object detection is integrated with a data
association step performed by either a custom-designed inference architecture
or end-to-end joint training for tracking. In this work, we adopt the
former approach and use the pre-trained Mask R-CNN deep object detector as the
baseline. We introduce a novel inference architecture placed on top of the
FPN-ResNet101 backbone of Mask R-CNN to jointly perform detection and tracking,
without requiring additional training for tracking. The proposed single
object tracker, TDIOT, applies an appearance similarity-based temporal matching
for data association. In order to tackle tracking discontinuities, we
incorporate a local search and matching module into the inference head layer
that exploits SiamFC for short-term tracking. Moreover, to improve
robustness to scale changes, we introduce a scale-adaptive region proposal
network that searches for the target in an adaptively enlarged spatial
neighborhood specified by the trace of the target. To meet long-term
tracking requirements, a low-cost verification layer is incorporated into the
inference architecture to monitor the presence of the target based on its LBP
histogram model. Performance evaluation on videos from the VOT2016, VOT2018, and
VOT-LT2018 datasets demonstrates that TDIOT achieves higher accuracy than
state-of-the-art short-term trackers while providing comparable
performance in long-term tracking.
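
The LBP-histogram presence check mentioned above can be sketched as follows.
This is a minimal illustration, not TDIOT's verification layer: it assumes the
basic 8-neighbour LBP operator, a 256-bin normalized histogram, histogram
intersection as the similarity measure, and an arbitrary threshold of 0.5. The
function names lbp_histogram and target_present are hypothetical.

```python
import numpy as np

def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes.

    `gray` is a 2-D grayscale patch of at least 3x3 pixels.
    """
    gray = np.asarray(gray, dtype=np.int32)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit where the neighbour is at least as bright as the center.
        codes |= (neighbour >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

def target_present(model_hist, candidate_hist, threshold=0.5):
    """Histogram intersection test; the threshold is an assumed value."""
    similarity = np.minimum(model_hist, candidate_hist).sum()
    return bool(similarity >= threshold)

# Toy patches: a flat patch as the target model and a ramp patch as a mismatch.
flat_hist = lbp_histogram(np.full((5, 5), 7))
ramp_hist = lbp_histogram(np.arange(25).reshape(5, 5))
```

A tracker could build the model histogram from the initial target patch and
call target_present on each new candidate patch, treating a False result as a
sign that the target may have left the view.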