98,146 research outputs found
RPT: Learning Point Set Representation for Siamese Visual Tracking
While remarkable progress has been made in robust visual tracking, accurate
target state estimation still remains a highly challenging problem. In this
paper, we argue that this issue is closely related to the prevalent bounding
box representation, which provides only a coarse spatial extent of the object. Thus
an efficient visual tracking framework is proposed to accurately estimate the
target state with a finer representation as a set of representative points. The
point set is trained to indicate the semantically and geometrically significant
positions of the target region, enabling more fine-grained localization and
modeling of object appearance. We further propose a multi-level aggregation
strategy to obtain detailed structure information by fusing hierarchical
convolution layers. Extensive experiments on several challenging benchmarks
including OTB2015, VOT2018, VOT2019 and GOT-10k demonstrate that our method
achieves new state-of-the-art performance while running at over 20 FPS.
Comment: Accepted to ECCV2020 Worksho
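To make the contrast between a point set and a bounding box concrete, here is a minimal sketch of collapsing a set of representative points into an axis-aligned box. The abstract does not say how RPT converts points to a box; the min-max rule, the function name `points_to_box`, and the sample coordinates are assumptions for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def points_to_box(points: List[Point]) -> Box:
    """Return the tightest axis-aligned box enclosing all points (assumed rule)."""
    if not points:
        raise ValueError("need at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical point set on a target; the values are made up for the example.
points = [(12.0, 30.5), (18.0, 22.0), (25.5, 41.0), (15.0, 35.0)]
box = points_to_box(points)
print(box)
```

The box keeps only four numbers, whereas the point set also records where inside the region the informative positions lie, which is the extra detail the abstract argues for.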
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking
Most thermal infrared (TIR) tracking methods are discriminative, treating the
tracking problem as a classification task. However, the objective of the
classifier (label prediction) is not coupled to the objective of the tracker
(location estimation). The classification task focuses on between-class
differences among arbitrary objects, while the tracking task mainly deals with
within-class differences of the same object. In this paper, we cast the TIR
tracking problem as a similarity verification task, which is well coupled to
the objective of the tracking task. We propose a TIR tracker via a Hierarchical
Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To
obtain both spatial and semantic features of the TIR object, we design a
Siamese CNN that coalesces multiple hierarchical convolutional layers.
Then, we propose a spatial-aware network to enhance the discriminative ability
of the coalesced hierarchical feature. Subsequently, we train this network end
to end on a large visible video detection dataset to learn the similarity
between paired objects before we transfer the network into the TIR domain.
Next, this pre-trained Siamese network is used to evaluate the similarity
between the target template and target candidates. Finally, we locate the
candidate that is most similar to the tracked target. Extensive experimental
results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed
method achieves favourable performance compared to the state-of-the-art
methods.
Comment: 20 pages, 7 figures
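The similarity-verification step (score every candidate against the template, then pick the most similar one) can be sketched as a sliding-window cross-correlation. This is not HSSNet itself: the real tracker compares learned CNN features, whereas the sketch below assumes small single-channel grids, and the names `cross_correlate` and `best_match` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def cross_correlate(search: Grid, template: Grid) -> Grid:
    """Slide the template over the search grid and record the dot product at each offset."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def best_match(response: Grid) -> Tuple[int, int]:
    """Return the (row, col) offset with the highest similarity score."""
    best_pos, best_val = (0, 0), response[0][0]
    for i, row in enumerate(response):
        for j, val in enumerate(row):
            if val > best_val:
                best_pos, best_val = (i, j), val
    return best_pos


# Toy data standing in for feature maps: a diagonal pattern hidden in the search grid.
template = [[1.0, 0.0],
            [0.0, 1.0]]
search = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0],
          [0.0, 0.0, 0.0, 0.0]]
print(best_match(cross_correlate(search, template)))
```

The raw dot product is unnormalized; a practical tracker would compare learned embeddings rather than pixel-like values.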
Kernalised Multi-resolution Convnet for Visual Tracking
Visual tracking is intrinsically a temporal problem. Discriminative
Correlation Filters (DCF) have demonstrated excellent performance for
high-speed generic visual object tracking. Building on this seminal work, there
has been a plethora of recent improvements relying on convolutional neural
networks (CNNs) pretrained on ImageNet as feature extractors for visual
tracking. However, most of these works rely on ad hoc analysis to design the
weights for different layers, using either boosting or hedging techniques as an
ensemble tracker. In this paper, we go beyond the conventional DCF framework
and propose a Kernalised Multi-resolution Convnet (KMC) formulation that
utilises hierarchical response maps to directly output the target movement.
When the learnt network is deployed directly on the unseen, challenging UAV
tracking dataset without any weight adjustment, the proposed model consistently
achieves excellent tracking performance. Moreover, the transferred
multi-resolution CNN can be integrated into the RNN temporal
learning framework, opening the door to end-to-end temporal deep
learning (TDL) for visual tracking.
Comment: CVPRW 201
Robust Object Tracking with a Hierarchical Ensemble Framework
Autonomous robots are now widely used in
many applications, such as home security, entertainment, delivery, navigation
and guidance. It is vital for robots to track objects accurately in these
applications, so tracking algorithms need improved robustness and
accuracy. In this paper, we propose a robust object tracking
algorithm based on a hierarchical ensemble framework which can incorporate
information including individual pixel features, local patches and holistic
target models. The framework combines multiple ensemble models simultaneously
instead of using a single ensemble model individually. A discriminative model
which accounts for the matching degree of local patches is adopted via a bottom
ensemble layer, and a generative model which exploits holistic templates is
used to search for the object through the middle ensemble layer as well as an
adaptive Kalman filter. We test the proposed tracker on challenging benchmark
image sequences. Both qualitative and quantitative evaluations demonstrate that
the proposed tracker outperforms several state-of-the-art
algorithms, especially when the appearance changes dramatically and
occlusions occur.
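The abstract mentions an adaptive Kalman filter but gives no details. As background only, the sketch below shows a plain scalar Kalman filter with fixed noise variances under a random-walk state model; it is not the adaptive variant from the paper, and the function name `kalman_1d` and all parameter values are assumptions.

```python
from typing import Iterable, List


def kalman_1d(
    measurements: Iterable[float],
    process_var: float = 1e-2,
    measurement_var: float = 1.0,
    x0: float = 0.0,
    p0: float = 1.0,
) -> List[float]:
    """Filter noisy scalar measurements (e.g. one coordinate of a target position)."""
    x, p = x0, p0  # state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the state is assumed to stay put while uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement according to the Kalman gain.
        gain = p / (p + measurement_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical noisy readings of a target coordinate near 5.0.
print(kalman_1d([5.2, 4.9, 5.1, 4.8, 5.0]))
```

An adaptive version would adjust `process_var` or `measurement_var` over time; how the paper does that is not stated in the abstract.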
Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey
Deep learning has recently achieved very promising results in a wide range of
areas such as computer vision, speech recognition and natural language
processing. It aims to learn hierarchical representations of data by using deep
architecture models. In a smart city, a lot of data (e.g. videos captured from
many distributed sensors) need to be automatically processed and analyzed. In
this paper, we review the deep learning algorithms applied to video analytics
for smart cities across several research topics: object detection, object
tracking, face recognition, image classification and scene labeling.
Comment: 8 pages, 18 figures
Online Object Tracking, Learning and Parsing with And-Or Graphs
This paper presents a method, called AOGTracker, for simultaneously tracking,
learning and parsing (TLP) of unknown objects in video sequences with a
hierarchical and compositional And-Or graph (AOG) representation. The AOG
captures both structural and appearance variations of a target object in a
principled way. The TLP method is formulated in the Bayesian framework with
spatial and temporal dynamic programming (DP) algorithms inferring object
bounding boxes on-the-fly. During online learning, the AOG is discriminatively
learned using latent SVM to account for appearance (e.g., lighting and partial
occlusion) and structural (e.g., different poses and viewpoints) variations of
a tracked object, as well as distractors (e.g., similar objects) in the background.
Three key issues in online inference and learning are addressed: (i)
maintaining purity of positive and negative examples collected online, (ii)
controlling model complexity in latent structure learning, and (iii) identifying
critical moments to re-learn the structure of AOG based on its intrackability.
The intrackability measures uncertainty of an AOG based on its score maps in a
frame. In experiments, our AOGTracker is tested on two popular sets of tracking
benchmarks with the same parameter setting: the TB-100/50/CVPR2013 benchmarks,
and the VOT benchmarks, namely VOT 2013, 2014, 2015 and TIR2015 (thermal imagery
tracking). In the former, our AOGTracker outperforms state-of-the-art tracking
algorithms including two trackers based on deep convolutional networks. In the
latter, our AOGTracker outperforms all other trackers in VOT2013 and is
comparable to the state-of-the-art methods in VOT2014, 2015 and TIR2015.
Comment: 17 pages, Reproducibility: The source code is released with this
paper for reproducing all results, which is available at
https://github.com/tfwu/RGM-AOGTracke
Robust Visual Tracking via Convolutional Networks
Deep networks have been successfully applied to visual tracking by learning a
generic representation offline from numerous training images. However, the
offline training is time-consuming and the learned generic representation may
be less discriminative for tracking specific objects. In this paper, we show
that, even without offline training with a large amount of auxiliary data,
simple two-layer convolutional networks can be powerful enough to develop a
robust representation for visual tracking. In the first frame, we employ the
k-means algorithm to extract a set of normalized patches from the target region
as fixed filters, which integrate a series of adaptive contextual filters
surrounding the target to define a set of feature maps in the subsequent
frames. These maps measure similarities between each filter and the useful
local intensity patterns across the target, thereby encoding its local
structural information. Furthermore, all the maps together form a global
representation, which is built on mid-level features, thereby remaining close
to image-level information, and hence the inner geometric layout of the target
is also well preserved. A simple soft shrinkage method with an adaptive
threshold is employed to de-noise the global representation, resulting in a
robust sparse representation. The representation is updated via a simple and
effective online strategy, allowing it to robustly adapt to target appearance
variations. Our convolutional networks have a surprisingly lightweight structure,
yet perform favorably against several state-of-the-art methods on the CVPR2013
tracking benchmark dataset with 50 challenging videos.
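The soft shrinkage step used for de-noising can be illustrated in a few lines: each value is pulled toward zero by a threshold, and anything smaller than the threshold becomes exactly zero, which is what makes the result sparse. The abstract does not say how the adaptive threshold is chosen; using the median of the absolute values is an assumption made here purely for illustration, as are the function names.

```python
from statistics import median
from typing import List


def soft_shrink(values: List[float], threshold: float) -> List[float]:
    """Apply sign(v) * max(|v| - threshold, 0) to every value."""
    shrunk = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0.0:
            shrunk.append(0.0)  # small responses are treated as noise
        else:
            shrunk.append(magnitude if v > 0 else -magnitude)
    return shrunk


def adaptive_soft_shrink(values: List[float]) -> List[float]:
    """Shrink with a data-dependent threshold (median of |v|; an assumed rule)."""
    threshold = median(abs(v) for v in values)
    return soft_shrink(values, threshold)


# Hypothetical flattened feature-map responses.
responses = [3.0, -0.5, 0.25, -2.0, 1.0]
print(adaptive_soft_shrink(responses))  # [2.0, 0.0, 0.0, -1.0, 0.0]
```

With this threshold rule at least half of the entries are zeroed, so the output is sparse by construction; a different rule would trade sparsity against detail.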
Positive factor networks: A graphical framework for modeling non-negative sequential data
We present a novel graphical framework for modeling non-negative sequential
data with hierarchical structure. Our model corresponds to a network of coupled
non-negative matrix factorization (NMF) modules, which we refer to as a
positive factor network (PFN). The data model is linear, subject to
non-negativity constraints, so that observation data consisting of an additive
combination of individually representable observations is also representable by
the network. This is a desirable property for modeling problems in
computational auditory scene analysis, since distinct sound sources in the
environment are often well-modeled as combining additively in the corresponding
magnitude spectrogram. We propose inference and learning algorithms that
leverage existing NMF algorithms and that are straightforward to implement. We
present a target tracking example and provide results for synthetic observation
data which serve to illustrate the interesting properties of PFNs and motivate
their potential usefulness in applications such as music transcription, source
separation, and speech recognition. We show how a target process characterized
by a hierarchical state transition model can be represented as a PFN. Our
results illustrate that a PFN which is defined in terms of a single target
observation can then be used to effectively track the states of multiple
simultaneous targets. Our results show that the quality of the inferred target
states degrades gradually as the observation noise is increased. We also
present results for an example in which meaningful hierarchical features are
extracted from a spectrogram. Such a hierarchical representation could be
useful for music transcription and source separation applications. We also
propose a network for language modeling.
Comment: Minor editing of the abstract, introduction, and concluding sections
to improve readability and remove redundant wording, based on feedback from a
reviewer. No changes were made to the material presented nor to the results.
Added an acknowledgment section to thank the reviewer. Corrected minor typo
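Since a PFN is described as a network of coupled NMF modules, a single NMF module is a useful reference point. The sketch below factorizes a non-negative matrix V into W and H with the standard Lee-Seung multiplicative updates for squared error. It illustrates one NMF building block only, not the PFN inference or learning algorithms, and the helper names, the toy matrix, and the iteration count are assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def squared_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iterations: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Approximate v by w @ h with non-negative w, h (multiplicative updates)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
    return w, h


# Toy non-negative data: every row is an additive mix of [1, 2, 3] and [1, 0, 1].
v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
w, h = nmf(v, rank=2)
print(squared_error(v, w, h))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which matches the additive-combination property the abstract highlights.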
An Integrated Approach to Crowd Video Analysis: From Tracking to Multi-level Activity Recognition
We present an integrated framework for simultaneous tracking, group detection
and multi-level activity recognition in crowd videos. Instead of solving these
problems independently and sequentially, we solve them together in a unified
framework to utilize the strong correlation that exists among individual
motion, groups, and activities. We explore the hierarchical structure hidden in
the video that connects individuals over time to produce tracks, connects
individuals to form groups and also connects groups together to form a crowd.
We show that estimation of this hidden structure corresponds to track
association and group detection. We estimate this hidden structure under a
linear programming formulation. The obtained graphical representation is
further explored to recognize the node values, which corresponds to multi-level
activity recognition. This problem is solved under a structured SVM framework.
The results on a publicly available dataset show very competitive performance at
all levels of granularity compared with state-of-the-art batch processing methods,
despite the proposed technique being an online (causal) one.
Comment: 10 pages