Long-Term Identity-Aware Multi-Person Tracking for Surveillance Video Summarization
Multi-person tracking plays a critical role in the analysis of surveillance
video. However, most existing work focuses on shorter-term (e.g. minute-long or
hour-long) video sequences. Therefore, we propose a multi-person tracking
algorithm for very long-term (e.g. month-long) multi-camera surveillance
scenarios. Long-term tracking is challenging because 1) the apparel/appearance
of the same person will vary greatly over multiple days and 2) a person will
leave and re-enter the scene numerous times. To tackle these challenges, we
leverage face recognition information, which is robust to apparel change, to
automatically reinitialize our tracker over multiple days of recordings.
Unfortunately, recognized faces are often unavailable. Therefore, our
tracker propagates identity information to frames without recognized faces by
uncovering the appearance and spatial manifold formed by person detections. We
tested our algorithm on a 23-day 15-camera data set (4,935 hours total), and we
were able to localize a person 53.2% of the time with 69.8% precision. We
further performed video summarization experiments based on our tracking output.
Results on 116.25 hours of video showed that we were able to generate a
reasonable visual diary (i.e. a summary of what a person did) for different
people, thus potentially opening the door to automatic summarization of the
vast amount of surveillance video generated every day.
Instance-Aware Representation Learning and Association for Online Multi-Person Tracking
Multi-Person Tracking (MPT) is often addressed within the
detection-to-association paradigm. In such approaches, human detections are
first extracted in every frame and person trajectories are then recovered by a
procedure of data association (usually offline). However, their performances
usually degenerate in presence of detection errors, mutual interactions and
occlusions. In this paper, we present a deep learning based MPT approach that
learns instance-aware representations of tracked persons and robustly infers
their states online. Specifically, we design a multi-branch
neural network (MBN), which predicts the classification confidences and
locations of all targets by taking a batch of candidate regions as input. In
our MBN architecture, each branch (instance-subnet) corresponds to an
individual to be tracked and new branches can be dynamically created for
handling newly appearing persons. Then based on the output of MBN, we construct
a joint association matrix that represents meaningful states of tracked persons
(e.g., being tracked or disappearing from the scene) and solve it by using the
efficient Hungarian algorithm. Moreover, we allow the instance-subnets to be
updated during tracking by online mining of hard examples, accounting for person
appearance variations over time. We comprehensively evaluate our framework on a
popular MPT benchmark, demonstrating its excellent performance in comparison
with recent online MPT methods.
Comment: accepted by Pattern Recognition
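The association step the abstract describes, solving a matrix of per-target confidences with the Hungarian algorithm, can be sketched over an invented affinity matrix (the paper's joint association matrix additionally encodes states such as disappearing from the scene; this is only an illustrative stand-in):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical affinity matrix: rows = tracked persons (instance-subnets),
# columns = candidate detections; higher values mean better matches.
affinity = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
])

# The Hungarian algorithm minimizes total cost, so negate the affinities
# to turn maximum-affinity matching into a minimum-cost assignment.
rows, cols = linear_sum_assignment(-affinity)
matches = list(zip(rows.tolist(), cols.tolist()))
print(matches)  # [(0, 0), (1, 1), (2, 2)]
```

Each `(row, col)` pair links one tracked person to one detection; unmatched rows or columns would correspond to disappearing or newly appearing persons in the full formulation.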
Semantic Instance Meets Salient Object: Study on Video Semantic Salient Instance Segmentation
Focusing only on semantic instances that are salient in a scene benefits robot
navigation and self-driving cars more than looking at all objects
in the whole scene. This paper pushes the envelope on salient regions in a
video to decompose them into semantically meaningful components, namely,
semantic salient instances. We provide a baseline for the new task of video
semantic salient instance segmentation (VSSIS): the Semantic Instance-Salient
Object (SISO) framework. The SISO framework is simple yet efficient,
leveraging advantages of two different segmentation tasks, i.e. semantic
instance segmentation and salient object segmentation to eventually fuse them
for the final result. In SISO, we introduce a sequential fusion by looking at
overlapping pixels between semantic instances and salient regions to obtain
non-overlapping instances one by one. We also introduce a recurrent instance
propagation to refine the shapes and semantic meanings of instances, and an
identity tracking to maintain both the identity and the semantic meaning of
instances over the entire video. Experimental results demonstrated the
effectiveness of our SISO baseline, which can handle occlusions in videos. In
addition, to tackle the task of VSSIS, we augment the DAVIS-2017 benchmark
dataset by assigning semantic ground-truth for salient instance labels,
obtaining SEmantic Salient Instance Video (SESIV) dataset. Our SESIV dataset
consists of 84 high-quality video sequences with pixel-wise per-frame
ground-truth labels.
Comment: accepted in WACV 201
LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking
In this paper, we propose a novel effective light-weight framework, called
LightTrack, for online human pose tracking. The proposed framework is designed
to be generic for top-down pose tracking and is faster than existing online and
offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking
(VOT) are incorporated into one unified functioning entity, easily implemented
by a replaceable single-person pose estimation module. Our framework unifies
single-person pose tracking with multi-person identity association and sheds
first light upon bridging keypoint tracking with object tracking. We also
propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a
Re-ID module in our pose tracking system. In contrast to other Re-ID modules,
we use a graphical representation of human joints for matching. The
skeleton-based representation effectively captures human pose similarity and is
computationally inexpensive. It is robust to sudden camera shift that
introduces human drifting. To the best of our knowledge, this is the first
paper to propose an online human pose tracking framework in a top-down fashion.
The proposed framework is general enough to fit other pose estimators and
candidate matching mechanisms. Our method outperforms other online methods
while maintaining a much higher frame rate, and is very competitive with our
offline state-of-the-art method. We make the code publicly available at
https://github.com/Guanghan/lighttrack.
Comment: 9 pages, 6 figures, 6 tables
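The skeleton-based matching idea can be illustrated with a much simpler stand-in: comparing two poses by their translation- and scale-normalized joint layout. This toy function is our own illustration of why a joint-based representation tolerates camera shift, not the SGCN itself:

```python
import numpy as np

def pose_similarity(joints_a, joints_b):
    """Cosine similarity between two (K, 2) keypoint arrays after
    removing translation and scale -- a crude proxy for graph-based
    pose matching."""
    a = joints_a - joints_a.mean(axis=0)   # translation invariance
    b = joints_b - joints_b.mean(axis=0)
    a = a / (np.linalg.norm(a) + 1e-8)     # scale invariance
    b = b / (np.linalg.norm(b) + 1e-8)
    return float((a * b).sum())            # in [-1, 1]

pose = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
shifted = pose + 3.0  # same pose after a sudden camera shift
print(pose_similarity(pose, shifted) > 0.99)  # True
```

Because the comparison depends only on relative joint geometry, a uniform shift of all keypoints (the "human drifting" the abstract mentions) leaves the similarity essentially unchanged.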
Multiple Object Tracking: A Literature Review
Multiple Object Tracking (MOT) is an important computer vision problem which
has gained increasing attention due to its academic and commercial potential.
Although different kinds of approaches have been proposed to tackle this
problem, it still remains challenging due to factors like abrupt appearance
changes and severe object occlusions. In this work, we contribute the first
comprehensive and most recent review on this problem. We inspect the recent
advances in various aspects and propose some interesting directions for future
research. To the best of our knowledge, there has not been any extensive review
on this topic in the community. We endeavor to provide a thorough review on the
development of this problem in recent decades. The main contributions of this
review are fourfold: 1) Key aspects of a multiple object tracking system,
including formulation, categorization, key principles, and evaluation, are
discussed. 2) Instead of enumerating individual works, we discuss existing
approaches according to various aspects, in each of which methods are divided
into different groups and each group is discussed in detail for the principles,
advances and drawbacks. 3) We examine experiments of existing publications and
summarize results on popular datasets to provide quantitative comparisons. We
also point to some interesting discoveries by analyzing these results. 4) We
provide a discussion of open issues in MOT research, as well as some
interesting directions that may become fruitful avenues for future work.
Deep Affinity Network for Multiple Object Tracking
Multiple Object Tracking (MOT) plays an important role in solving many
fundamental problems in video analysis in computer vision. Most MOT methods
employ two steps: Object Detection and Data Association. The first step detects
objects of interest in every frame of a video, and the second establishes
correspondence between the detected objects in different frames to obtain their
tracks. Object detection has made tremendous progress in the last few years due
to deep learning. However, data association for tracking still relies on
hand-crafted constraints such as appearance, motion, spatial proximity,
grouping, etc. to compute affinities between objects in different frames. In this
paper, we harness the power of deep learning for data association in tracking
by jointly modelling object appearances and their affinities between different
frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN)
learns compact yet comprehensive features of pre-detected objects at several
levels of abstraction, and performs exhaustive pairing permutations of those
features in any two frames to infer object affinities. DAN also accounts for
multiple objects appearing and disappearing between video frames. We exploit
the resulting efficient affinity computations to associate objects in the
current frame deep into the previous frames for reliable on-line tracking. Our
technique is evaluated on popular multiple object tracking challenges MOT15,
MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics
demonstrates that our approach is among the best performing techniques on the
leader board for these challenges. The open source implementation of our work
is available at https://github.com/shijieS/SST.git.
Comment: To appear in IEEE TPAMI
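The exhaustive pairing of features between two frames can be sketched as a plain similarity matrix over hypothetical detection features. DAN learns both the features and the pairing end-to-end; this is only an illustrative stand-in using cosine similarity:

```python
import numpy as np

# Hypothetical appearance features for detections in two frames:
# 3 objects in frame t, 2 objects in frame t+1, 4-D features.
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(3, 4))
feat_t1 = rng.normal(size=(2, 4))

# L2-normalize so dot products become cosine similarities.
feat_t = feat_t / np.linalg.norm(feat_t, axis=1, keepdims=True)
feat_t1 = feat_t1 / np.linalg.norm(feat_t1, axis=1, keepdims=True)

# Exhaustive pairing: affinity[i, j] scores detection i in frame t
# against detection j in frame t+1.
affinity = feat_t @ feat_t1.T
print(affinity.shape)  # (3, 2)
```

A full tracker would pad this matrix with extra rows/columns so that objects appearing or disappearing between the two frames can also be assigned, as DAN does.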
Tracklet Association Tracker: An End-to-End Learning-based Association Approach for Multi-Object Tracking
Traditional multiple object tracking methods divide the task into two parts:
affinity learning and data association. This separation requires defining a
hand-crafted training goal for the affinity learning stage and a hand-crafted
cost function for the data association stage, which prevents the tracking
objective from being learned directly from the features. In this paper, we
present a new multiple object tracking (MOT) framework with data-driven
association method, named Tracklet Association Tracker (TAT). The framework
aims to glue feature learning and data association together via a bi-level
optimization formulation so that the association results can be directly
learned from features. To boost the performance, we also adopt the popular
hierarchical association and perform the necessary alignment and selection of
raw detection responses. Our model trains over 20X faster than a similar
approach, and achieves state-of-the-art performance on both the MOT2016 and
MOT2017 benchmarks.
A Hybrid Data Association Framework for Robust Online Multi-Object Tracking
Global optimization algorithms have shown impressive performance in
data-association based multi-object tracking, but handling online data remains
a difficult hurdle to overcome. In this paper, we present a hybrid data
association framework with a min-cost multi-commodity network flow for robust
online multi-object tracking. We build local target-specific models interleaved
with global optimization of the data association over multiple video
frames. More specifically, in the min-cost multi-commodity network flow, the
target-specific similarities are online learned to enforce the local
consistency for reducing the complexity of the global data association.
Meanwhile, the global data association taking multiple video frames into
account alleviates irrecoverable errors caused by the local data association
between adjacent frames. To ensure the efficiency of online tracking, we give
an efficient near-optimal solution to the proposed min-cost multi-commodity
flow problem, and provide the empirical proof of its sub-optimality. The
comprehensive experiments on real data demonstrate the superior tracking
performance of our approach in various challenging situations.
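The flavor of tracking-as-min-cost-flow can be illustrated on a toy two-frame graph with invented node names and integer costs; the paper's multi-commodity formulation with online-learned, target-specific costs and its near-optimal solver are considerably more involved:

```python
import networkx as nx

# Toy min-cost flow formulation of two-frame data association:
# each unit of flow from S to T selects one trajectory link.
G = nx.DiGraph()
for det in ("a1", "a2"):            # frame-1 detections
    G.add_edge("S", det, capacity=1, weight=0)
for det in ("b1", "b2"):            # frame-2 detections
    G.add_edge(det, "T", capacity=1, weight=0)

# Transition edges: lower weight = stronger appearance/motion affinity.
G.add_edge("a1", "b1", capacity=1, weight=1)
G.add_edge("a1", "b2", capacity=1, weight=5)
G.add_edge("a2", "b1", capacity=1, weight=5)
G.add_edge("a2", "b2", capacity=1, weight=2)

flow = nx.max_flow_min_cost(G, "S", "T")
links = [(u, v) for u in ("a1", "a2") for v in ("b1", "b2")
         if flow[u][v] == 1]
print(links)  # [('a1', 'b1'), ('a2', 'b2')]
```

The optimizer selects the cheapest set of edge-disjoint S-to-T paths, i.e. the globally best one-to-one association, which is the core intuition behind network-flow trackers.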
CloudAR: A Cloud-based Framework for Mobile Augmented Reality
Computation capabilities of recent mobile devices enable natural feature
processing for Augmented Reality (AR). However, mobile AR applications are
still faced with scalability and performance challenges. In this paper, we
propose CloudAR, a mobile AR framework utilizing the advantages of cloud and
edge computing through recognition task offloading. We explore the design space
of cloud-based AR exhaustively and optimize the offloading pipeline to minimize
the time and energy consumption. We design an innovative tracking system for
mobile devices which provides lightweight tracking in six degrees of freedom
(6DoF) and hides the offloading latency from users' perception. We also design
a multi-object image retrieval pipeline that executes fast and accurate image
recognition tasks on servers. In our evaluations, the mobile AR application
built with the CloudAR framework runs at 30 frames per second (FPS) on average
with precise tracking (errors of only 1-2 pixels) and image recognition of at
least 97% accuracy. Our results also show that CloudAR outperforms one of the
leading commercial AR frameworks on several performance metrics.
Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor
In this paper, we focus on the two key aspects of multiple target tracking
problem: 1) designing an accurate affinity measure to associate detections and
2) implementing an efficient and accurate (near) online multiple target
tracking algorithm. As the first contribution, we introduce a novel Aggregated
Local Flow Descriptor (ALFD) that encodes the relative motion pattern between a
pair of temporally distant detections using long term interest point
trajectories (IPTs). Leveraging the IPTs, the ALFD provides a robust
affinity measure for estimating the likelihood of matching detections
regardless of the application scenarios. As another contribution, we present a
Near-Online Multi-target Tracking (NOMT) algorithm. The tracking problem is
formulated as a data-association between targets and detections in a temporal
window, which is performed repeatedly at every frame. While being efficient,
NOMT achieves robustness via integrating multiple cues including ALFD metric,
target dynamics, appearance similarity, and long term trajectory regularization
into the model. Our ablative analysis verifies the superiority of the ALFD
metric over the other conventional affinity metrics. We run a comprehensive
experimental evaluation on two challenging tracking datasets, KITTI and MOT.
The NOMT method combined with the ALFD metric achieves the best accuracy on
both datasets by significant margins (about 10% higher MOTA) over the state of
the art.
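A conventional spatial-affinity baseline of the kind ALFD is compared against is bounding-box intersection-over-union. A minimal sketch, with boxes as hypothetical `[x1, y1, x2, y2]` lists:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes -- a simple
    spatial affinity that degrades for temporally distant detections,
    which is the gap descriptors like ALFD aim to close."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two boxes overlapping on half their width share 1/3 of their union.
print(round(iou([0, 0, 10, 10], [5, 0, 15, 10]), 4))  # 0.3333
```

Because IoU drops to zero as soon as boxes stop overlapping, it is only informative between adjacent frames, whereas ALFD's long-term interest point trajectories stay informative for temporally distant detection pairs.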