Generic multiple object tracking
Multiple object tracking is an important problem in the computer vision community due to its applications, including, but not limited to, visual surveillance, crowd behavior analysis and robotics. The difficulties of this problem lie in several challenges such as frequent occlusion,
interaction, and high-degree articulation. In recent years, data association based approaches have been successful in tracking multiple pedestrians on top of specific kinds of object detectors; these approaches are therefore type-specific. This may constrain their application in scenarios where type-specific object detectors are unavailable. In view of this, I investigate in this thesis tracking multiple objects without ready-to-use, type-specific object detectors. More specifically, the problem of multiple object tracking is generalized to tracking targets of a generic type: objects to be tracked are no longer constrained to be of a specific kind. This problem is termed Generic Multiple Object Tracking (GMOT) and is handled by three approaches presented in this thesis. In the first approach, a generic object detector is learned from the manual annotation of only one initial bounding box. The detector is then employed to regularize the online learning procedure of multiple trackers, one specialized to each object. More specifically, the trackers are learned simultaneously with shared features and are guided to stay close to the detector. Experimental results show considerable improvement on this problem compared with state-of-the-art methods. The second approach treats detection and tracking of
multiple generic objects as a bi-label propagation procedure, which consists of class label
propagation (detection) and object label propagation (tracking). In particular, clustered Multi-Task Learning (cMTL) is employed along with spatio-temporal consistency to address
the online detection problem. The tracking problem is addressed by associating existing trajectories with new detection responses, considering appearance, motion and context information. The advantages of this approach are verified by extensive experiments on several public data sets. The aforementioned two approaches handle GMOT in an online manner. In contrast, a batch method is proposed in the third work. It dynamically clusters given detection hypotheses into groups corresponding to individual objects. Inspired by the success of topic models in tackling textual tasks, a Dirichlet Process Mixture Model (DPMM) is utilized to address the tracking problem, in combination with so-called must-links and cannot-links, which are proposed to avoid physical collisions. Moreover, two kinds of representations, superpixels and the Deformable Part Model (DPM), are introduced to track both rigid and non-rigid objects. The effectiveness of the proposed method is demonstrated with experiments on public data sets.
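The association step in the second approach, linking existing trajectories to new detection responses using appearance and motion cues, can be sketched as a cost-matrix assignment. The function below is a minimal illustrative version, not the thesis's actual implementation; the feature/position fields, the weights, the distance normalization, and the use of the Hungarian solver are all assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(tracks, detections, w_app=0.5, w_mot=0.5, max_cost=1.0):
    """Assign detections to existing trajectories by combined cost.

    tracks, detections: lists of dicts with 'feat' (appearance vector)
    and 'pos' (x, y centre). All field names are illustrative.
    """
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            # appearance term: cosine distance between feature vectors
            app = 1.0 - float(
                np.dot(t["feat"], d["feat"])
                / (np.linalg.norm(t["feat"]) * np.linalg.norm(d["feat"]) + 1e-8)
            )
            # motion term: normalized Euclidean distance between centres
            mot = float(np.linalg.norm(np.asarray(t["pos"], float)
                                       - np.asarray(d["pos"], float))) / 100.0
            cost[i, j] = w_app * app + w_mot * mot
    # globally optimal one-to-one assignment, then gate by max_cost
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]
```

Unmatched tracks and detections (pairs gated out by `max_cost`) would be handled separately, e.g. by terminating or spawning trajectories.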
Z-GMOT: Zero-shot Generic Multiple Object Tracking
Despite the significant progress made in recent years, Multi-Object Tracking
(MOT) approaches still suffer from several limitations, including their
reliance on prior knowledge of tracking targets, which necessitates the costly
annotation of large labeled datasets. As a result, existing MOT methods are
limited to a small set of predefined categories, and they struggle with unseen
objects in the real world. To address these issues, Generic Multiple Object
Tracking (GMOT) has been proposed, which requires less prior information about
the targets. However, all existing GMOT approaches follow a one-shot paradigm,
relying mainly on the initial bounding box and thus struggling to handle
variations in viewpoint, lighting, occlusion, and scale. In this paper,
we introduce a novel approach to address the limitations of existing MOT and
GMOT methods. Specifically, we propose a zero-shot GMOT (Z-GMOT) algorithm that
can track never-seen object categories with zero training examples, without the
need for predefined categories or an initial bounding box. To achieve this, we
propose iGLIP, an improved version of Grounded Language-Image Pre-training
(GLIP), which can detect unseen objects while minimizing false positives. We
evaluate our Z-GMOT thoroughly on the GMOT-40 dataset and the AnimalTrack and
DanceTrack test sets. The results of these evaluations demonstrate a significant
improvement over existing methods. For instance, on the GMOT-40 dataset, the
Z-GMOT outperforms one-shot GMOT with OC-SORT by 27.79 points HOTA and 44.37
points MOTA. On the AnimalTrack dataset, it surpasses fully-supervised methods
with DeepSORT by 12.55 points HOTA and 8.97 points MOTA. To facilitate further
research, we will make our code and models publicly available upon acceptance
of this paper.
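Once per-frame detections are available (from iGLIP in the paper's case), tracking reduces to associating boxes across frames. The sketch below uses greedy IoU matching to carry IDs forward; it is an illustrative simplification, not the OC-SORT pipeline the paper builds on, which additionally models motion.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)


def track(frames_detections, iou_thr=0.3):
    """Assign persistent IDs to boxes by greedy IoU matching per frame."""
    next_id, tracks, out = 0, {}, []
    for dets in frames_detections:
        assigned = {}  # box tuple -> track id for this frame
        for d in dets:
            best_id, best = None, iou_thr
            for tid, box in tracks.items():
                if tid in assigned.values():
                    continue  # each track matches at most one detection
                o = iou(d, box)
                if o > best:
                    best_id, best = tid, o
            if best_id is None:  # no overlap above threshold: new track
                best_id = next_id
                next_id += 1
            assigned[tuple(d)] = best_id
        tracks = {tid: list(box) for box, tid in assigned.items()}
        out.append({tid: list(box) for box, tid in assigned.items()})
    return out
```

Note that tracks with no matching detection are simply dropped here; real trackers keep them alive for a few frames with a motion model to survive occlusion.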
Watch and Learn: Semi-Supervised Learning of Object Detectors from Videos
We present a semi-supervised approach that localizes multiple unknown object
instances in long videos. We start with a handful of labeled boxes and
iteratively learn and label hundreds of thousands of object instances. We
propose criteria for reliable object detection and tracking for constraining
the semi-supervised learning process and minimizing semantic drift. Our
approach does not assume exhaustive labeling of each object instance in any
single frame, or any explicit annotation of negative data. Working in such a
generic setting allows us to tackle multiple object instances in video, many of
which are static. In contrast, existing approaches either do not consider
multiple object instances per video, or rely heavily on the motion of the
objects present. The experiments demonstrate the effectiveness of our approach
by evaluating the automatically labeled data on a variety of metrics like
quality, coverage (recall), diversity, and relevance to training an object
detector.
Comment: To appear in CVPR 201
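The iterative learn-and-label loop described above can be sketched as self-training: fit a detector on the current label pool, mine confident detections from the videos, and grow the pool. All names below (`fit`, `predict`, the single confidence gate) are illustrative assumptions; the paper's actual criteria also exploit tracking consistency to limit semantic drift.

```python
def watch_and_learn(videos, seed_boxes, detector, rounds=5, conf_thr=0.9):
    """Iterative semi-supervised labeling (illustrative sketch).

    videos: iterable of frame sequences; seed_boxes: the handful of
    labeled boxes we start from; detector: any object with fit(labels)
    and predict(frame) -> [(box, score), ...].
    """
    labels = list(seed_boxes)
    for _ in range(rounds):
        detector.fit(labels)          # retrain on the current pool
        mined = []
        for video in videos:
            for frame in video:
                for box, score in detector.predict(frame):
                    # keep only high-confidence detections; a real system
                    # would also verify temporal (tracking) consistency
                    if score >= conf_thr:
                        mined.append(box)
        labels.extend(mined)          # grow the label pool
    return detector, labels
```

The loop deliberately never assumes exhaustive labeling of any frame: unmined boxes are simply ignored rather than treated as negatives, matching the setting described in the abstract.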
Beyond SOT: Tracking Multiple Generic Objects at Once
Generic Object Tracking (GOT) is the problem of tracking target objects,
specified by bounding boxes in the first frame of a video. While the task has
received much attention in the last decades, researchers have almost
exclusively focused on the single object setting. Multi-object GOT benefits
from a wider applicability, rendering it more attractive in real-world
applications. We attribute the lack of research interest into this problem to
the absence of suitable benchmarks. In this work, we introduce a new
large-scale GOT benchmark, LaGOT, containing multiple annotated target objects
per sequence. Our benchmark allows users to tackle key remaining challenges in
GOT, aiming to increase robustness and reduce computation through joint
tracking of multiple objects simultaneously. In addition, we propose a
transformer-based GOT tracker baseline capable of joint processing of multiple
objects through shared computation. Our approach achieves a 4x faster run-time
in the case of 10 concurrent objects compared to tracking each object independently
and outperforms existing single object trackers on our new benchmark. In
addition, our approach achieves highly competitive results on single-object GOT
datasets, setting a new state of the art on TrackingNet with a success rate AUC
of 84.4%. Our benchmark, code, and trained models will be made publicly
available.
Comment: accepted by WACV'2
Detecting and tracking multiple interacting objects without class-specific models
We propose a framework for detecting and tracking multiple interacting objects from a single, static, uncalibrated camera. The number of objects is variable and unknown, and object-class-specific models are not available. We use background subtraction results as measurements for object detection and tracking. Given these constraints, the main challenge is to associate pixel measurements with (possibly interacting) object targets. We first track clusters of pixels, and note when they merge or split. We then build an inference graph, representing relations between the tracked clusters. Using this graph and a generic object model based on spatial connectedness and coherent motion, we label the tracked clusters as whole objects, fragments of objects, or groups of interacting objects. The outputs of our algorithm are entire tracks of objects, which may include corresponding tracks from groups of objects during interactions. Experimental results on multiple video sequences are shown.
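The merge/split bookkeeping behind the inference graph can be sketched by relating pixel clusters across consecutive frames through overlap. The helper below is an illustrative simplification, not the paper's exact graph construction; the `overlap` predicate and the event definitions are assumptions.

```python
def cluster_events(prev_clusters, cur_clusters, overlap):
    """Relate tracked pixel clusters between consecutive frames.

    overlap(a, b) -> True if two clusters share pixels (or bounding
    regions). Returns, per current cluster, the list of parent indices:
    more than one parent signals a merge; a parent with several
    children signals a split.
    """
    parents = {j: [i for i, p in enumerate(prev_clusters) if overlap(p, c)]
               for j, c in enumerate(cur_clusters)}
    merges = [j for j, ps in parents.items() if len(ps) > 1]
    children = {}
    for j, ps in parents.items():
        for i in ps:
            children.setdefault(i, []).append(j)
    splits = [i for i, cs in children.items() if len(cs) > 1]
    return parents, merges, splits
```

Edges of the resulting parent/child relation are exactly what an inference graph over tracked clusters records; labeling clusters as whole objects, fragments, or groups then amounts to reasoning over these merge and split events.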