79,077 research outputs found

Multi-Object Tracking and Segmentation via Neural Message Passing

Authors: Braso Guillem, Cetintas Orcun, Leal-Taixe Laura
Publication date: 15/07/2022

Graphs offer a natural way to formulate Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such a structured domain is not trivial. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs). By operating directly on the graph domain, our method can reason globally over an entire set of detections and exploit contextual features. It then jointly predicts both final solutions for the data association problem and segmentation masks for all objects in the scene while exploiting synergies between the two tasks. We achieve state-of-the-art results for both tracking and segmentation on several publicly available datasets. Our code is available at github.com/ocetintas/MPNTrackSeg. Comment: arXiv admin note: substantial text overlap with arXiv:1912.0751
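To make the message passing idea in the abstract above concrete, here is a minimal sketch of one generic message passing step on a detection graph, where nodes stand for detections and edges for candidate associations. It is not the authors' implementation (that lives in the repository they cite). The function name `message_passing_step`, the single linear layers with ReLU, and the sum aggregation are all simplifying assumptions made for illustration.

```python
import numpy as np

def message_passing_step(node_feats, edge_index, edge_feats, w_edge, w_node):
    """One simplified message passing step (hypothetical helper).

    node_feats: (N, Dn) array, one row per detection.
    edge_index: (2, E) integer array of (source, target) node ids.
    edge_feats: (E, De) array, one row per candidate association.
    w_edge:     (2 * Dn + De, De_out) weights of the edge update (assumed linear + ReLU).
    w_node:     (Dn + De_out, Dn_out) weights of the node update (assumed linear + ReLU).
    """
    src, dst = edge_index

    # Edge update: each edge looks at its two endpoint nodes and its own features.
    edge_in = np.concatenate([node_feats[src], node_feats[dst], edge_feats], axis=1)
    new_edges = np.maximum(edge_in @ w_edge, 0.0)

    # Node update: each node sums the updated features of all incident edges.
    aggregated = np.zeros((node_feats.shape[0], new_edges.shape[1]))
    np.add.at(aggregated, src, new_edges)
    np.add.at(aggregated, dst, new_edges)
    node_in = np.concatenate([node_feats, aggregated], axis=1)
    new_nodes = np.maximum(node_in @ w_node, 0.0)

    return new_nodes, new_edges


# Toy graph: 3 detections, 2 candidate associations (0-1 and 1-2).
rng = np.random.default_rng(0)
nodes = rng.normal(size=(3, 4))
edges = rng.normal(size=(2, 2))
index = np.array([[0, 1], [1, 2]])
w_e = rng.normal(size=(2 * 4 + 2, 2))
w_n = rng.normal(size=(4 + 2, 4))
new_nodes, new_edges = message_passing_step(nodes, index, edges, w_e, w_n)
```

Repeating such steps lets information travel further across the graph. In a tracking setting, the final edge features would typically feed a classifier that scores each candidate association.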

arXiv.org e-Print Archive

Robust Multiple Object Tracking Using ReID features and Graph Convolutional Networks

Author: Lusardi Christian
Publication venue: RIT Scholar Works
Publication date: 11/05/2021

Deep learning allows for great advancements in computer vision research and development. An area that is garnering attention is single object tracking and multi-object tracking. Object tracking continues to progress rapidly in terms of detection and building re-identification features, but more effort needs to be dedicated to data association. In this thesis, the goal is to use a graph neural network to combine information from both bounding box interactions and appearance features in a single association chain. This work is designed to explore the usage of graph neural networks and their message passing abilities during tracking to come up with stronger data associations. This thesis combines all steps from detection through association using state-of-the-art methods along with novel re-identification applications. The metrics used to determine success are Multi-Object Tracking Accuracy (MOTA), Multi-Object Tracking Precision (MOTP), ID Switches (IDs), Mostly Tracked, and Mostly Lost. Within this work, the combination of multiple appearance feature vectors to create a stronger single feature vector is explored to improve accuracy. Different types of data augmentation, such as random erase and random noise, are explored and their results are examined for effectiveness during tracking. A unique application of triplet loss is also implemented to improve overall network performance. Throughout testing, baseline models have been improved upon and each successive improvement is added to the final model output. Each of the improvements results in the sacrifice of some performance metrics, but the overall benefits outweigh the costs. The datasets used during this thesis are the UAVDT Benchmark and the MOT Challenge Dataset. These datasets cover aerial-based vehicle tracking and pedestrian tracking. The UAVDT Benchmark and MOT Challenge Dataset feature crowded scenery as well as substantial object overlap.
This thesis demonstrates the increased matching capabilities of a graph network when paired with a robust and accurate object detector as well as an improved set of appearance feature vectors.
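For readers unfamiliar with the headline metric named in the abstract above, the sketch below computes MOTA using its commonly cited definition: one minus the ratio of summed errors (misses, false positives, identity switches) to the total number of ground-truth objects over all frames. The function name `mota` and the sample counts are illustrative assumptions and are not taken from the thesis.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multi-Object Tracking Accuracy from error counts summed over all frames.

    num_gt is the total number of ground-truth objects over all frames.
    The score can be negative when errors outnumber ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Illustrative counts only: 20 total errors against 100 ground-truth objects.
score = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
```

Because all three error types share one denominator, a tracker can trade one kind of error for another without changing its MOTA, which is one reason the abstract lists additional metrics alongside it.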

RIT Scholar Works

Downstream Task Self-Supervised Learning for Object Recognition and Tracking

Author: Siddique Abubakar
Publication venue: e-Publications@Marquette
Publication date: 01/04/2023

This dissertation addresses three limitations of deep learning methods in image and video understanding-based machine vision applications. Firstly, although deep convolutional neural networks (CNNs) are efficient for image recognition applications such as object detection and segmentation, they perform poorly under perspective distortions. In real-world applications, the camera perspective is a common problem that we can address by annotating large amounts of data, thus limiting the applicability of the deep learning models. Secondly, the typical approach for single-camera tracking problems is to use separate motion and appearance models, which are expensive in terms of computations and training data requirements. Finally, conventional multi-camera video understanding techniques use supervised learning algorithms to determine temporal relationships among objects. In large-scale applications, these methods are also limited by the requirement of extensive manually annotated data and computational resources. To address these limitations, we develop an uncertainty-aware self-supervised learning (SSL) technique that captures a model's instance or semantic segmentation uncertainty from overhead images and guides the model to learn the impact of the new perspective on object appearance. The test-time data augmentation-based pseudo-label refinement technique continuously trains a model until convergence on new perspective images. The proposed method can be applied for both self-supervision and semi-supervision, thus increasing the effectiveness of a deep pre-trained model in new domains. Extensive experiments demonstrate the effectiveness of the SSL technique in both object detection and semantic segmentation problems. In video understanding applications, we introduce simultaneous segmentation and tracking as an unsupervised spatio-temporal latent feature clustering problem.
The jointly learned multi-task features leverage the task-dependent uncertainty to generate discriminative features in multi-object videos. Experiments have shown that the proposed tracker outperforms several state-of-the-art supervised methods. Finally, we propose an unsupervised multi-camera tracklet association (MCTA) algorithm to track multiple objects in real time. MCTA leverages the self-supervised detector model for single-camera tracking and solves the multi-camera tracking problem using multiple pair-wise camera associations modeled as a connected graph. The graph optimization method generates a global solution for partially or fully overlapping camera networks.

epublications@Marquette

SANet: Structure-Aware Network for Visual Tracking

Authors: Fan Heng, Ling Haibin
Publication date: 01/05/2017

Convolutional neural networks (CNNs) have drawn increasing interest in visual tracking owing to their power in feature extraction. Most existing CNN-based trackers treat tracking as a classification problem. However, these trackers are sensitive to similar distractors because their CNN models mainly focus on inter-class classification. To address this problem, we use self-structure information of the object to distinguish it from distractors. Specifically, we utilize a recurrent neural network (RNN) to model object structure, and incorporate it into the CNN to improve its robustness to similar distractors. Considering that convolutional layers at different levels characterize the object from different perspectives, we use multiple RNNs to model object structure at different levels respectively. Extensive experiments on three benchmarks, OTB100, TC-128 and VOT2015, show that the proposed algorithm outperforms other methods. Code is released at http://www.dabi.temple.edu/~hbling/code/SANet/SANet.html. Comment: In CVPR Deep Vision Workshop, 201

arXiv.org e-Print Archive

Crossref

Skeleton-based Action Recognition of People Handling Objects

Authors: Choi Jin Young, Kim Sunoh, Park Jongyoul, Yun Kimin
Publication date: 21/01/2019

In visual surveillance systems, it is necessary to recognize the behavior of people handling objects such as a phone, a cup, or a plastic bag. In this paper, to address this problem, we propose a new framework for recognizing object-related human actions by graph convolutional networks using human and object poses. In this framework, we construct skeletal graphs of reliable human poses by selectively sampling the informative frames in a video, which include human joints with high confidence scores obtained in pose estimation. The skeletal graphs generated from the sampled frames represent human poses related to the object position in both the spatial and temporal domains, and these graphs are used as inputs to the graph convolutional networks. Through experiments on an open benchmark and our own data sets, we verify the validity of our framework in that our method outperforms the state-of-the-art method for skeleton-based action recognition. Comment: Accepted in WACV 201

arXiv.org e-Print Archive