Joint Detection and Tracking in Videos with Identification Features
Recent works have shown that combining object detection and tracking tasks,
in the case of video data, results in higher performance for both tasks, but
they impose a high frame rate as a strict requirement for performance. This
assumption is often violated in real-world applications, where models run on
embedded devices, often at only a few frames per second.
Videos at low frame rates suffer from large object displacements. Here,
re-identification features can help match detections of objects with large
displacements, but current joint detection and re-identification
formulations degrade detector performance, as the two are conflicting
tasks. In real-world applications, keeping separate detector and re-id
models is often not feasible, as both memory and runtime effectively
double.
Towards robust long-term tracking applicable to reduced-computational-power
devices, we propose the first joint optimization of detection, tracking and
re-identification features for videos. Notably, our joint optimization
maintains the detector performance, a typical multi-task challenge. At
inference time, we leverage detections for tracking (tracking-by-detection)
when the objects are visible, detectable and slowly moving in the image. We
leverage instead re-identification features to match objects which disappeared
(e.g. due to occlusion) for several frames or were not tracked due to fast
motion (or low-frame-rate videos). Our proposed method reaches the state of
the art on MOT, ranks 1st among online trackers in the UA-DETRAC'18
tracking challenge, and 3rd overall.
Comment: Accepted at the Image and Vision Computing journal
Sensor Fusion for Object Detection and Tracking in Autonomous Vehicles
Autonomous driving vehicles depend on their perception system to understand the environment and identify all static and dynamic obstacles surrounding the vehicle. The perception system in an autonomous vehicle uses the sensory data obtained from different sensor modalities to understand the environment and perform a variety of tasks such as object detection and object tracking. Combining the outputs of different sensors to obtain a more reliable and robust outcome is called sensor fusion. This dissertation studies the problem of sensor fusion for object detection and object tracking in autonomous driving vehicles and explores different approaches for utilizing deep neural networks to accurately and efficiently fuse sensory data from different sensing modalities.
In particular, this dissertation focuses on fusing radar and camera data for 2D and 3D object detection and object tracking tasks. First, the effectiveness of radar and camera fusion for 2D object detection is investigated by introducing a radar region proposal algorithm for generating object proposals in a two-stage object detection network. The evaluation results show significant improvement in speed and accuracy compared to a vision-based proposal generation method. Next, radar and camera fusion is used for the task of joint object detection and depth estimation, where the radar data is used in conjunction with image features not only to generate object proposals but also to provide accurate depth estimates for the detected objects in the scene. A fusion algorithm is also proposed for 3D object detection, where the depth and velocity data obtained from the radar is fused with the camera images to detect objects in 3D and accurately estimate their velocities without requiring any temporal information. Finally, radar and camera sensor fusion is used for 3D multi-object tracking by introducing an end-to-end trainable and online network capable of tracking objects in real-time.
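A radar region proposal step of the kind described above might look like the following sketch; the pinhole intrinsics, the range-based box sizing, and all parameter values are assumptions for illustration, not the dissertation's algorithm:

```python
# Sketch: project radar points into the image plane and center one
# anchor box on each point, shrinking the box with radar range so
# distant objects get smaller proposals. All constants are hypothetical.

def project_to_image(point_3d, fx=700.0, fy=700.0, cx=640.0, cy=360.0):
    """Pinhole projection of a radar point (x right, y down, z forward)."""
    x, y, z = point_3d
    return (fx * x / z + cx, fy * y / z + cy)

def radar_proposals(radar_points, base_size=2000.0):
    """One (x1, y1, x2, y2) proposal per radar point, scaled by range."""
    proposals = []
    for pt in radar_points:
        u, v = project_to_image(pt)
        rng = (pt[0] ** 2 + pt[1] ** 2 + pt[2] ** 2) ** 0.5
        half = base_size / max(rng, 1.0) / 2.0  # smaller box when farther
        proposals.append((u - half, v - half, u + half, v + half))
    return proposals
```

Compared with a vision-based proposal generator, such a step skips the dense sliding-window search entirely, which is the speed advantage the abstract reports.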
Which Framework is Suitable for Online 3D Multi-Object Tracking for Autonomous Driving with Automotive 4D Imaging Radar?
Online 3D multi-object tracking (MOT) has recently received significant
research interest due to the expanding demand for 3D perception in advanced
driver assistance systems (ADAS) and autonomous driving (AD). Among the
existing 3D MOT frameworks for ADAS and AD, conventional point object tracking
(POT) framework using the tracking-by-detection (TBD) strategy has been well
studied and accepted for LiDAR and 4D imaging radar point clouds. In contrast,
extended object tracking (EOT), another important framework which accepts the
joint-detection-and-tracking (JDT) strategy, has rarely been explored for
online 3D MOT applications. This paper provides the first systematic
investigation of the EOT framework for online 3D MOT in real-world ADAS and AD
scenarios. Specifically, the widely accepted TBD-POT framework, the recently
investigated JDT-EOT framework, and our proposed TBD-EOT framework are compared
via extensive evaluations on two open source 4D imaging radar datasets:
View-of-Delft and TJ4DRadSet. Experimental results demonstrate that the
conventional TBD-POT framework remains preferable for online 3D MOT with high
tracking performance and low computational complexity, while the proposed
TBD-EOT framework has the potential to outperform it in certain situations.
However, the results also show that the JDT-EOT framework encounters multiple
problems and performs inadequately in evaluation scenarios. After analyzing the
causes of these phenomena based on various evaluation metrics and
visualizations, we provide possible guidelines to improve the performance of
these MOT frameworks on real-world data. These findings provide the first benchmark and
important insights for the future development of 4D imaging radar-based online
3D MOT.
Comment: 8 pages, 5 figures, submitted to the 2024 IEEE International
Conference on Robotics and Automation (ICRA 2024)
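The TBD-POT recipe that the paper finds preferable can be illustrated with a minimal sketch: constant-velocity prediction of point-object states followed by gated greedy nearest-neighbor association. The state layout, time step, and gate are illustrative assumptions (real systems typically use a Kalman filter and Hungarian assignment):

```python
# Sketch of tracking-by-detection for point objects (TBD-POT):
# predict each track forward, then assign detections greedily.

def predict(track, dt=0.1):
    """Constant-velocity prediction of a point-object state (x, y, vx, vy)."""
    x, y, vx, vy = track
    return (x + vx * dt, y + vy * dt, vx, vy)

def greedy_associate(tracks, detections, gate=2.0):
    """Greedy nearest-neighbor assignment of detections to predicted tracks."""
    assignments, used = {}, set()
    for ti, tr in enumerate(tracks):
        px, py, _, _ = predict(tr)
        best, best_d = None, gate  # gate rejects implausible matches
        for di, (dx, dy) in enumerate(detections):
            if di in used:
                continue
            d = ((px - dx) ** 2 + (py - dy) ** 2) ** 0.5
            if d < best_d:
                best, best_d = di, d
        if best is not None:
            assignments[ti] = best
            used.add(best)
    return assignments
```

EOT frameworks differ from this sketch in that one track may absorb many measurements (an extent estimate), which is exactly where the paper finds both the potential gains and the failure modes.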
DeVIS: Making Deformable Transformers Work for Video Instance Segmentation
Video Instance Segmentation (VIS) jointly tackles multi-object detection,
tracking, and segmentation in video sequences. In the past, VIS methods
mirrored the fragmentation of these subtasks in their architectural design,
hence missing out on a joint solution. Transformers recently made it
possible to cast the entire VIS task as a single set-prediction problem.
Nevertheless, the quadratic complexity of existing Transformer-based
methods leads to long training times, high memory requirements, and the
processing of single-scale, low-resolution feature maps. Deformable
attention provides a more efficient alternative, but its application to the
temporal domain and the segmentation task has not yet been explored.
In this work, we present Deformable VIS (DeVIS), a VIS method which
capitalizes on the efficiency and performance of deformable Transformers. To
reason about all VIS subtasks jointly over multiple frames, we present temporal
multi-scale deformable attention with instance-aware object queries. We further
introduce a new image and video instance mask head with multi-scale features,
and perform near-online video processing with multi-cue clip tracking. DeVIS
reduces memory as well as training time requirements, and achieves
state-of-the-art results on the YouTube-VIS 2021, as well as the challenging
OVIS dataset.
Code is available at https://github.com/acaelles97/DeVIS
Learned perception systems for self-driving vehicles
Spring 2022. Building self-driving vehicles is one of the most impactful technological challenges of modern artificial intelligence. Self-driving vehicles are widely anticipated to revolutionize the way people and freight move. In this dissertation, we present a collection of work that aims to improve the capability of the perception module, an essential module for safe and reliable autonomous driving. Specifically, it focuses on two perception topics: 1) geo-localization (mapping) of spatially-compact static objects, and 2) multi-target object detection and tracking of moving objects in the scene. Accurately estimating the position of static objects, such as traffic lights, from the moving camera of a self-driving car is a challenging problem. In this dissertation, we present a system that improves the localization of static objects by jointly optimizing the components of the system via learning. Our system comprises networks that perform: 1) 5DoF object pose estimation from a single image, 2) association of objects between pairs of frames, and 3) multi-object tracking to produce the final geo-localization of the static objects within the scene. We evaluate our approach using a publicly available data set, focusing on traffic lights due to data availability. For each component, we compare against contemporary alternatives and show significantly improved performance. We also show that the end-to-end system performance is further improved via joint training of the constituent models. Next, we propose an efficient joint detection and tracking model named DEFT, or "Detection Embeddings for Tracking." The proposed approach relies on an appearance-based object matching network jointly learned with an underlying object detection network. An LSTM is also added to capture motion constraints.
DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards while having significant advantages in robustness when applied to more challenging tracking data. DEFT raises the bar on the nuScenes monocular 3D tracking challenge, more than doubling the performance of the previous top method (3.8x on AMOTA, 2.1x on MOTAR). We analyze the difference in performance between DEFT and the next-best published method on nuScenes and find that DEFT is more robust to occlusions and large inter-frame displacements, making it a superior choice for many use-cases. Third, we present an end-to-end model to solve the tasks of detection, tracking, and sequence modeling from raw sensor data, called Attention-based DEFT. Attention-based DEFT extends the original DEFT by adding an attentional encoder module that uses attention to compute a tracklet embedding that 1) jointly reasons about the tracklet's dependencies and interactions with other objects present in the scene and 2) captures the context and temporal information of the tracklet's past observations. The experimental results show that Attention-based DEFT performs favorably against, or comparably to, state-of-the-art trackers. Reasoning about the interactions between the actors in the scene allows Attention-based DEFT to boost tracking performance in heavily crowded and complex interactive scenes. We validate the sequence modeling effectiveness of the proposed approach by showing its superiority over other baseline methods for the velocity estimation task on both simple and complex scenes. The experiments demonstrate the effectiveness of Attention-based DEFT at capturing the spatio-temporal interactions of the crowd for velocity estimation, which makes it more robust to the complexities of densely crowded scenes. The experimental results show that all the joint models in this dissertation perform better than solving each problem independently.
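DEFT-style matching, as described above, pairs appearance affinities with motion constraints; a minimal sketch follows, in which a simple displacement gate stands in for the LSTM and all names are hypothetical:

```python
# Sketch: appearance affinity matrix between track and detection
# embeddings, masked by a motion gate that vetoes physically
# implausible matches (the role the abstract assigns to the LSTM).

def affinity_matrix(track_embs, det_embs):
    """Dot-product appearance affinity between tracks and detections."""
    return [[sum(t * d for t, d in zip(te, de)) for de in det_embs]
            for te in track_embs]

def motion_gate(track_pos, det_pos, max_disp=50.0):
    """1.0 if the inter-frame displacement is plausible, else 0.0."""
    return [[1.0 if abs(tp[0] - dp[0]) + abs(tp[1] - dp[1]) <= max_disp
             else 0.0 for dp in det_pos] for tp in track_pos]

def gated_affinity(track_embs, track_pos, det_embs, det_pos):
    """Appearance scores zeroed wherever the motion gate fires."""
    aff = affinity_matrix(track_embs, det_embs)
    gate = motion_gate(track_pos, det_pos)
    return [[a * g for a, g in zip(ar, gr)]
            for ar, gr in zip(aff, gate)]
```

The gate is what keeps look-alike objects on opposite sides of the image from being confused, which is one reason joint models of this kind stay robust under occlusion.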
Rethinking the competition between detection and ReID in Multi-Object Tracking
Owing to their balanced accuracy and speed, one-shot models that jointly
learn detection and ReID have drawn great attention in multi-object
tracking (MOT). However, the differences between these two tasks in the
one-shot tracking paradigm are often overlooked, leading to inferior
performance compared with two-stage methods. In this paper, we dissect the
reasoning processes of the two tasks. Our analysis reveals that the
competition between them inevitably hurts the learning of task-dependent
representations, which further impedes tracking performance. To remedy
this issue, we propose a novel
cross-correlation network that can effectively impel the separate branches to
learn task-dependent representations. Furthermore, we introduce a scale-aware
attention network that learns discriminative embeddings to improve the ReID
capability. We integrate the delicately designed networks into a one-shot
online MOT system, dubbed CSTrack. Without bells and whistles, our model
achieves new state-of-the-art performance on MOT16 and MOT17. Our code is
released at https://github.com/JudasDie/SOTS
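The scale-aware attention idea can be illustrated with a minimal sketch in which per-scale re-ID embeddings are fused with softmax weights; the scoring and fusion scheme here is an assumption for illustration, not CSTrack's actual network:

```python
# Sketch: fuse re-ID embeddings pooled at several feature scales with
# softmax attention weights, so each object emphasises the scale that
# best matches its size. Scores and layouts are hypothetical.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scale_aware_embedding(scale_embs, scale_scores):
    """Weighted sum of per-scale embeddings using attention weights."""
    w = softmax(scale_scores)
    dim = len(scale_embs[0])
    return [sum(w[s] * scale_embs[s][i] for s in range(len(scale_embs)))
            for i in range(dim)]
```

With equal scores this reduces to plain averaging; a strongly dominant score makes the output collapse onto a single scale's embedding, which is the discriminative behavior the abstract attributes to the attention network.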