125 research outputs found

    End-to-end Learning of Multi-sensor 3D Tracking by Detection

    Full text link
    In this paper we propose a novel approach to tracking by detection that can exploit both camera and LIDAR data to produce very accurate 3D trajectories. Towards this goal, we formulate the problem as a linear program that can be solved exactly, and learn convolutional networks for detection as well as matching in an end-to-end manner. We evaluate our model on the challenging KITTI dataset and show very competitive results.
    Comment: Presented at the IEEE International Conference on Robotics and Automation (ICRA), 201
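    The exact-assignment step of a tracking-by-detection formulation like the one above can be illustrated on a tiny instance. This is only a sketch: the 3x3 cost matrix is hypothetical (in the paper the costs come from learned matching networks and the program is solved as a linear program), and here the optimum is simply found by enumerating all assignments, which is exact for small problems.

    ```python
    from itertools import permutations

    def match_detections(cost):
        """Exactly solve the frame-to-frame assignment that a
        tracking-by-detection linear program encodes, by brute force.
        cost[i][j] = cost of matching detection i at time t with
        detection j at time t+1 (e.g. from a learned matching net)."""
        n = len(cost)
        best_cost, best_perm = float("inf"), None
        for perm in permutations(range(n)):
            total = sum(cost[i][perm[i]] for i in range(n))
            if total < best_cost:
                best_cost, best_perm = total, perm
        return best_perm, best_cost

    # Toy 3x3 cost matrix (lower = detections more likely to match).
    cost = [
        [0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.6, 0.3],
    ]
    assignment, total_cost = match_detections(cost)
    print(assignment, round(total_cost, 2))  # diagonal assignment wins
    ```

    For real problem sizes one would solve the relaxed linear program (whose constraint matrix is totally unimodular, so the relaxation is tight), but the enumerated optimum coincides with it on small instances.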

    GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds

    Full text link
    Current 3D single object tracking methods are typically based on VoteNet, a 3D region proposal network. Despite their success, using a single seed point feature as the cue for offset learning in VoteNet prevents high-quality 3D proposals from being generated. Moreover, seed points with different importance are treated equally in the voting process, aggravating this defect. To address these issues, we propose a novel global-local transformer voting scheme to provide more informative cues and guide the model to pay more attention to potential seed points, promoting the generation of high-quality 3D proposals. Technically, a global-local transformer (GLT) module is employed to integrate object- and patch-aware priors into seed point features to effectively form a strong feature representation for the geometric positions of the seed points, thus providing more robust and accurate cues for offset learning. Subsequently, a simple yet effective training strategy is designed to train the GLT module. We develop an importance prediction branch to learn the potential importance of the seed points and treat the output weight vector as a training constraint term. By incorporating the above components, we obtain a superior tracking method, GLT-T. Extensive experiments on the challenging KITTI and NuScenes benchmarks demonstrate that GLT-T achieves state-of-the-art performance in the 3D single object tracking task. In addition, further ablation studies show the advantages of the proposed global-local transformer voting scheme over the original VoteNet. Code and models will be available at https://github.com/haooozi/GLT-T.
    Comment: Accepted to AAAI 2023.
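    The core idea of weighting seed points by learned importance, rather than treating them equally as in plain VoteNet, can be sketched as follows. This is a minimal illustration, not the paper's method: the seed positions, offsets, and importance scores are hypothetical stand-ins for what the backbone, offset head, and importance prediction branch would produce.

    ```python
    import math

    def weighted_vote(seeds, offsets, importance):
        """Aggregate per-seed object-center votes using learned
        importance scores, instead of averaging all votes equally.
        seeds, offsets: lists of (x, y, z); importance: raw scores."""
        # Softmax over importance scores -> normalized vote weights.
        m = max(importance)
        exp = [math.exp(s - m) for s in importance]
        z = sum(exp)
        w = [e / z for e in exp]
        # Each seed votes for a center at seed + predicted offset.
        votes = [tuple(s[i] + o[i] for i in range(3))
                 for s, o in zip(seeds, offsets)]
        # Importance-weighted average of the votes.
        center = tuple(sum(w[k] * v[i] for k, v in enumerate(votes))
                       for i in range(3))
        return center, w

    seeds = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    offsets = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
    center, weights = weighted_vote(seeds, offsets, [0.0, 0.0])
    print(center)  # both seeds vote for the same center here
    ```

    With unequal importance scores the aggregated center shifts toward the votes of the more trusted seeds, which is the intuition behind constraining training with the predicted weight vector.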

    Sensor Fusion for Object Detection and Tracking in Autonomous Vehicles

    Get PDF
    Autonomous driving vehicles depend on their perception system to understand the environment and identify all static and dynamic obstacles surrounding the vehicle. The perception system in an autonomous vehicle uses the sensory data obtained from different sensor modalities to understand the environment and perform a variety of tasks such as object detection and object tracking. Combining the outputs of different sensors to obtain a more reliable and robust outcome is called sensor fusion. This dissertation studies the problem of sensor fusion for object detection and object tracking in autonomous driving vehicles and explores different approaches for utilizing deep neural networks to accurately and efficiently fuse sensory data from different sensing modalities. In particular, this dissertation focuses on fusing radar and camera data for 2D and 3D object detection and object tracking tasks. First, the effectiveness of radar and camera fusion for 2D object detection is investigated by introducing a radar region proposal algorithm for generating object proposals in a two-stage object detection network. The evaluation results show significant improvement in speed and accuracy compared to a vision-based proposal generation method. Next, radar and camera fusion is used for the task of joint object detection and depth estimation, where the radar data is used in conjunction with image features not only to generate object proposals but also to provide accurate depth estimates for the detected objects in the scene. A fusion algorithm is also proposed for 3D object detection, where the depth and velocity data obtained from the radar is fused with the camera images to detect objects in 3D and also accurately estimate their velocities without requiring any temporal information. Finally, radar and camera sensor fusion is used for 3D multi-object tracking by introducing an end-to-end trainable and online network capable of tracking objects in real-time.
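    A radar region-proposal step of the kind described above can be sketched as follows. This is a hypothetical illustration, not the dissertation's algorithm: each radar return is projected through an assumed pinhole camera model, and an anchor box is centred on the projection with apparent size shrinking with range, replacing a dense vision-based proposal search.

    ```python
    def radar_proposals(radar_points, fx, fy, cx, cy, base_size=2.0):
        """Generate image-plane proposal boxes from radar returns.
        radar_points: (x right, y down, z forward) in metres, in the
        camera frame; fx, fy, cx, cy: pinhole intrinsics;
        base_size: assumed object extent in metres."""
        boxes = []
        for x, y, z in radar_points:
            if z <= 0:
                continue  # return is behind the camera plane
            # Pinhole projection of the radar point.
            u = fx * x / z + cx
            v = fy * y / z + cy
            # Apparent half-size in pixels shrinks with range z.
            half = base_size * fx / z / 2
            boxes.append((u - half, v - half, u + half, v + half))
        return boxes

    # One return 10 m straight ahead; one behind the camera (dropped).
    boxes = radar_proposals([(0.0, 0.0, 10.0), (0.0, 0.0, -5.0)],
                            fx=500, fy=500, cx=320, cy=240)
    print(boxes)
    ```

    In a two-stage detector these boxes would be fed to the second stage in place of (or alongside) learned proposals, which is where the reported speed gain over vision-based proposal generation comes from.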

    Learned perception systems for self-driving vehicles

    Get PDF
    2022 Spring. Includes bibliographical references.
    Building self-driving vehicles is one of the most impactful technological challenges of modern artificial intelligence. Self-driving vehicles are widely anticipated to revolutionize the way people and freight move. In this dissertation, we present a collection of work that aims to improve the capability of the perception module, an essential module for safe and reliable autonomous driving. Specifically, it focuses on two perception topics: 1) geo-localization (mapping) of spatially-compact static objects, and 2) multi-target object detection and tracking of moving objects in the scene. Accurately estimating the position of static objects, such as traffic lights, from the moving camera of a self-driving car is a challenging problem. In this dissertation, we present a system that improves the localization of static objects by jointly optimizing the components of the system via learning. Our system comprises networks that perform: 1) 5DoF object pose estimation from a single image, 2) association of objects between pairs of frames, and 3) multi-object tracking to produce the final geo-localization of the static objects within the scene. We evaluate our approach using a publicly available dataset, focusing on traffic lights due to data availability. For each component, we compare against contemporary alternatives and show significantly improved performance. We also show that the end-to-end system performance is further improved via joint training of the constituent models. Next, we propose an efficient joint detection and tracking model named DEFT, or "Detection Embeddings for Tracking." The proposed approach relies on an appearance-based object matching network jointly learned with an underlying object detection network. An LSTM is also added to capture motion constraints.
DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards while having significant advantages in robustness when applied to more challenging tracking data. DEFT raises the bar on the nuScenes monocular 3D tracking challenge, more than doubling the performance of the previous top method (3.8x on AMOTA, 2.1x on MOTAR). We analyze the difference in performance between DEFT and the next best-published method on nuScenes and find that DEFT is more robust to occlusions and large inter-frame displacements, making it a superior choice for many use-cases. Third, we present an end-to-end model to solve the tasks of detection, tracking, and sequence modeling from raw sensor data, called Attention-based DEFT. Attention-based DEFT extends the original DEFT by adding an attentional encoder module that uses attention to compute a tracklet embedding that 1) jointly reasons about the tracklet's dependencies and interactions with other objects present in the scene and 2) captures the context and temporal information of the tracklet's past observations. The experimental results show that Attention-based DEFT performs favorably against or comparably to state-of-the-art trackers. Reasoning about the interactions between the actors in the scene allows Attention-based DEFT to boost tracking performance in heavily crowded and complex interactive scenes. We validate the sequence modeling effectiveness of the proposed approach by showing its superiority over other baseline methods on the velocity estimation task on both simple and complex scenes. The experiments demonstrate the effectiveness of Attention-based DEFT at capturing the spatio-temporal interactions of the crowd for velocity estimation, which helps it remain robust to the complexities of densely crowded scenes. The experimental results show that all the joint models in this dissertation perform better than solving each problem independently.
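    The appearance-embedding association at the heart of a detection-embeddings tracker can be sketched as a similarity matching step. This is a simplified, hypothetical version: DEFT learns the embeddings jointly with the detector and combines them with LSTM motion constraints, whereas here the embeddings are given and matching is greedy cosine similarity with a threshold.

    ```python
    import math

    def cosine(a, b):
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def associate(track_embs, det_embs, threshold=0.5):
        """Greedily match each existing track to its most similar
        unmatched detection, if similarity clears the threshold.
        Returns {track_index: detection_index}."""
        matches, used = {}, set()
        for t, te in enumerate(track_embs):
            best_j, best_s = None, threshold
            for j, de in enumerate(det_embs):
                if j in used:
                    continue
                s = cosine(te, de)
                if s > best_s:
                    best_j, best_s = j, s
            if best_j is not None:
                matches[t] = best_j
                used.add(best_j)
        return matches

    # Two tracks, two detections whose order is swapped between frames.
    matches = associate([(1.0, 0.0), (0.0, 1.0)],
                        [(0.0, 1.0), (1.0, 0.0)])
    print(matches)  # each track finds its reordered detection
    ```

    Unmatched detections would spawn new tracks and unmatched tracks would be carried forward or terminated; robustness to occlusion comes from keeping embeddings for tracks that temporarily miss a match.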

    Multi-Object Detection, Pose Estimation and Tracking in Panoramic Monocular Imagery for Autonomous Vehicle Perception

    Get PDF
    While active sensors such as radar, laser-based ranging (LiDAR) and ultrasonic sensors are nearly ubiquitous in modern autonomous vehicle prototypes, cameras are more versatile and remain essential for tasks such as road marking detection and road sign reading. Active sensing technologies are widely used because active sensors are, by nature, usually more reliable than cameras at detecting objects; however, they offer lower resolution and fail in challenging environmental conditions such as rain and heavy reflections, as well as on materials such as black paint. Therefore, in this work, we focus primarily on passive sensing technologies. More specifically, we look at monocular imagery and the extent to which it can be used as a replacement for more complex sensing systems such as stereo, multi-view cameras and LiDAR. Whilst the main strength of LiDAR is its ability to measure distances and naturally enable 3D reasoning, camera-based object detection is typically restricted to the 2D image space. We propose a convolutional neural network extending object detection to estimate the 3D pose and velocity of objects from a single monocular camera. Our approach is based on a siamese neural network able to process pairs of video frames to integrate temporal information. While prior work has focused almost exclusively on the processing of forward-facing rectified rectilinear vehicle-mounted cameras, there are no studies of panoramic imagery in the context of autonomous driving. We introduce an approach to adapt existing convolutional neural networks to unseen 360° panoramic imagery using domain adaptation via style transfer. We also introduce a new synthetic evaluation dataset and benchmark for 3D object detection and depth estimation in automotive panoramic imagery. Multi-object tracking-by-detection is often split into two parts: a detector and a tracker.
In contrast, we investigate the use of end-to-end recurrent convolutional networks to process automotive video sequences to jointly detect and track objects through time. We present a multitask neural network able to track the 3D pose of objects online in panoramic video sequences. Our work highlights that monocular imagery, in conjunction with the proposed algorithmic approaches, can offer an effective replacement for more expensive active sensors to estimate depth and to estimate and track the 3D pose of objects surrounding the ego-vehicle; thus demonstrating that autonomous driving could be achieved using a limited number of cameras or even a single 360° panoramic camera, akin to human driver perception.
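    The geometric gap between the rectilinear images most detectors are trained on and the 360° panoramas discussed above comes down to the equirectangular projection. As a minimal sketch (standard equirectangular geometry, not code from the work itself), a 3D viewing direction maps to panorama pixels via its longitude and latitude:

    ```python
    import math

    def ray_to_equirect(x, y, z, width, height):
        """Map a 3D direction in the camera frame (x right, y down,
        z forward) to pixel coordinates in a 360-degree equirectangular
        panorama of size width x height."""
        lon = math.atan2(x, z)  # -pi..pi, 0 = straight ahead
        lat = math.asin(y / math.sqrt(x * x + y * y + z * z))  # -pi/2..pi/2
        u = (lon / (2 * math.pi) + 0.5) * width
        v = (lat / math.pi + 0.5) * height
        return u, v

    # Straight ahead lands at the panorama centre; a ray to the right
    # lands three quarters of the way across.
    print(ray_to_equirect(0.0, 0.0, 1.0, 1000, 500))
    print(ray_to_equirect(1.0, 0.0, 0.0, 1000, 500))
    ```

    Because horizontal distortion varies strongly with latitude under this mapping, convolutional features learned on rectilinear imagery transfer poorly, which motivates the style-transfer domain adaptation described in the abstract.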