Online Map Vectorization for Autonomous Driving: A Rasterization Perspective
A vectorized high-definition (HD) map is essential for autonomous driving,
providing detailed and precise environmental information for advanced
perception and planning. However, current map vectorization methods often
exhibit deviations, and the existing evaluation metric for map vectorization
lacks sufficient sensitivity to detect these deviations. To address these
limitations, we propose integrating the philosophy of rasterization into map
vectorization. Specifically, we introduce a new rasterization-based evaluation
metric, which has superior sensitivity and is better suited to real-world
autonomous driving scenarios. Furthermore, we propose MapVR (Map Vectorization
via Rasterization), a novel framework that applies differentiable rasterization
to vectorized outputs and then performs precise and geometry-aware supervision
on rasterized HD maps. Notably, MapVR designs tailored rasterization strategies
for various geometric shapes, enabling effective adaptation to a wide range of
map elements. Experiments show that incorporating rasterization into map
vectorization greatly enhances performance with no extra computational cost
during inference, leading to more accurate map perception and ultimately
promoting safer autonomous driving.
Comment: NeurIPS 2023
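To make the rasterization philosophy concrete, here is a minimal, hedged sketch of differentiable polyline rasterization with raster-space supervision. It is written in the spirit of MapVR rather than as the authors' implementation: the Gaussian soft mask, the grid size, and the dice loss are all assumptions.

```python
# Minimal sketch: differentiable rasterization of a polyline plus a
# raster-space loss. Inspired by the MapVR idea above; NOT the authors'
# code. Soft-mask form, sigma, grid size, and dice loss are assumptions.
import torch

def soft_rasterize(polyline, size=64, sigma=1.5):
    """Render a polyline (N, 2), in pixel coordinates, to a soft (size, size)
    mask via a Gaussian of the distance to the nearest segment."""
    ys, xs = torch.meshgrid(
        torch.arange(size, dtype=torch.float32),
        torch.arange(size, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 1, 2)       # (P, 1, 2)
    a, b = polyline[:-1], polyline[1:]                          # segment endpoints
    ab = b - a                                                  # (S, 2)
    t = ((pix - a) * ab).sum(-1) / (ab * ab).sum(-1).clamp(min=1e-6)
    closest = a + t.clamp(0.0, 1.0).unsqueeze(-1) * ab          # (P, S, 2)
    d2 = ((pix - closest) ** 2).sum(-1).min(dim=1).values       # squared distance
    return torch.exp(-d2 / (2 * sigma ** 2)).reshape(size, size)

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

# Gradients flow from the raster-space loss back to the polyline vertices.
pred_pts = (torch.rand(8, 2) * 63).requires_grad_()
gt_mask = soft_rasterize(torch.tensor([[5.0, 5.0], [60.0, 58.0]]))
dice_loss(soft_rasterize(pred_pts), gt_mask).backward()
```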
VectorMapNet: End-to-end Vectorized HD Map Learning
Autonomous driving systems require a good understanding of surrounding
environments, including moving obstacles and static High-Definition (HD)
semantic map elements. Existing methods approach the semantic map problem by
offline manual annotation, which suffers from serious scalability issues.
Recent learning-based methods produce dense rasterized segmentation predictions
to construct maps. However, these predictions do not include instance
information of individual map elements and require heuristic post-processing to
obtain vectorized maps. To tackle these challenges, we introduce an end-to-end
vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes
onboard sensor observations and predicts a sparse set of polylines in the
bird's-eye view. This pipeline can explicitly model the spatial relation
between map elements and generate vectorized maps that are friendly to
downstream autonomous driving tasks. Extensive experiments show that
VectorMapNet achieves strong map learning performance on both the nuScenes and
Argoverse2 datasets, surpassing previous state-of-the-art methods by 14.2 mAP
and 14.6 mAP, respectively. Qualitatively, we also show that VectorMapNet is capable of
generating comprehensive maps and capturing more fine-grained details of road
geometry. To the best of our knowledge, VectorMapNet is the first work designed
towards end-to-end vectorized map learning from onboard observations. Our
project website is available at
https://tsinghua-mars-lab.github.io/vectormapnet/
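As background for how sparse polyline predictions are typically matched and scored against ground truth in this line of work, here is a hedged NumPy sketch of the symmetric Chamfer distance; whether VectorMapNet uses exactly this cost is not claimed here.

```python
# Hedged sketch: symmetric Chamfer distance between two polylines sampled
# as point sets, a common matching/evaluation cost for vectorized maps.
import numpy as np

def chamfer_distance(p, q):
    """p: (N, 2) and q: (M, 2) sampled polyline points in BEV coordinates."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0]])
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(chamfer_distance(pred, gt))  # small value for a near-perfect lane
```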
LaneSegNet: Map Learning with Lane Segment Perception for Autonomous Driving
A map, as crucial information for downstream applications of an autonomous
driving system, is usually represented in lanelines or centerlines. However,
existing literature on map learning primarily focuses on either detecting
geometry-based lanelines or perceiving topology relationships of centerlines.
Both of these methods ignore the intrinsic relationship between lanelines and
centerlines, namely that lanelines bind centerlines. Since simply predicting both
types of lane in one model leads to conflicting learning objectives, we instead
advocate the lane segment as a new representation that seamlessly incorporates both
geometry and topology information. Thus, we introduce LaneSegNet, the first
end-to-end mapping network generating lane segments to obtain a complete
representation of the road structure. Our algorithm features two key
modifications. One is a lane attention module to capture pivotal region details
within the long-range feature space. Another is an identical initialization
strategy for reference points, which enhances the learning of positional priors
for lane attention. On the OpenLane-V2 dataset, LaneSegNet outperforms previous
counterparts by a substantial gain across three tasks, i.e., map
element detection (+4.8 mAP), centerline perception (+6.9 DET), and the
newly defined one, lane segment perception (+5.6 mAP). Furthermore, it obtains
a real-time inference speed of 14.7 FPS. Code is accessible at
https://github.com/OpenDriveLab/LaneSegNet.
Comment: Accepted at ICLR 2024
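For illustration, a lane segment as described above can be thought of as geometry (a centerline bound by its left and right lanelines) plus topology (links to successor segments). The field names below are assumptions, not the OpenLane-V2 or LaneSegNet schema.

```python
# Illustrative container for a "lane segment": geometry plus topology.
# Field names are hypothetical, not the OpenLane-V2 / LaneSegNet schema.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class LaneSegment:
    centerline: np.ndarray                 # (N, 3) ordered points
    left_laneline: np.ndarray              # (N, 3) binds the centerline
    right_laneline: np.ndarray             # (N, 3)
    successors: list[int] = field(default_factory=list)  # topology edges
```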
Recognizing human activity using RGBD data
Traditional computer vision algorithms try to understand the world using visible-light cameras. However, this type of data source has inherent limitations. First, visible-light images are sensitive to illumination changes and background clutter. Second, the 3D structural information of the scene is lost when the 3D world is projected onto 2D images, and recovering it is a challenging problem. Range sensors, which capture the 3D characteristics of a scene, have existed for over thirty years; however, earlier range sensors were either too expensive, difficult to use in human environments, slow at acquiring data, or provided poor distance estimates. Recently, easy access to RGBD data at real-time frame rates has led to a revolution in perception and inspired much new research. I propose algorithms to detect persons and understand their activities using RGBD data, and I demonstrate that solutions to many computer vision problems may be improved with the added depth channel. The 3D structural information may give rise to algorithms with real-time and view-invariant properties in a faster and easier fashion. When both data sources are available, features extracted from the depth channel may be combined with traditional features computed from the RGB channels to build more robust systems with enhanced recognition abilities that can deal with more challenging scenarios.

As a starting point, the first problem is to find persons of various poses in the scene, whether moving or static. Localizing humans from RGB images is limited by lighting conditions and background clutter; depth images offer alternative ways to find humans in the scene. In the past, detection of humans from range data was usually achieved by tracking, which does not work for indoor person detection. In this thesis, I propose a model-based approach that detects persons using the structural information embedded in the depth image: a 2D head contour model and a 3D head surface model are used to look for the head-shoulder part of the person. A segmentation scheme is then proposed to segment the full human body from the background and extract its contour. I also give a tracking algorithm based on the detection result.

I further investigate recognizing human actions and activities, and propose two features for this task. The first is drawn from the skeletal joint locations estimated from a depth image: a compact representation of the human posture called histograms of 3D joint locations (HOJ3D). This representation is view-invariant, and the whole algorithm runs in real time, so it may benefit applications that need a fast estimate of the posture and action of the human subject. The second is a spatio-temporal feature for depth video called the Depth Cuboid Similarity Feature (DCSF). Interest points are extracted using an algorithm that effectively suppresses noise and finds salient human motions; a DCSF is extracted centered on each interest point, forming the description of the video contents. This descriptor can recognize activities with no dependence on skeleton information or pre-processing steps such as motion segmentation, tracking, or even image de-noising or hole-filling, making it more flexible and widely applicable.
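A simplified sketch of the HOJ3D idea: express the skeletal joints in spherical coordinates about a reference joint and histogram them. The bin counts below are plausible defaults, and the original's soft Gaussian voting and alignment of the azimuth axis with the hip orientation (which provides the view invariance) are omitted for brevity.

```python
# Simplified HOJ3D-style posture descriptor: spherical histogram of joints
# about the hip center. The original adds Gaussian vote weighting and aligns
# azimuth with the hip orientation for view invariance; omitted here.
import numpy as np

def hoj3d(joints, hip, n_azimuth=12, n_inclination=7):
    """joints: (J, 3) 3D joint positions; hip: (3,) reference joint."""
    v = joints - hip
    r = np.linalg.norm(v, axis=1) + 1e-9
    azimuth = np.arctan2(v[:, 1], v[:, 0])                  # [-pi, pi]
    inclination = np.arccos(np.clip(v[:, 2] / r, -1, 1))    # [0, pi]
    hist, _, _ = np.histogram2d(
        azimuth, inclination,
        bins=[n_azimuth, n_inclination],
        range=[[-np.pi, np.pi], [0.0, np.pi]],
    )
    return (hist / hist.sum()).ravel()  # compact posture code
```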
Finally, all the features developed herein are combined to solve a novel problem: first-person human activity recognition using RGBD data. Traditional activity recognition algorithms focus on recognizing activities from a third-person perspective; I propose to recognize activities from a first-person perspective with RGBD data. This task is novel and extremely challenging due to the large amount of camera motion caused either by self-exploration or by the response to interaction. I extract 3D optical flow features as motion descriptors, 3D skeletal joint features as posture descriptors, and spatio-temporal features as local appearance descriptors to describe the first-person videos. To address the ego-motion of the camera, I propose an attention mask that guides the recognition procedure and separates features in the ego-motion region from those in the independent-motion region. The 3D features are very useful for summarizing the discerning information of the activities. In addition, combining the 3D features with existing 2D features yields more robust recognition results and makes the algorithm capable of dealing with more challenging cases.
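A loose sketch of the attention-mask idea: estimate the global (ego) motion from optical flow and separate features according to whether their local motion deviates from it. The median-flow heuristic and threshold are assumptions, not the thesis' method.

```python
# Loose sketch: split feature locations into ego-motion vs. independent-
# motion regions. Median flow as the ego-motion proxy is an assumption.
import numpy as np

def split_by_motion(flow, points, margin=1.5):
    """flow: (H, W, 2) optical flow; points: (K, 2) integer (x, y) locations.
    Returns True where local motion deviates from the global estimate,
    i.e. likely independent motion (the interaction partner)."""
    ego = np.median(flow.reshape(-1, 2), axis=0)     # global motion proxy
    local = flow[points[:, 1], points[:, 0]]         # (K, 2) per-point flow
    return np.linalg.norm(local - ego, axis=1) > margin
```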
Hand tracking using a quadric surface model and Bayesian filtering
In this paper, a technique for model-based 3D hand tracking is presented. A hand model is built from a set of truncated quadrics, approximating the anatomy of a real hand with few parameters. Given that the projection of a quadric onto the image plane is a conic, the model contours can be generated efficiently and used as shape templates to evaluate possible matches in the current frame. The evaluation is done within a hierarchical Bayesian filtering framework, where the posterior distribution is computed efficiently using a tree of templates. We demonstrate the effectiveness of the technique by using it to track 3D articulated and non-rigid hand motion in monocular video sequences against a cluttered background.
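The efficiency claim rests on a standard projective-geometry fact (see, e.g., Hartley and Zisserman): a dual quadric Q* projects to the dual conic C* = P Q* P^T. Below is a minimal sketch of that result, not necessarily the authors' exact formulation.

```python
# Standard result: the image outline of a quadric is a conic, computable
# from the dual quadric as C* = P Q* P^T. Assumes C* is non-degenerate.
import numpy as np

def outline_conic(P, Q_dual):
    """P: (3, 4) camera projection matrix; Q_dual: (4, 4) dual quadric.
    Returns the 3x3 point conic C with x^T C x = 0 on the outline."""
    C_dual = P @ Q_dual @ P.T
    # The point conic is the adjugate of the dual conic (inverse up to scale).
    return np.linalg.inv(C_dual) * np.linalg.det(C_dual)
```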
Estimating 3D hand pose using hierarchical multi-label classification
This paper presents an analysis of the design of classifiers for use in a hierarchical object recognition approach, in which a cascade of classifiers is arranged in a tree in order to recognize multiple object classes. We are interested in the problem of recognizing multiple patterns, as it is closely related to the problem of locating an articulated object: each pattern class corresponds to the hand in a different pose, or set of poses. For this problem, obtaining labelled training data of the hand in a given pose can be problematic. Given a parametric 3D model, generating training data in the form of example images is cheap, and we demonstrate that it can be used to design classifiers almost as good as those trained on non-synthetic data. We compare a variety of template-based classifiers and discuss their merits.
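To illustrate the cascade structure, here is a hedged sketch of a tree of classifiers where internal nodes prune candidates and leaves correspond to pose classes; the node classifiers are placeholders, not the paper's templates.

```python
# Hedged sketch of a tree-structured cascade: each node either rejects a
# sample or passes it to its children; leaves carry pose-class labels.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    accept: Callable[[object], bool]          # placeholder node classifier
    label: Optional[str] = None               # set at leaves (a pose class)
    children: list["Node"] = field(default_factory=list)

def classify(node, x, matches):
    """Depth-first descent; collects every leaf label whose path accepts x."""
    if not node.accept(x):
        return
    if node.label is not None:
        matches.append(node.label)
    for child in node.children:
        classify(child, x, matches)
```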
People detection and tracking using a network of low-cost depth cameras
Automatic people detection is a widely adopted technology with applications in retail, crowd management, and surveillance. The goal of this work is to create a general-purpose framework for detecting people indoors. First, studies on people detection, tracking, and re-identification are reviewed, with an emphasis on people detection from depth images. Then, a network of smart depth cameras developed for this work is presented. Detection performance is evaluated on four image sequences totalling over 20,000 depth images. The results are promising and show that simple, computationally lightweight algorithms are well suited to practical applications.
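As a flavor of the "simple and lightweight" algorithms the work finds effective, here is a hedged sketch of depth-based foreground extraction with connected components; the thresholds are illustrative, not the thesis' values.

```python
# Hedged sketch: lightweight person detection in a depth image by
# background subtraction and connected components. Thresholds illustrative.
import numpy as np
from scipy import ndimage

def detect_people(depth, background, min_diff_m=0.3, min_area_px=800):
    """depth, background: (H, W) depth maps in meters (static background)."""
    foreground = (background - depth) > min_diff_m   # closer than background
    labels, n = ndimage.label(foreground)            # connected components
    boxes = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if ys.size >= min_area_px:                   # drop small blobs/noise
            boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes
```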
Comparison of fusion methods for thermo-visual surveillance tracking
In this paper, we evaluate the appearance-tracking performance of multiple fusion schemes that combine information from standard CCTV and thermal-infrared-spectrum video for tracking surveillance objects such as people, faces, bicycles, and vehicles. We show results on numerous real-world multimodal surveillance sequences, tracking challenging objects whose appearance changes rapidly. Based on these results, we determine the most promising fusion scheme.
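Two representative score-level fusion schemes of the kind such comparisons cover are sketched below; the paper's actual schemes and weights are not reproduced here.

```python
# Hedged sketch of two generic score-level fusion schemes for visible and
# thermal match likelihoods in [0, 1]; the weighting is illustrative.
def fuse(l_visible, l_thermal, scheme="weighted", alpha=0.5):
    if scheme == "weighted":   # convex combination of modality scores
        return alpha * l_visible + (1 - alpha) * l_thermal
    if scheme == "product":    # assumes modalities are independent
        return l_visible * l_thermal
    raise ValueError(f"unknown scheme: {scheme}")
```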
Body Parts Features Based Pedestrian Detection for Active Pedestrian Protection System
A novel vision-based pedestrian detection system for urban traffic situations is presented to help the driver perceive pedestrians ahead of the vehicle. To enhance accuracy and decrease the time consumption of pedestrian detection in such complicated situations, the pedestrian is detected by dividing the body into several parts according to their corresponding features in the image. Candidate pedestrian legs are segmented with the Gentle AdaBoost algorithm by training on optimized histogram-of-gradient features. The candidate pedestrian head is located by matching a head-and-shoulder model above the region of the candidate legs. The candidate legs, head, and shoulders are then combined through part constraints and threshold adjustment to verify the presence of a pedestrian. Finally, experiments were conducted in real urban traffic conditions. Results show that the proposed method achieves a pedestrian detection rate of 92.1% with low time consumption.
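A hedged sketch of the leg-classification step: histogram-of-gradient features fed to a boosted classifier. scikit-learn's AdaBoostClassifier stands in for the paper's Gentle AdaBoost, and the feature parameters are assumptions.

```python
# Hedged sketch of the leg classifier: HOG features + boosting.
# AdaBoostClassifier stands in for the paper's Gentle AdaBoost.
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import AdaBoostClassifier

def hog_features(patches):
    """patches: iterable of equally sized grayscale windows (H, W)."""
    return np.array([
        hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for p in patches
    ])

# Hypothetical usage: train on labelled leg / non-leg windows, then score
# candidate windows from the image (variable names are placeholders).
# clf = AdaBoostClassifier(n_estimators=200).fit(hog_features(train_patches), labels)
# is_leg = clf.predict(hog_features(candidate_patches))
```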