Bidirectional Propagation for Cross-Modal 3D Object Detection
Recent works have revealed the superiority of feature-level fusion for
cross-modal 3D object detection, where fine-grained feature propagation from 2D
image pixels to 3D LiDAR points has been widely adopted for performance
improvement. Still, the potential of heterogeneous feature propagation between
2D and 3D domains has not been fully explored. In this paper, in contrast to
existing pixel-to-point feature propagation, we investigate an opposite
point-to-pixel direction, allowing point-wise features to flow inversely into
the 2D image branch. Thus, when jointly optimizing the 2D and 3D streams, the
gradients back-propagated from the 2D image branch can boost the representation
ability of the 3D backbone network working on LiDAR point clouds. Then,
combining pixel-to-point and point-to-pixel information flow mechanisms, we
construct a bidirectional feature propagation framework, dubbed BiProDet. In
addition to the architectural design, we also propose normalized local
coordinates map estimation, a new 2D auxiliary task for the training of the 2D
image branch, which facilitates learning local spatial-aware features from the
image modality and implicitly enhances the overall 3D detection performance.
Extensive experiments and ablation studies validate the effectiveness of our
method. Notably, we rank 1st on the highly competitive KITTI benchmark on the
cyclist class at the time of submission. The source code is available at
https://github.com/Eaphan/BiProDet.
Comment: Accepted by ICLR 2023. Code is available at https://github.com/Eaphan/BiProDet
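The point-to-pixel direction described above can be pictured with a short sketch: per-point features are projected into the image plane and averaged into the 2D feature map, so that image-branch gradients also reach the 3D backbone during joint training. This is a minimal illustration under stated assumptions; the projection handling, shapes, and mean aggregation are not the paper's exact design.

```python
# A minimal sketch of point-to-pixel feature propagation (illustrative, not
# BiProDet's actual implementation). Assumes all points lie in front of the
# camera and a known (3, 4) projection matrix.
import torch

def point_to_pixel(point_feats, points_xyz, proj, img_feats):
    """point_feats: (N, C) LiDAR point features from the 3D branch
    points_xyz:  (N, 3) point coordinates in the camera frame
    proj:        (3, 4) camera projection matrix
    img_feats:   (C, H, W) image-branch feature map to be augmented
    """
    C, H, W = img_feats.shape
    ones = torch.ones(points_xyz.shape[0], 1)
    uvw = torch.cat([points_xyz, ones], dim=1) @ proj.T   # (N, 3) homogeneous pixels
    u = (uvw[:, 0] / uvw[:, 2]).long().clamp(0, W - 1)
    v = (uvw[:, 1] / uvw[:, 2]).long().clamp(0, H - 1)
    flat = v * W + u                                      # (N,) flattened pixel index
    out = torch.zeros(C, H * W)
    cnt = torch.zeros(H * W)
    out.index_add_(1, flat, point_feats.T)                # sum point features per pixel
    cnt.index_add_(0, flat, torch.ones_like(flat, dtype=torch.float))
    out = out / cnt.clamp(min=1)                          # mean over points per pixel
    # Fuse back into the image branch; gradients now flow 2D -> 3D.
    return img_feats + out.view(C, H, W)
```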
GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data
Widely-used LiDAR-based 3D object detectors often neglect fundamental
geometric information readily available from the object proposals in their
confidence estimation. This is mostly due to architectural design choices,
which were often adopted from the 2D image domain, where geometric context is
rarely available. In 3D, however, considering the object properties and its
surroundings in a holistic way is important to distinguish between true and
false positive detections, e.g. occluded pedestrians in a group. To address
this, we present GACE, an intuitive and highly efficient method to improve the
confidence estimation of a given black-box 3D object detector. We aggregate
geometric cues of detections and their spatial relationships, which enables us
to properly assess their plausibility and consequently, improve the confidence
estimation. This leads to consistent performance gains over a variety of
state-of-the-art detectors. Across all evaluated detectors, GACE proves to be
especially beneficial for the vulnerable road user classes, i.e. pedestrians
and cyclists.
Comment: ICCV 2023, code is available at https://github.com/dschinagl/gace
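To make the idea concrete, here is a minimal sketch of geometry-aware rescoring: each detection's geometric cues are pooled with those of its spatial neighbours and mapped to a refined confidence. The cue set, neighbourhood radius, and network sizes are illustrative assumptions, not GACE's actual design.

```python
# A minimal sketch of re-scoring a black-box detector's outputs from geometric
# cues of each detection and its neighbours (illustrative assumptions only).
import torch
import torch.nn as nn

class GeometricRescorer(nn.Module):
    def __init__(self, d_in=8, d_hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1), nn.Sigmoid(),
        )

    def forward(self, boxes):
        # boxes: (N, d_in) per-detection cues, e.g. centre, size, heading,
        # original score. Neighbour context = mean of cues within a radius.
        dist = torch.cdist(boxes[:, :3], boxes[:, :3])    # pairwise centre distances
        mask = (dist < 5.0).float()                       # 5 m neighbourhood (assumed)
        ctx = (mask @ boxes) / mask.sum(1, keepdim=True)  # mean neighbour cues
        return self.mlp(torch.cat([boxes, ctx], dim=1))   # refined confidence in [0, 1]
```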
Robust and Efficient Inference of Scene and Object Motion in Multi-Camera Systems
Multi-camera systems have the ability to overcome some of the fundamental limitations of single-camera systems. Having multiple viewpoints of a scene goes a long way in limiting the influence of an individual camera's restricted field of view, occlusion, blur, and poor resolution. This dissertation addresses robust and efficient inference of scene and object motion in multi-camera and multi-sensor systems.
The first part of the dissertation discusses the role of constraints introduced by projective imaging towards robust inference of multi-camera/sensor-based object motion. We discuss the role of the homography and epipolar constraints for fusing object motion perceived by individual cameras. For planar scenes, the homography constraint provides a natural mechanism for data association. For scenes that are not planar, the epipolar constraint provides a weaker multi-view relationship. We use the epipolar constraint for tracking in multi-camera and multi-sensor networks. In particular, we show that the epipolar constraint reduces the dimensionality of the state space of the problem by introducing a "shared" state space for the joint tracking problem. This allows for robust tracking even when one of the sensors fails due to poor SNR or occlusion.
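As a concrete instance of how the constraint couples the cameras: a point observed in one view restricts its correspondence in the other view to a line, and the distance from that line can gate data association during tracking. A minimal sketch, assuming a calibrated fundamental matrix F:

```python
# Epipolar-constraint helpers for two calibrated cameras. F maps a pixel in
# camera 1 to its epipolar line in camera 2; names here are illustrative.
import numpy as np

def epipolar_line(F, u1):
    """Line l2 = F u1 in camera 2 for pixel u1 = (x, y) in camera 1."""
    l2 = F @ np.array([u1[0], u1[1], 1.0])
    return l2 / np.linalg.norm(l2[:2])   # normalise so |(a, b)| = 1

def epipolar_residual(F, u1, u2):
    """Signed distance of u2 from the epipolar line of u1; zero for a
    perfect correspondence, so it can gate cross-camera data association."""
    a, b, c = epipolar_line(F, u1)
    return a * u2[0] + b * u2[1] + c
```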
The second part of the dissertation deals with challenges in the computational aspects of tracking algorithms that are common to such systems. Much of the inference in multi-camera and multi-sensor networks deals with complex non-linear models corrupted by non-Gaussian noise. Particle filters provide approximate Bayesian inference in such settings. We analyze the computational drawbacks of traditional particle filtering algorithms, and present a method for implementing the particle filter using the Independent Metropolis-Hastings sampler, which is highly amenable to pipelined implementations and parallelization. We analyze implementations of the proposed algorithm, and in particular concentrate on those with minimum processing times.
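A minimal sketch of the idea: replacing the resampling stage with an Independent Metropolis-Hastings accept/reject step makes each particle's update depend only on its own weight and an independently drawn candidate, which is what makes the filter easy to pipeline and parallelize. The model functions below are illustrative placeholders, not the dissertation's exact algorithm.

```python
# One filtering step using an Independent Metropolis-Hastings (IMH)
# accept/reject pass in place of resampling (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def imh_particle_step(particles, propose, likelihood):
    """particles:  (N, d) current state samples
    propose(n):    draw n states from an independent proposal q(x)
    likelihood(x): unnormalised IMH weight p(z | x) p(x) / q(x), elementwise
    """
    candidates = propose(len(particles))   # independent proposals
    w_old = likelihood(particles)
    w_new = likelihood(candidates)
    # IMH acceptance ratio min(1, w_new / w_old); each particle is accepted
    # independently, so there is no cross-particle data dependence.
    accept = rng.random(len(particles)) < np.minimum(1.0, w_new / w_old)
    particles[accept] = candidates[accept]
    return particles
```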
The last part of the dissertation deals with the efficient sensing paradigm of compressive sensing (CS) applied to signals in imaging, such as natural images and reflectance fields. We propose a hybrid signal model based on the assumption that most real-world signals exhibit subspace compressibility as well as sparse representations. We show that several real-world visual signals, such as images, reflectance fields, and videos, are better approximated by this hybrid of the two models. We derive optimal hybrid linear projections of the signal and show that theoretical guarantees and algorithms designed for CS can be easily extended to hybrid subspace-compressive sensing. Such methods reduce the amount of information sensed by a camera, and help mitigate the so-called data deluge problem in large multi-camera systems.
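A minimal sketch of the hybrid model under stated assumptions: the signal is a low-dimensional subspace component plus a sparse residual, so the subspace part can be measured with a few exact projections while random CS projections handle the residual. The split and dimensions below are illustrative, not the dissertation's derivation.

```python
# Hybrid signal model x = U a + s: U spans a low-dimensional subspace
# capturing most energy, s is a sparse residual (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 256, 16, 80                               # signal dim, subspace dim, CS measurements

U, _ = np.linalg.qr(rng.standard_normal((n, k)))    # orthonormal subspace basis
a = rng.standard_normal(k)                          # subspace coefficients
s = np.zeros(n)
s[rng.choice(n, 5, replace=False)] = 1.0            # sparse residual
x = U @ a + s                                       # hybrid signal

# Hybrid projections: measure the subspace component exactly with U^T, then
# take random CS projections of what is left outside the subspace.
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y_sub = U.T @ x                                     # k exact subspace measurements
y_cs = Phi @ (x - U @ y_sub)                        # m random measurements of the residual
print(y_sub.shape, y_cs.shape)                      # (16,) (80,)
```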
3D Object Reconstruction from Imperfect Depth Data Using Extended YOLOv3 Network
State-of-the-art intelligent versatile applications call for the use of full 3D, depth-based streams, especially in scenarios of intelligent remote control and communications, where virtual and augmented reality will soon become outdated and are forecast to be replaced by point cloud streams providing explorable 3D environments of communication and industrial data. One of the most novel approaches employed in modern object reconstruction methods is to use a priori knowledge of the objects being reconstructed. Our approach is different, as we strive to reconstruct a 3D object under the much more difficult scenario of limited data availability. The data stream is often limited by insufficient depth camera coverage and, as a result, objects are occluded and data is lost. Our proposed hybrid artificial neural network modifications have improved the reconstruction results by 8.53%, which allows for much more precise filling of occluded object sides and reduction of noise during the process. Furthermore, the addition of object segmentation masks and individual object instance classification is a leap forward towards general-purpose scene reconstruction, as opposed to a single-object reconstruction task, due to the ability to mask out overlapping object instances and use only the masked object area in the reconstruction process.
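A minimal sketch of the masking step mentioned above, assuming a per-pixel instance-id map: only the selected instance's depth pixels are kept, so overlapping objects do not contaminate each other's reconstruction input. The names and zero fill value are illustrative.

```python
# Keep only one instance's depth pixels before reconstruction (sketch).
import numpy as np

def masked_object_depth(depth, instance_masks, instance_id):
    """depth: (H, W) depth image; instance_masks: (H, W) integer id map."""
    out = np.zeros_like(depth)          # zero = "no measurement" (assumed)
    keep = instance_masks == instance_id
    out[keep] = depth[keep]             # depth of the selected instance only
    return out                          # fed to the reconstruction network
```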
Lidar-based Gait Analysis and Activity Recognition in a 4D Surveillance System
This paper presents new approaches for gait and activity analysis based on the data streams of a Rotating Multi-Beam (RMB) Lidar sensor. The proposed algorithms are embedded into an integrated 4D vision and visualization system, which is able to analyze and interactively display real scenarios in natural outdoor environments with walking pedestrians. The main focus of the investigations is gait-based person re-identification during tracking, and the recognition of specific activity patterns such as bending, waving, making phone calls, and checking the time on a wristwatch.
The descriptors for training and recognition are observed and extracted from realistic outdoor surveillance scenarios, where multiple pedestrians walk in the field of interest along possibly intersecting trajectories, so the observations may often be affected by occlusions or background noise.
Since there is no public database available for such scenarios, we created and published a new Lidar-based outdoor gait and activity dataset on our website, which contains point cloud sequences of 28 different persons extracted and aggregated from 35-minute-long measurements.
The presented results confirm that both efficient gait-based identification and activity recognition are achievable in the sparse point clouds of a single RMB Lidar sensor. After extracting the people trajectories, we synthesized a free-viewpoint video, where moving avatar models follow the trajectories of the observed pedestrians in real time, ensuring that the leg movements of the animated avatars are synchronized with the real gait cycles observed in the Lidar stream.
CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception
Autonomous driving requires an accurate and fast 3D perception system that
includes 3D object detection, tracking, and segmentation. Although recent
low-cost camera-based approaches have shown promising results, they are
susceptible to poor illumination or bad weather conditions and have a large
localization error. Hence, fusing camera with low-cost radar, which provides
precise long-range measurement and operates reliably in all environments, is
promising but has not yet been thoroughly investigated. In this paper, we
propose Camera Radar Net (CRN), a novel camera-radar fusion framework that
generates a semantically rich and spatially accurate bird's-eye-view (BEV)
feature map for various tasks. To overcome the lack of spatial information in
an image, we transform perspective view image features to BEV with the help of
sparse but accurate radar points. We further aggregate image and radar feature
maps in BEV using multi-modal deformable attention designed to tackle the
spatial misalignment between inputs. In its real-time setting, CRN runs at 20
FPS while achieving performance comparable to LiDAR detectors on nuScenes, and
even outperforms them at far range in the 100 m setting. Moreover, in its
offline setting, CRN yields 62.4% NDS and 57.5% mAP on the nuScenes test set
and ranks first among all camera and camera-radar 3D object detectors.
Comment: IEEE/CVF International Conference on Computer Vision (ICCV'23)
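A minimal sketch of what radar-assisted view transformation could look like under the abstract's description: image features are lifted onto a depth axis only where sparse radar returns support occupancy, before pooling into BEV. The shapes and hard gating below are assumptions for illustration, not CRN's actual architecture.

```python
# Radar-gated lifting of perspective image features to a depth axis (sketch).
import torch

def lift_with_radar(img_feats, depth_bins, radar_occ):
    """img_feats: (C, H, W) perspective-view image features
    depth_bins:  (D, H, W) per-pixel depth distribution from the camera
    radar_occ:   (D, H, W) binary occupancy along each ray from radar points
    returns      (C, D, H, W) features placed at radar-confirmed depths
    """
    gated = depth_bins * radar_occ                        # keep radar-supported depths
    gated = gated / gated.sum(0, keepdim=True).clamp(min=1e-6)
    # Outer product over the depth axis: each pixel's feature is spread
    # across depth according to the gated distribution, then can be
    # splatted into the BEV grid downstream.
    return img_feats.unsqueeze(1) * gated.unsqueeze(0)
```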