Stereo R-CNN based 3D Object Detection for Autonomous Driving
We propose a 3D object detection method for autonomous driving by fully
exploiting the sparse and dense, semantic and geometry information in stereo
imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo
inputs to simultaneously detect and associate objects in the left and right
images. We add extra branches after the stereo Region Proposal Network (RPN)
to predict sparse keypoints, viewpoints, and object dimensions, which are
combined with the 2D left-right boxes to calculate a coarse 3D object bounding
box. We then recover the accurate 3D bounding box by region-based photometric
alignment using the left and right RoIs. Our method requires neither depth
input nor 3D position supervision, yet it outperforms all existing fully
supervised image-based
methods. Experiments on the challenging KITTI dataset show that our method
outperforms the state-of-the-art stereo-based method by around 30% AP on both
3D detection and 3D localization tasks. Code has been released at
https://github.com/HKUST-Aerial-Robotics/Stereo-RCNN. Comment: Accepted by CVPR 2019.
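
The refinement step above is essentially a one-dimensional photometric search over depth. Below is a minimal sketch of that idea, assuming rectified stereo with known focal length and baseline; the function and parameter names are illustrative, not taken from the paper's released code.

```python
# Hedged sketch: scan candidate depths, shift the left RoI by the disparity
# each depth implies, and keep the depth that minimizes the photometric error.
import numpy as np

def align_depth(left_roi, right_roi, focal, baseline,
                z_min=1.0, z_max=80.0, steps=400):
    """left_roi/right_roi: (H, W) grayscale patches cropped at the same image
    columns in the rectified left/right views. Returns the best-fit depth."""
    best_z, best_err = None, np.inf
    for z in np.linspace(z_min, z_max, steps):
        d = int(round(focal * baseline / z))          # disparity in pixels
        if d <= 0 or d >= left_roi.shape[1]:
            continue
        left = left_roi[:, d:].astype(np.float64)     # left column j+d ...
        right = right_roi[:, :-d].astype(np.float64)  # ... matches right column j
        err = np.mean((left - right) ** 2)            # photometric (SSD) error
        if err < best_err:
            best_z, best_err = z, err
    return best_z
```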
Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints
We propose Shift R-CNN, a hybrid model for monocular 3D object detection,
which combines deep learning with the power of geometry. We adapt a Faster
R-CNN network for regressing initial 2D and 3D object properties and combine it
with a least-squares solution for the inverse 2D-to-3D geometric mapping
problem, using the camera projection matrix. The closed-form solution of the
mathematical system, along with the initial output of the adapted Faster R-CNN
are then passed through a final ShiftNet network that refines the result using
our newly proposed Volume Displacement Loss. Our novel, geometrically
constrained deep learning approach to monocular 3D object detection obtains top
results on the KITTI 3D Object Detection Benchmark, being the best among all
monocular methods that do not use any pre-trained network for depth estimation. Comment: v1: Accepted to be published in the 2019 IEEE International Conference on Image Processing, Sep 22-25, 2019, Taipei. IEEE copyright notice added. Minor changes for camera-ready version. (Updated May 15, 2019.)
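
The closed-form step above can be posed as an overdetermined linear system. Here is a hedged sketch of one way to set it up: each constraint that ties a rotated 3D box corner to a 2D box edge via the projection matrix is linear in the unknown translation. The interface and the assumed corner-to-edge assignment are illustrative, not the paper's actual code.

```python
# Hedged sketch: recover the 3D translation T by linear least squares from
# the camera projection matrix and 2D box-edge constraints.
import numpy as np

def solve_translation(P, corners_obj, uv_edges):
    """P: (3, 4) camera projection matrix.
    corners_obj: (k, 3) rotated box-corner offsets (dimensions/yaw applied).
    uv_edges: k constraints (coord_value, axis) with axis 0 for u, 1 for v.
    Each constraint coord = (P @ [c+T, 1])_axis / (P @ [c+T, 1])_2 is linear
    in T after cross-multiplying, giving A @ T = b."""
    A, b = [], []
    for c, (val, axis) in zip(corners_obj, uv_edges):
        A.append(P[axis, :3] - val * P[2, :3])                  # coeffs of T
        b.append(val * (P[2, :3] @ c + P[2, 3])
                 - (P[axis, :3] @ c + P[axis, 3]))
    T, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return T  # 3D translation of the box center
```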
The ApolloScape Open Dataset for Autonomous Driving and its Application
Autonomous driving has attracted tremendous attention especially in the past
few years. The key techniques for a self-driving car include solving tasks like
3D map construction, self-localization, parsing the driving road and
understanding objects, which enable vehicles to reason and act. However,
large-scale datasets for training and system evaluation remain a bottleneck
for developing robust perception models. In this paper, we present the
ApolloScape dataset [1] and its applications for autonomous driving. Compared
with existing public datasets from real scenes, e.g. KITTI [2] or Cityscapes
[3], ApolloScape contains much larger and richer labelling, including a
holistic semantic dense point cloud for each site, stereo imagery, per-pixel
semantic labelling, lane-mark labelling, instance segmentation, 3D car
instances, and highly accurate locations for every frame in various driving
videos from multiple sites, cities, and times of day. For each task, it
contains at least 15x more images than state-of-the-art datasets. To label
such a complete dataset, we develop various tools and algorithms tailored to
each task to accelerate the labelling process, such as 3D-2D segment
labelling tools, active labelling in videos, etc. Building on ApolloScape, we
are able to develop algorithms that jointly consider the learning
and inference of multiple tasks. In this paper, we provide a sensor fusion
scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and
a 3D semantic map in order to achieve robust self-localization and semantic
segmentation for autonomous driving. We show that, in practice, sensor fusion
and joint learning of multiple tasks are beneficial for achieving a more
robust and accurate system. We expect that our dataset and the proposed
algorithms will support and motivate researchers toward further development of
multi-sensor fusion and multi-task learning in the field of computer vision. Comment: Version 4: Accepted by TPAMI. Version 3: 17 pages, 10 tables, 11
figures; added the application (DeLS-3D) based on the ApolloScape dataset.
Version 2: 7 pages, 6 figures; added comparison with the BDD100K dataset.
SS3D: Single Shot 3D Object Detector
Single-stage deep learning algorithms for 2D object detection were made
popular by the Single Shot MultiBox Detector (SSD) and have been heavily
adopted in several embedded applications. PointPillars is a state-of-the-art
3D object detection algorithm that uses a Single Shot Detector adapted for 3D
object detection. The main downside of PointPillars is its two-stage approach:
a learned input representation based on fully connected layers, followed by
the Single Shot Detector for 3D detection. In this paper we present Single
Shot 3D Object Detection (SS3D), a single-stage 3D object detection algorithm
which combines a straightforward, statistically computed input representation
with a Single Shot Detector (based on PointPillars). Computing the input
representation is straightforward, does not involve learning, and has little
computational cost. We also extend our method to stereo input and show that,
aided by additional semantic segmentation input, our method achieves accuracy
similar to state-of-the-art stereo-based detectors. Achieving the accuracy of
two-stage detectors with a single-stage approach is important, as single-stage
approaches are simpler to implement in embedded, real-time applications. With
LiDAR as well as stereo input, our method outperforms PointPillars. When using
LiDAR input, our input representation improves the AP3D of the Car class in
the moderate category from 74.99 to 76.84. When using stereo input, our input
representation improves the AP3D of the Car class in the moderate category
from 38.13 to 45.13. Our results are also better than those of other popular
3D object detectors such as AVOD and F-PointNet.
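
To make the contrast with PointPillars' learned encoder concrete, here is a rough sketch of what a statistically computed (non-learned) pillar representation could look like: bin points into a BEV grid and fill each cell with simple statistics. The grid sizes and feature choices are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: per-pillar statistical features instead of a learned encoder.
import numpy as np

def pillar_statistics(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                      cell=0.16, grid=(440, 500)):
    """points: (N, 4) array of x, y, z, intensity. Returns (440, 500, 4)
    BEV features: point count, mean z, max z, mean intensity per pillar."""
    feats = np.zeros(grid + (4,), dtype=np.float32)
    feats[..., 2] = -np.inf                              # max-z channel
    xi = ((points[:, 0] - x_range[0]) / cell).astype(int)
    yi = ((points[:, 1] - y_range[0]) / cell).astype(int)
    ok = (xi >= 0) & (xi < grid[0]) & (yi >= 0) & (yi < grid[1])
    for gx, gy, pt in zip(xi[ok], yi[ok], points[ok]):
        feats[gx, gy, 0] += 1                            # point count
        feats[gx, gy, 1] += pt[2]                        # running sum of z
        feats[gx, gy, 2] = max(feats[gx, gy, 2], pt[2])  # max height
        feats[gx, gy, 3] += pt[3]                        # running intensity sum
    nz = feats[..., 0] > 0
    feats[nz, 1] /= feats[nz, 0]                         # mean z
    feats[nz, 3] /= feats[nz, 0]                         # mean intensity
    feats[~nz, 2] = 0.0                                  # empty pillars -> 0
    return feats
```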
Real-time 3D Traffic Cone Detection for Autonomous Driving
Considerable progress has been made in semantic scene understanding of road
scenes with monocular cameras. It is, however, mainly related to certain
classes such as cars and pedestrians. This work investigates traffic cones, an
object class crucial for traffic control in the context of autonomous vehicles.
3D object detection using images from a monocular camera is intrinsically an
ill-posed problem. In this work, we leverage the unique structure of traffic
cones and propose a pipelined approach to the problem. Specifically, we first
detect cones in images by a tailored 2D object detector; then, the spatial
arrangement of keypoints on a traffic cone is detected by our deep structural
regression network, where the fact that the cross-ratio is projection invariant
is leveraged for network regularization; finally, the 3D position of cones is
recovered by the classical Perspective n-Point algorithm. Extensive experiments
show that our approach can accurately detect traffic cones and estimate their
position in the 3D world in real time. The proposed method is also deployed on
a real-time, critical system. It runs efficiently on the low-power Jetson TX2,
providing accurate 3D position estimates, allowing a race-car to map and drive
autonomously on an unseen track indicated by traffic cones. With the help of
robust and accurate perception, our race-car won both Formula Student
Competitions held in Italy and Germany in 2018, cruising at a top speed of 54
km/h. A visualization of the complete pipeline, mapping, and navigation can be
found on our project page. Comment: IEEE Intelligent Vehicles Symposium (IV'19). arXiv admin note: text
overlap with arXiv:1809.1054
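
The final stage described above maps detected 2D keypoints back to a 3D cone position with the classical Perspective-n-Point algorithm. A minimal sketch follows using OpenCV's solvePnP; the seven keypoints and cone dimensions are illustrative assumptions, not the paper's exact model.

```python
# Hedged sketch: recover a cone's 3D position from 2D keypoints via PnP.
import cv2
import numpy as np

# 3D keypoint positions on a canonical cone, in meters (cone-local frame);
# an assumed layout, not the paper's actual keypoint definition.
CONE_MODEL = np.array([
    [0.0, 0.0, 0.325],                        # tip
    [-0.05, 0.0, 0.20], [0.05, 0.0, 0.20],    # mid band, left/right
    [-0.09, 0.0, 0.10], [0.09, 0.0, 0.10],    # lower band, left/right
    [-0.114, 0.0, 0.0], [0.114, 0.0, 0.0],    # base corners
], dtype=np.float64)

def cone_position(keypoints_2d, K, dist_coeffs=None):
    """keypoints_2d: (7, 2) pixel coords from the regression network;
    K: 3x3 camera intrinsics. Returns the cone's 3D position (tvec)."""
    ok, rvec, tvec = cv2.solvePnP(
        CONE_MODEL, keypoints_2d.astype(np.float64), K, dist_coeffs)
    return tvec.ravel() if ok else None
```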
False Positive Removal for 3D Vehicle Detection with Penetrated Point Classifier
Recently, researchers have been leveraging LiDAR point cloud for higher
accuracy in 3D vehicle detection. Most state-of-the-art methods are deep
learning based, but are easily affected by the number of points generated on
the object. This vulnerability leads to numerous false positive boxes at high
recall positions, where objects are occasionally predicted with few points. To
address the issue, we introduce the Penetrated Point Classifier (PPC), based
on the underlying property of LiDAR that points cannot be generated behind
vehicles. It determines whether any point exists behind the vehicle of a
predicted box, and if one does, the box is flagged as a false positive. Our
straightforward yet unprecedented approach is evaluated on the KITTI dataset
and improves the performance of PointRCNN, one of the state-of-the-art
methods. The experimental results show that precision at the highest recall
position increases dramatically, by 15.46 and 14.63 percentage points on the
moderate and hard difficulties of the car class, respectively. Comment: Accepted by ICIP 2020.
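
The intuition can be captured with a simple geometric test: a LiDAR cannot see through a solid vehicle, so if returns exist behind a predicted box along rays that pass through it, the box is suspect. The sketch below is a simplified, axis-aligned BEV stand-in for the paper's classifier, with illustrative names and thresholds.

```python
# Hedged sketch: flag boxes that have LiDAR returns "behind" them.
import numpy as np

def ray_hits_box_2d(p, center, half_w, half_l):
    """Does the sensor ray from the origin to BEV point p pass through the
    axis-aligned BEV box? (simplified sampling test)"""
    n = int(np.linalg.norm(p) / 0.1) + 1
    ts = np.linspace(0.0, 1.0, n)
    pts = ts[:, None] * p[None, :]              # samples along the ray
    inside = (np.abs(pts[:, 0] - center[0]) < half_w) & \
             (np.abs(pts[:, 1] - center[1]) < half_l)
    return inside.any()

def is_penetrated(box, points):
    """box: (cx, cy, w, l) in BEV; points: (N, 2) BEV LiDAR returns.
    True if any return lies beyond the box along a ray through it."""
    cx, cy, w, l = box
    center = np.array([cx, cy])
    box_range = np.linalg.norm(center)
    for p in points:
        if np.linalg.norm(p) > box_range + max(w, l) and \
           ray_hits_box_2d(p, center, w / 2, l / 2):
            return True                         # point behind the vehicle
    return False
```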
AMZ Driverless: The Full Autonomous Racing System
This paper presents the algorithms and system architecture of an autonomous
racecar. The introduced vehicle is powered by a software stack designed for
robustness, reliability, and extensibility. In order to autonomously race
around a previously unknown track, the proposed solution combines
state-of-the-art techniques from different fields of robotics. Specifically,
perception,
estimation, and control are incorporated into one high-performance autonomous
racecar. This complex robotic system, developed by AMZ Driverless and ETH
Zurich, finished 1st overall at each competition we attended: Formula Student
Germany 2017, Formula Student Italy 2018 and Formula Student Germany 2018. We
discuss the findings and lessons learned from these competitions and present
an experimental evaluation of each module of our solution. Comment: 40 pages, 32 figures, submitted to the Journal of Field Robotics.
Accurate Monocular Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
In this paper, we propose a monocular 3D object detection framework in the
domain of autonomous driving. Unlike previous image-based methods, which focus
on RGB features extracted from 2D images, our method solves this problem in
the reconstructed 3D space in order to exploit 3D contexts explicitly. To this
end, we first leverage a stand-alone module to transform the input data from
the 2D image plane to the 3D point-cloud space for a better input
representation; we then perform 3D detection using a PointNet backbone to
obtain objects' 3D locations, dimensions, and orientations. To enhance the
discriminative capability of the point clouds, we propose a multi-modal
feature fusion module to embed the complementary RGB cue into the generated
point-cloud representation. We argue that it is more effective to infer 3D
bounding boxes from the generated 3D scene space (i.e., X, Y, Z space) than
from the image plane
(i.e., R,G,B image plane). Evaluation on the challenging KITTI dataset shows
that our approach boosts the performance of the state-of-the-art monocular
approach by a large margin. Comment: To appear in ICCV'19.
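
The 2D-to-3D transformation step described above amounts to back-projecting a per-pixel depth estimate through the camera intrinsics and attaching each pixel's RGB value as the complementary cue. A minimal sketch, assuming the depth map comes from an off-the-shelf monocular depth estimator; names are illustrative.

```python
# Hedged sketch: lift an RGB image plus estimated depth to a colored
# point cloud via pinhole back-projection.
import numpy as np

def image_to_point_cloud(depth, rgb, K):
    """depth: (H, W) depth in meters; rgb: (H, W, 3); K: 3x3 intrinsics.
    Returns (H*W, 6) array of X, Y, Z, R, G, B."""
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx          # pinhole back-projection
    y = (v - cy) * depth / fy
    xyz = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return np.concatenate([xyz, rgb.reshape(-1, 3)], axis=1)
```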
Fusing Bird View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection
We propose a new method for fusing a LIDAR point cloud and camera-captured
images in a deep convolutional neural network (CNN). The proposed method
constructs a new layer, called the non-homogeneous pooling layer, to transform
features between bird view map and front view map. The sparse LIDAR point cloud
is used to construct the mapping between the two maps. The pooling layer allows
efficient fusion of the bird view and front view features at any stage of the
network. This is favorable for 3D object detection using camera-LIDAR
fusion in autonomous driving scenarios. A corresponding deep CNN is designed
and tested on the KITTI bird view object detection dataset, which produces 3D
bounding boxes from the bird view map. The fusion method shows particular
benefit for detection of pedestrians in the bird view compared to other
fusion-based object detection networks. Comment: 10 pages, 6 figures, 3 tables.
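
The mapping idea is that each LiDAR point projects to one cell in the bird-view grid and one pixel in the front-view image, so point correspondences can carry features from one view into the other. The sketch below is a simplified NumPy stand-in for the CNN layer described above, with illustrative names, not the paper's implementation.

```python
# Hedged sketch: pool front-view features into the bird-view grid via the
# LiDAR point correspondences between the two views.
import numpy as np

def pool_front_to_bird(fv_feats, fv_uv, bev_ij, bev_shape):
    """fv_feats: (H, W, C) front-view features; fv_uv: (N, 2) integer pixel
    coords of N LiDAR points; bev_ij: (N, 2) integer bird-view cells of the
    same points. Returns (Hb, Wb, C) bird-view features, mean-pooled per cell."""
    C = fv_feats.shape[-1]
    out = np.zeros(bev_shape + (C,), dtype=np.float32)
    cnt = np.zeros(bev_shape, dtype=np.int32)
    for (u, v), (i, j) in zip(fv_uv, bev_ij):
        out[i, j] += fv_feats[v, u]   # gather the feature at the FV pixel
        cnt[i, j] += 1
    nz = cnt > 0
    out[nz] /= cnt[nz][:, None]       # average points landing in one cell
    return out
```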
Stereo Vision Based Single-Shot 6D Object Pose Estimation for Bin-Picking by a Robot Manipulator
We propose a fast and accurate method of 6D object pose estimation for
bin-picking of mechanical parts by a robot manipulator. We extend the
single-shot approach to stereo vision by applying an attention architecture.
Our convolutional neural network model regresses object locations and
rotations from either a left image or a right image without depth information.
Then, a stereo feature matching module, designated as Stereo Grid Attention,
generates stereo grid matching maps. The key point of our method is that it
calculates the disparity only for the objects found by the attention in the
stereo images, instead of computing a point cloud over the entire image. The
disparity value is then used to calculate the depth to the objects by the
principle of triangulation. Thanks to its single-shot architecture, our method
also achieves rapid pose estimation: it can process a 1024 x 1024 pixel image
in 75 milliseconds on the Jetson AGX Xavier with a half-float model. Weakly
textured mechanical parts are used to exemplify the method. First, we create
original synthetic datasets for training and evaluating the proposed model.
This dataset is created by capturing and
rendering numerous 3D models of several types of mechanical parts in virtual
space. Finally, we use a robotic manipulator with an electromagnetic gripper to
pick up the mechanical parts in a cluttered state to verify the validity of our
method in an actual scene. When the proposed method uses a raw stereo image
from our stereo camera to detect black steel screws, stainless screws, and DC
motor parts (i.e., cases, rotor cores, and commutator caps), the bin-picking
tasks succeed with 76.3%, 64.0%, 50.5%, 89.1%, and 64.2%
probability, respectively. Comment: 7 pages, 8 figures.
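
The triangulation step described above follows from similar triangles: once the attention module yields a per-object disparity between the rectified views, depth is focal length times baseline over disparity. A minimal sketch, assuming equal focal lengths in x and y; the camera constants in the usage line are illustrative.

```python
# Hedged sketch: depth and 3D position from a per-object stereo disparity.
import numpy as np

def disparity_to_xyz(u, v, disparity, focal, baseline, cx, cy):
    """u, v: object pixel in the left image; disparity: pixel offset between
    the left and right detections. Returns the object's 3D position (meters)."""
    z = focal * baseline / disparity   # depth by stereo triangulation
    x = (u - cx) * z / focal           # assumes fx == fy == focal
    y = (v - cy) * z / focal
    return np.array([x, y, z])

# e.g., a part detected at (512, 600) with 24 px disparity:
# disparity_to_xyz(512, 600, 24.0, focal=1400.0, baseline=0.08, cx=512, cy=512)
```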