We propose a 3D object detection method for autonomous driving by fully
exploiting the sparse and dense, semantic and geometric information in stereo
imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo
inputs to simultaneously detect and associate objects in left and right images.
We add extra branches after the stereo Region Proposal Network (RPN) to predict
sparse keypoints, viewpoints, and object dimensions, which are combined with 2D
left-right boxes to calculate a coarse 3D object bounding box. We then recover
the accurate 3D bounding box by a region-based photometric alignment using left
and right RoIs. Our method requires neither depth input nor 3D position
supervision, yet it outperforms all existing fully supervised image-based
methods. Experiments on the challenging KITTI dataset show that our method
outperforms the state-of-the-art stereo-based method by around 30% AP on both
3D detection and 3D localization tasks. Code has been released at
https://github.com/HKUST-Aerial-Robotics/Stereo-RCNN.

Comment: Accepted by CVPR 2019.
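The coarse 3D box recovery described above ultimately rests on standard stereo triangulation: the horizontal offset (disparity) between corresponding points in the left and right boxes determines depth. The sketch below illustrates only this underlying geometry, not Stereo R-CNN's actual estimation pipeline; the function name and the camera parameters (focal length, baseline) are hypothetical, chosen to be KITTI-like.

```python
# Illustrative stereo geometry: depth from left/right horizontal coordinates.
# This is a textbook triangulation sketch, NOT the paper's method; the
# function name and camera parameters here are assumptions for illustration.

def depth_from_disparity(u_left: float, u_right: float,
                         focal_px: float, baseline_m: float) -> float:
    """Depth z = f * b / d, where d = u_left - u_right is the disparity
    (in pixels) between corresponding points in a rectified stereo pair."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity

# Example with KITTI-like parameters (f ~ 721 px, baseline ~ 0.54 m):
# a 20-pixel disparity corresponds to a depth of about 19.5 m.
z = depth_from_disparity(u_left=620.0, u_right=600.0,
                         focal_px=721.0, baseline_m=0.54)
```

Because depth is inversely proportional to disparity, small disparity errors cause large depth errors for distant objects; this is why the paper refines the coarse box with dense photometric alignment over the left and right RoIs.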