3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose from Monocular Video
Depth and ego-motion estimation are essential for the localization and navigation of autonomous robots and autonomous driving. Recent studies have made it possible to learn per-pixel depth and ego-motion from unlabeled monocular video. A novel unsupervised training framework is proposed with 3D hierarchical refinement and augmentation using explicit 3D geometry. In this framework, the depth and pose estimates are hierarchically and mutually coupled to refine the estimated pose layer by layer. An intermediate-view image is synthesized by warping the pixels of an image with the estimated depth and a coarse pose; the residual pose transformation can then be estimated from this new-view image and the image of the adjacent frame, refining the coarse pose. The iterative refinement is implemented in a differentiable manner, so the whole framework can be optimized uniformly. Meanwhile, a new image augmentation method is proposed for pose estimation: synthesizing a new-view image augments the pose in 3D space while yielding a new augmented 2D image. Experiments on KITTI demonstrate that our depth estimation achieves state-of-the-art performance and even surpasses recent approaches that utilize other auxiliary tasks. Our visual odometry outperforms all recent unsupervised monocular learning-based methods and achieves performance competitive with the geometry-based method ORB-SLAM2 with back-end optimization.
Comment: 10 pages, 7 figures, under review
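The core refinement step, synthesizing a view from estimated depth and a coarse pose and then regressing a residual correction, can be sketched as follows. This is a minimal PyTorch sketch under a pinhole-camera assumption; `backproject`, `synthesize_view`, and `pose_net` are illustrative names, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point from its depth (pinhole camera model)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=depth.device, dtype=depth.dtype),
        torch.arange(w, device=depth.device, dtype=depth.dtype),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    return (K_inv @ pix) * depth.reshape(b, 1, -1)  # camera rays scaled by depth

def synthesize_view(src_img, depth, pose, K, K_inv):
    """Warp src_img into the view given by an SE(3) pose, using per-pixel depth."""
    b, _, h, w = src_img.shape
    pts = backproject(depth, K_inv)                                  # (B, 3, H*W)
    ones = torch.ones(b, 1, pts.shape[-1], device=pts.device, dtype=pts.dtype)
    cam = (pose @ torch.cat([pts, ones], dim=1))[:, :3]              # rigid transform
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                  # perspective divide
    grid = torch.stack(
        [2 * uv[:, 0] / (w - 1) - 1, 2 * uv[:, 1] / (h - 1) - 1], dim=-1
    ).reshape(b, h, w, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

# One refinement step: synthesize an intermediate view with the coarse pose,
# estimate the residual pose against the adjacent frame, then compose.
# `pose_net` is a hypothetical pose-regression network.
#   intermediate = synthesize_view(img_t, depth_t, coarse_pose, K, K_inv)
#   residual = pose_net(intermediate, img_t1)
#   refined_pose = residual @ coarse_pose
```

Because `grid_sample` is differentiable, gradients flow through every refinement layer, which is what allows the whole hierarchy to be optimized uniformly; the same `synthesize_view` call would also serve the 3D augmentation, since perturbing the pose yields a new, geometrically consistent 2D image.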
Joint Motion, Semantic Segmentation, Occlusion, and Depth Estimation
Visual scene understanding is one of the most important components of autonomous navigation. It includes multiple computer vision tasks such as recognizing objects, perceiving their 3D structure, and analyzing their motion, all of which have made remarkable progress in recent years. However, most earlier studies have explored these components individually, so potential benefits from exploiting the relationships between them have been overlooked. In this dissertation, we explore the relationships these tasks exhibit and the benefits that can be gained by formulating multiple tasks jointly. The joint formulation allows each task to exploit the other as an additional input cue and ultimately improves the accuracy of all tasks involved.
We first present the joint estimation of semantic segmentation and optical flow. Though not directly related, the two tasks provide important cues to each other in the temporal domain: semantic information constrains the plausible physical motion of its associated pixels, while accurate pixel-level temporal correspondences enhance the temporal consistency of semantic segmentation. We demonstrate that the joint formulation improves the accuracy of both tasks.
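As a concrete illustration of this coupling, the following minimal PyTorch sketch warps segmentation logits along the estimated optical flow and penalizes temporal inconsistency; the function names and the KL-based loss are illustrative choices, not the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Warp a map from frame t+1 back to frame t along the flow t -> t+1."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack(
        [2 * (xs + flow[:, 0]) / (w - 1) - 1,   # x-coordinates, normalized to [-1, 1]
         2 * (ys + flow[:, 1]) / (h - 1) - 1],  # y-coordinates
        dim=-1,
    )
    return F.grid_sample(feat, grid, align_corners=True)

def temporal_consistency_loss(seg_logits_t, seg_logits_t1, flow_t_to_t1):
    """Penalize disagreement between segmentation at t and warped t+1 logits."""
    warped = warp_with_flow(seg_logits_t1, flow_t_to_t1)
    return F.kl_div(
        F.log_softmax(seg_logits_t, dim=1),   # current prediction
        F.softmax(warped, dim=1),             # flow-aligned target
        reduction="batchmean",
    )
```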
Second, we investigate the mutual relationship between optical flow and occlusion estimation. Unlike most previous methods, which treat occlusions as outliers, we highlight the importance of jointly reasoning about the two tasks during optimization. Specifically, by utilizing forward-backward consistency and occlusion-disocclusion symmetry in the energy function, we demonstrate that the joint formulation brings substantial performance benefits to both tasks on standard benchmarks.
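The forward-backward consistency cue can be sketched as an occlusion test: a pixel in frame t is flagged as occluded when its forward flow and the backward flow sampled at its forward target do not cancel out. The sketch below reuses `warp_with_flow` from the previous example; the thresholding form and constants are common defaults in the optical-flow literature, assumed here rather than taken from the dissertation.

```python
import torch

def occlusion_mask(flow_fwd, flow_bwd, alpha1=0.01, alpha2=0.5):
    """Forward-backward consistency test; True where a pixel is likely occluded.

    flow_fwd: flow t -> t+1, shape (B, 2, H, W); flow_bwd: flow t+1 -> t.
    """
    # Sample the backward flow at each pixel's forward-flow target.
    bwd_at_target = warp_with_flow(flow_bwd, flow_fwd)
    # For a visible pixel, forward and sampled backward flow should cancel.
    fb_sq = ((flow_fwd + bwd_at_target) ** 2).sum(dim=1)
    mag_sq = (flow_fwd ** 2).sum(dim=1) + (bwd_at_target ** 2).sum(dim=1)
    return fb_sq > alpha1 * mag_sq + alpha2
```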
We further demonstrate that optical flow and occlusion estimation can exploit their mutual relationship in convolutional neural networks as well. We propose to iteratively and residually refine the estimates with a single weight-shared network, which substantially improves accuracy without adding network parameters, and with some backbones even reduces them.
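A minimal PyTorch sketch of this weight-sharing idea: one network is applied repeatedly, each pass predicting a residual update, so unrolling more iterations adds no parameters. The module and channel sizes here are toy assumptions, not the dissertation's architecture.

```python
import torch
import torch.nn as nn

class IterativeResidualRefiner(nn.Module):
    """Apply one weight-shared network repeatedly, adding its residual output."""

    def __init__(self, net, num_iters=3):
        super().__init__()
        self.net = net          # shared across all iterations: no extra parameters
        self.num_iters = num_iters

    def forward(self, feats, estimate):
        for _ in range(self.num_iters):
            residual = self.net(torch.cat([feats, estimate], dim=1))
            estimate = estimate + residual   # residual update, same weights each pass
        return estimate

# Toy usage: refine a 2-channel flow field given 64-channel image features.
refiner = IterativeResidualRefiner(nn.Conv2d(64 + 2, 2, 3, padding=1))
flow = refiner(torch.randn(1, 64, 32, 32), torch.zeros(1, 2, 32, 32))
```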
Next, we propose joint depth and 3D scene flow estimation from only two temporally consecutive monocular images. We address this ill-posed problem by taking an inverse-problem view: we design a single convolutional neural network that simultaneously estimates depth and 3D motion from a classical optical-flow cost volume. With self-supervised learning, we leverage unlabeled data for training, avoiding the shortage of 3D annotations for direct supervision.
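The "single network, two outputs" design can be sketched as a small decoder head over a precomputed cost volume. The channel counts and the sigmoid-bounded inverse-depth parameterization are assumptions for illustration; the cost-volume construction and the self-supervised photometric loss are assumed to exist upstream and downstream.

```python
import torch
import torch.nn as nn

class DepthSceneFlowHead(nn.Module):
    """Decode depth and 3D scene flow jointly from an optical-flow cost volume."""

    def __init__(self, cost_channels=81, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(cost_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.depth_head = nn.Conv2d(hidden, 1, 3, padding=1)   # inverse depth
        self.flow3d_head = nn.Conv2d(hidden, 3, 3, padding=1)  # per-pixel 3D motion

    def forward(self, cost_volume):
        x = self.trunk(cost_volume)
        inv_depth = torch.sigmoid(self.depth_head(x))  # bounded (0, 1) inverse depth
        scene_flow = self.flow3d_head(x)
        return inv_depth, scene_flow
```

In self-supervised training, both outputs would be supervised indirectly, for example by reprojecting one frame into the other and minimizing a photometric loss, so no 3D ground truth is needed.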
Finally, we conclude by summarizing the contributions and discussing future directions that could resolve the remaining challenges our approaches face.