Technical Report: Co-learning of geometry and semantics for online 3D mapping
This paper is a technical report on our submission to the ECCV 2018 3DRMS Workshop Challenge on Semantic 3D Reconstruction \cite{Tylecek2018rms}. We address 3D semantic reconstruction for autonomous navigation using co-learning of depth map refinement and semantic segmentation. The core of our pipeline is a deep multi-task neural network that jointly refines the depth map and produces accurate semantic segmentation maps. Its inputs are an image and a raw depth map produced from a pair of images by standard stereo vision. The resulting semantic 3D point clouds are then merged to create a consistent 3D mesh, in turn used to produce dense semantic 3D reconstruction maps. The performance of each step of the proposed method is evaluated on the dataset and multiple tasks of the 3DRMS Challenge, consistently surpassing state-of-the-art approaches.
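As a rough illustration of the kind of multi-task architecture this abstract describes (a shared encoder over the image plus raw stereo depth, with separate heads for refined depth and semantic segmentation), the PyTorch sketch below is a minimal assumption-based example; the layer sizes, class count, and losses are illustrative and not the authors' exact network.

```python
# Minimal sketch of a two-head (depth + semantics) co-learning network.
# Hypothetical architecture for illustration; not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoLearningNet(nn.Module):
    def __init__(self, num_classes=9):
        super().__init__()
        # Shared encoder over RGB (3 ch) concatenated with raw stereo depth (1 ch).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Head 1: refined depth map (single channel, upsampled to input size).
        self.depth_head = nn.Sequential(
            nn.Conv2d(64, 1, 3, padding=1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
        # Head 2: per-pixel semantic logits.
        self.seg_head = nn.Sequential(
            nn.Conv2d(64, num_classes, 3, padding=1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, rgb, raw_depth):
        x = torch.cat([rgb, raw_depth], dim=1)   # B x 4 x H x W
        feat = self.encoder(x)
        return self.depth_head(feat), self.seg_head(feat)

# Joint training objective: depth regression + semantic cross-entropy.
net = CoLearningNet()
rgb = torch.rand(1, 3, 128, 128)
raw_depth = torch.rand(1, 1, 128, 128)
depth_pred, seg_logits = net(rgb, raw_depth)

depth_gt = torch.rand(1, 1, 128, 128)
seg_gt = torch.randint(0, 9, (1, 128, 128))
loss = F.l1_loss(depth_pred, depth_gt) + F.cross_entropy(seg_logits, seg_gt)
```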
Dense Voxel 3D Reconstruction Using a Monocular Event Camera
Event cameras are sensors inspired by biological systems that specialize in
capturing changes in brightness. These emerging cameras offer many advantages
over conventional frame-based cameras, including high dynamic range, high
temporal resolution, and extremely low power consumption. Due to these
advantages, event cameras have increasingly been adopted in fields such as
frame interpolation, semantic segmentation, odometry, and SLAM. However, their
use in 3D reconstruction for VR applications remains underexplored. Previous
methods in this field mainly focused on 3D reconstruction through depth map
estimation. Methods that produce dense 3D reconstruction generally require
multiple cameras, while methods that utilize a single event camera can only
produce a semi-dense result. Other single-camera methods that can produce dense
3D reconstruction rely on pipelines that incorporate either the
aforementioned methods or existing Structure from Motion (SfM) or
Multi-view Stereo (MVS) methods. In this paper, we propose a novel approach for
solving dense 3D reconstruction using only a single event camera. To the best
of our knowledge, our work is the first attempt in this regard. Our preliminary
results demonstrate that the proposed method can produce visually
distinguishable dense 3D reconstructions directly without requiring pipelines
like those used by existing methods. Additionally, we have created a synthetic
dataset with object scans using an event camera simulator. This
dataset will help accelerate other relevant research in this field.
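Learning-based event-camera pipelines typically first convert the asynchronous event stream into a tensor representation. The snippet below is a hedged sketch of one common choice, a spatio-temporal voxel grid over events (x, y, t, polarity); it is an assumed preprocessing step shown for illustration, not the paper's stated method.

```python
# Illustrative helper: bin an event stream into a spatio-temporal voxel grid,
# a common input representation for learning-based event-camera pipelines.
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """events: (N, 4) array of [x, y, t, polarity], polarity in {-1, +1}."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    # Normalize timestamps to [0, num_bins - 1] and pick a temporal bin per event.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = t_norm.astype(int)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    # Accumulate signed polarities into the grid.
    np.add.at(grid, (bins, ys, xs), events[:, 3])
    return grid

# Example: 10k random events on a 260x346 sensor, 5 temporal bins.
ev = np.column_stack([
    np.random.randint(0, 346, 10000),      # x
    np.random.randint(0, 260, 10000),      # y
    np.sort(np.random.rand(10000)),        # t
    np.random.choice([-1.0, 1.0], 10000),  # polarity
])
voxels = events_to_voxel_grid(ev, num_bins=5, height=260, width=346)
```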
SSR-2D: Semantic 3D Scene Reconstruction from 2D Images
Most deep learning approaches to comprehensive semantic modeling of 3D indoor
spaces require costly dense annotations in the 3D domain. In this work, we
explore a central 3D scene modeling task, namely, semantic scene reconstruction
without using any 3D annotations. The key idea of our approach is to design a
trainable model that employs both incomplete 3D reconstructions and their
corresponding source RGB-D images, fusing cross-domain features into volumetric
embeddings to predict complete 3D geometry, color, and semantics with only 2D
labeling, which can be either manual or machine-generated. Our key technical
innovation is to leverage differentiable rendering of color and semantics to
bridge 2D observations and unknown 3D space, using the observed RGB images and
2D semantics as supervision, respectively. We further develop a learning
pipeline and corresponding method to enable learning from imperfect predicted
2D labels, which can additionally be acquired by synthesizing an augmented set
of virtual training views that complement the original real captures, enabling
a more efficient self-supervision loop for semantics. In this work, we
propose an end-to-end trainable solution jointly addressing geometry
completion, colorization, and semantic mapping from limited RGB-D images,
without relying on any 3D ground-truth information. Our method achieves
state-of-the-art performance in semantic scene reconstruction on two
large-scale benchmark datasets, MatterPort3D and ScanNet, surpassing even
baselines trained with costly 3D annotations. To our knowledge, our method is
also the first 2D-driven method addressing completion and semantic
segmentation of real-world 3D scans.
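The core supervision idea described above, differentiable rendering of predicted color and semantics compared against observed RGB images and 2D labels, can be sketched as follows; the alpha-compositing scheme, sampling, and loss weighting here are illustrative assumptions rather than SSR-2D's exact formulation.

```python
# Minimal sketch of supervising 3D predictions through differentiable rendering:
# composite predicted color and semantic logits along camera rays, then compare
# against the observed RGB image and 2D semantic labels. Illustrative only.
import torch
import torch.nn.functional as F

def composite_along_rays(density, color, sem_logits):
    """density: (R, S), color: (R, S, 3), sem_logits: (R, S, C) for R rays, S samples."""
    alpha = 1.0 - torch.exp(-F.relu(density))                 # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
    )[:, :-1]                                                 # transmittance
    weights = alpha * trans                                   # (R, S)
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)          # (R, 3)
    sem = (weights.unsqueeze(-1) * sem_logits).sum(dim=1)     # (R, C)
    return rgb, sem

R, S, C = 1024, 32, 21
density = torch.randn(R, S, requires_grad=True)
color = torch.rand(R, S, 3, requires_grad=True)
sem_logits = torch.randn(R, S, C, requires_grad=True)
rgb_pred, sem_pred = composite_along_rays(density, color, sem_logits)

rgb_gt = torch.rand(R, 3)                  # observed RGB at the sampled pixels
sem_gt = torch.randint(0, C, (R,))         # 2D semantic labels (manual or predicted)
loss = F.mse_loss(rgb_pred, rgb_gt) + F.cross_entropy(sem_pred, sem_gt)
loss.backward()                            # gradients flow back into the 3D predictions
```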
CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction
Given the recent advances in depth prediction from Convolutional Neural
Networks (CNNs), this paper investigates how predicted depth maps from a deep
neural network can be deployed for accurate and dense monocular reconstruction.
We propose a method where CNN-predicted dense depth maps are naturally fused
together with depth measurements obtained from direct monocular SLAM. Our
fusion scheme privileges depth prediction in image locations where monocular
SLAM approaches tend to fail, e.g. along low-textured regions, and vice-versa.
We demonstrate the use of depth prediction for estimating the absolute scale of
the reconstruction, hence overcoming one of the major limitations of monocular
SLAM. Finally, we propose a framework to efficiently fuse semantic labels,
obtained from a single frame, with dense SLAM, yielding semantically coherent
scene reconstruction from a single view. Evaluation results on two benchmark
datasets show the robustness and accuracy of our approach.
Comment: 10 pages, 6 figures, IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR), Hawaii, USA, June 2017. The first two
authors contributed equally to this paper.
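The fusion scheme described above, which prefers CNN-predicted depth where monocular SLAM is unreliable (e.g., low-texture regions), can be approximated by a per-pixel weighted average. The NumPy sketch below uses an assumed inverse-variance weighting and synthetic data for illustration; it is not CNN-SLAM's exact uncertainty model.

```python
# Illustrative per-pixel fusion of CNN-predicted depth with semi-dense SLAM depth.
# Assumed inverse-variance weighting; pixels without a SLAM estimate (NaN) fall
# back to the CNN prediction. Not the paper's exact formulation.
import numpy as np

def fuse_depth(cnn_depth, slam_depth, cnn_var, slam_var):
    return np.where(
        np.isnan(slam_depth),
        cnn_depth,
        (cnn_depth / cnn_var + slam_depth / slam_var)
        / (1.0 / cnn_var + 1.0 / slam_var),
    )

h, w = 480, 640
cnn_depth = np.random.uniform(0.5, 5.0, (h, w))        # dense CNN prediction
slam_depth = np.full((h, w), np.nan)                    # semi-dense SLAM depth
mask = np.random.rand(h, w) < 0.2                       # e.g., high-gradient pixels only
slam_depth[mask] = np.random.uniform(0.5, 5.0, mask.sum())
cnn_var = np.full((h, w), 0.2)                          # assumed CNN uncertainty
slam_var = np.full((h, w), 0.05)                        # assumed SLAM uncertainty
fused = fuse_depth(cnn_depth, slam_depth, cnn_var, slam_var)
```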