GNeSF: Generalizable Neural Semantic Fields
3D scene segmentation based on neural implicit representation has recently
emerged, with the advantage of requiring only 2D supervision for training. However,
existing approaches still require expensive per-scene optimization that
prohibits generalization to novel scenes during inference. To circumvent this
problem, we introduce a generalizable 3D segmentation framework based on
implicit representation. Specifically, our framework takes in multi-view image
features and semantic maps as inputs instead of only spatial information to
avoid overfitting to scene-specific geometric and semantic information. We
propose a novel soft voting mechanism to aggregate the 2D semantic information
from different views for each 3D point. In addition to the image features, view
difference information is also encoded in our framework to predict the voting
scores. Intuitively, this allows the semantic information from nearby views to
contribute more compared to distant ones. Furthermore, a visibility module is
also designed to detect and filter out detrimental information from occluded
views. Due to the generalizability of our proposed method, we can synthesize
semantic maps or conduct 3D semantic segmentation for novel scenes with solely
2D semantic supervision. Experimental results show that our approach achieves
comparable performance with scene-specific approaches. More importantly, our
approach can even outperform existing strong supervision-based approaches with
only 2D annotations. Our source code is available at:
https://github.com/HLinChen/GNeSF (NeurIPS 2023)
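As a rough illustration of the soft voting mechanism described above, the sketch below combines per-view semantic probabilities using softmax-normalized voting scores while masking occluded views. The tensor names, shapes, and the `soft_vote` helper are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of soft voting over multi-view 2D semantics (illustrative
# shapes and names; not the GNeSF implementation).
import torch
import torch.nn.functional as F

def soft_vote(sem_maps: torch.Tensor, vote_logits: torch.Tensor,
              visibility: torch.Tensor) -> torch.Tensor:
    """Aggregate per-view semantics at each 3D point.

    sem_maps:    (V, N, C) semantic probabilities sampled at each point's
                 projection in V source views.
    vote_logits: (V, N) voting scores predicted from image features and
                 view-difference encodings (nearby views should score higher).
    visibility:  (V, N) bool; False marks views where the point is occluded.
    Assumes every point is visible in at least one view.
    """
    vote_logits = vote_logits.masked_fill(~visibility, float("-inf"))
    weights = F.softmax(vote_logits, dim=0)           # (V, N), sums to 1 over views
    return (weights.unsqueeze(-1) * sem_maps).sum(0)  # (N, C) fused semantics
```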
Learning to reconstruct and understand indoor scenes from sparse views
This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number of (e.g., 3~5) color images from uncalibrated sparse views, which significantly simplifies data acquisition and broadens applicable scenarios. To achieve promising 3D reconstruction from sparse views with limited overlap, our method first recovers the depth map and semantic information for each view, and then fuses the depth maps into a 3D scene. To this end, we design an iterative deep architecture, named IterNet, to estimate the depth map and semantic segmentation alternately. To obtain accurate alignment between views with limited overlap, we further propose a joint global and local registration method to reconstruct a 3D scene with semantic information. We also make available a new indoor synthetic dataset containing photorealistic high-resolution RGB images, accurate depth maps, and pixel-level semantic labels for thousands of complex layouts. Experimental results on public datasets and our dataset demonstrate that our method achieves more accurate depth estimation, smaller semantic segmentation errors, and better 3D reconstruction results than state-of-the-art methods.
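The alternating estimation loop could be sketched roughly as follows; `depth_net`, `seg_net`, the zero initialization, and the iteration count are hypothetical stand-ins rather than the paper's IterNet architecture.

```python
# Rough sketch of alternating depth/semantics refinement (hypothetical
# modules; not the paper's IterNet).
import torch
import torch.nn as nn

class AlternatingRefiner(nn.Module):
    def __init__(self, depth_net: nn.Module, seg_net: nn.Module,
                 num_classes: int, iters: int = 3):
        super().__init__()
        self.depth_net = depth_net  # (B, 3 + num_classes, H, W) -> (B, 1, H, W)
        self.seg_net = seg_net      # (B, 3 + 1, H, W) -> (B, num_classes, H, W)
        self.num_classes = num_classes
        self.iters = iters

    def forward(self, image: torch.Tensor):
        b, _, h, w = image.shape
        sem = image.new_zeros(b, self.num_classes, h, w)  # uninformative start
        for _ in range(self.iters):
            # Each pass conditions depth on current semantics, then refines
            # semantics with the updated depth.
            depth = self.depth_net(torch.cat([image, sem], dim=1))
            sem = self.seg_net(torch.cat([image, depth], dim=1))
        return depth, sem
```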
Implicit Ray-Transformers for Multi-view Remote Sensing Image Segmentation
The mainstream CNN-based remote sensing (RS) image semantic segmentation
approaches typically rely on massive labeled training data. Such a paradigm
struggles with the problem of RS multi-view scene segmentation with limited
labeled views due to the lack of considering 3D information within the scene.
In this paper, we propose "Implicit Ray-Transformer (IRT)" based on Implicit
Neural Representation (INR), for RS scene semantic segmentation with sparse
labels (such as 4-6 labels per 100 images). We explore a new way of introducing
multi-view 3D structure priors to the task for accurate and view-consistent
semantic segmentation. The proposed method includes a two-stage learning
process. In the first stage, we optimize a neural field to encode the color and
3D structure of the remote sensing scene based on multi-view images. In the
second stage, we design a Ray Transformer to leverage the relations between the
neural field 3D features and 2D texture features for learning better semantic
representations. Different from previous methods that only consider 3D prior or
2D features, we incorporate additional 2D texture information and 3D prior by
broadcasting CNN features to different point features along the sampled ray. To
verify the effectiveness of the proposed method, we construct a challenging
dataset containing six synthetic sub-datasets collected from the Carla platform
and three real sub-datasets from Google Maps. Experiments show that the
proposed method outperforms the CNN-based methods and the state-of-the-art
INR-based segmentation methods both quantitatively and qualitatively.
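The broadcasting step, where one 2D CNN feature per ray is shared by all 3D samples on that ray, could look like the sketch below; the shapes and the function name are illustrative assumptions, not the IRT code.

```python
# Sketch of fusing per-ray 2D texture features with per-point 3D neural-field
# features before the Ray Transformer (illustrative shapes only).
import torch

def broadcast_ray_features(point_feats: torch.Tensor,
                           ray_feats: torch.Tensor) -> torch.Tensor:
    """point_feats: (R, S, D3) neural-field features, S samples per ray.
    ray_feats:   (R, D2) CNN texture features, one per sampled ray.
    Returns (R, S, D3 + D2) tokens: each ray's 2D feature is repeated
    across all of its samples and concatenated to the 3D features."""
    r, s, _ = point_feats.shape
    expanded = ray_feats.unsqueeze(1).expand(r, s, ray_feats.shape[-1])
    return torch.cat([point_feats, expanded], dim=-1)
```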
Temporal Bird’s Eye View for 3D Semantic Segmentation
With the growing importance of autonomous robots and vehicles, 3D semantic segmentation, a key task of 3D scene understanding, has become increasingly important. Despite its sequential nature in real-time scenarios, 3D semantic segmentation is often approached as a single-frame problem. However, temporal dependencies and information offer great potential to improve the predictions. We therefore propose a recurrent temporal architecture for 3D semantic segmentation that exploits temporal information at both the input and feature stages to maximize the temporal benefits. Aggregated point clouds in bird’s eye view increase the information provided to the backbone, and temporally fused feature maps exploit temporal dependencies at the feature level. Experiments conducted on a challenging, large-scale outdoor dataset show considerable improvements over a single-frame baseline. The temporal information improves the results for every individual class.
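Feature-level temporal fusion of BEV maps could be realized with a convolutional GRU along the lines of the sketch below; this is a generic stand-in, not the paper's recurrent module.

```python
# Generic convolutional GRU cell for fusing the current BEV feature map with
# the previous fused state (a stand-in, not the paper's architecture).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel, padding=pad)
        self.cand = nn.Conv2d(2 * channels, channels, kernel, padding=pad)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: current BEV features, h: previous hidden state, both (B, C, H, W).
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # gated blend of old and new state
```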
Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
Precise 3D environmental mapping is pivotal in robotics. Existing methods
often rely on predefined concepts during training or are time-intensive when
generating semantic maps. This paper presents Open-Fusion, a groundbreaking
approach for real-time open-vocabulary 3D mapping and queryable scene
representation using RGB-D data. Open-Fusion harnesses the power of a
pre-trained vision-language foundation model (VLFM) for open-set semantic
comprehension and employs the Truncated Signed Distance Function (TSDF) for
swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based
embeddings and their associated confidence maps. These are then integrated with
3D knowledge from TSDF using an enhanced Hungarian-based feature-matching
mechanism. Notably, Open-Fusion delivers outstanding annotation-free,
open-vocabulary 3D segmentation without requiring additional 3D training.
Benchmark tests on the ScanNet dataset against leading zero-shot methods
highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the
strengths of region-based VLFM and TSDF, facilitating real-time 3D scene
comprehension that includes object concepts and open-world semantics. We
encourage readers to view the demos on our project page:
https://uark-aicv.github.io/OpenFusion
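The Hungarian-based feature matching could be sketched as an optimal assignment between new region embeddings and embeddings already stored in the map; the cosine-similarity cost and the threshold below are assumptions, not Open-Fusion's actual criteria.

```python
# Sketch of Hungarian matching between new VLFM region embeddings and map
# embeddings (similarity measure and threshold are assumptions).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions(new_emb: np.ndarray, map_emb: np.ndarray,
                  min_sim: float = 0.7):
    """new_emb: (N, D) and map_emb: (M, D), rows L2-normalized.
    Returns (new_idx, map_idx) pairs whose similarity clears the threshold."""
    sim = new_emb @ map_emb.T                  # (N, M) cosine similarities
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]
```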
Lifting GIS Maps into Strong Geometric Context for Scene Understanding
Contextual information can have a substantial impact on the performance of
visual tasks such as semantic segmentation, object detection, and geometric
estimation. Data stored in Geographic Information Systems (GIS) offers a rich
source of contextual information that has been largely untapped by computer
vision. We propose to leverage such information for scene understanding by
combining GIS resources with large sets of unorganized photographs using
Structure from Motion (SfM) techniques. We present a pipeline to quickly
generate strong 3D geometric priors from 2D GIS data using SfM models aligned
with minimal user input. Given an image resectioned against this model, we
generate robust predictions of depth, surface normals, and semantic labels. We
show that the predicted geometry is substantially more accurate than that of
other single-image depth estimation methods. We then demonstrate the
utility of these contextual constraints for re-scoring pedestrian detections,
and use these GIS contextual features alongside object detection score maps to
improve a CRF-based semantic segmentation framework, boosting accuracy over
baseline models.
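One way to turn an aligned 3D model into a per-pixel geometric prior for a resectioned image is to z-buffer the projected model points, as in the simplified sketch below; the interface is an assumption, and the actual pipeline involves much more (surface normals, semantic transfer, CRF integration).

```python
# Simplified sketch: z-buffer aligned model points into a depth prior for a
# resectioned image (illustrative interface, not the paper's pipeline).
import numpy as np

def depth_prior(points_w: np.ndarray, K: np.ndarray, R: np.ndarray,
                t: np.ndarray, hw: tuple) -> np.ndarray:
    """points_w: (P, 3) model points in world coords; K: (3, 3) intrinsics;
    R, t: world-to-camera rotation and translation; hw: (H, W) image size."""
    h, w = hw
    p_cam = points_w @ R.T + t                # camera-frame coordinates
    front = p_cam[:, 2] > 0                   # keep points in front of camera
    uvz = p_cam[front] @ K.T                  # project with intrinsics
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    np.minimum.at(depth, (v[ok], u[ok]), uvz[:, 2][ok])  # nearest point wins
    return depth
```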