1,005 research outputs found
Understanding Dark Scenes by Contrasting Multi-Modal Observations
Understanding dark scenes based on multi-modal image data is challenging, as
both the visible and auxiliary modalities provide limited semantic information
for the task. Previous methods focus on fusing the two modalities but neglect
the correlations among semantic classes when minimizing losses to align pixels
with labels, resulting in inaccurate class predictions. To address these
issues, we introduce a supervised multi-modal contrastive learning approach to
increase the semantic discriminability of the learned multi-modal feature
spaces by jointly performing cross-modal and intra-modal contrast under the
supervision of the class correlations. The cross-modal contrast encourages
same-class embeddings from across the two modalities to be closer and pushes
different-class ones apart. The intra-modal contrast forces same-class or
different-class embeddings within each modality to be together or apart. We
validate our approach on a variety of tasks that cover diverse light conditions
and image modalities. Experiments show that our approach can effectively
enhance dark scene understanding based on multi-modal images with limited
semantics by shaping semantic-discriminative feature spaces. Comparisons with
previous methods demonstrate our state-of-the-art performance. Code and
pretrained models are available at https://github.com/palmdong/SMMCL
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research
AEGIS-Net: Attention-guided Multi-Level Feature Aggregation for Indoor Place Recognition
We present AEGIS-Net, a novel indoor place recognition model that takes in
RGB point clouds and generates global place descriptors by aggregating
lower-level color, geometry features and higher-level implicit semantic
features. However, rather than simple feature concatenation, self-attention
modules are employed to select the most important local features that best
describe an indoor place. Our AEGIS-Net is made of a semantic encoder, a
semantic decoder and an attention-guided feature embedding. The model is
trained in a 2-stage process with the first stage focusing on an auxiliary
semantic segmentation task and the second one on the place recognition task. We
evaluate our AEGIS-Net on the ScanNetPR dataset and compare its performance
with a pre-deep-learning feature-based method and five state-of-the-art
deep-learning-based methods. Our AEGIS-Net achieves exceptional performance and
outperforms all six methods.Comment: Accepted by 2024 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 2024
Learning to reconstruct and understand indoor scenes from sparse views
This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation for indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number of (e.g., 3~5) color images from uncalibrated sparse views, which significantly simplifies data acquisition and broadens applicable scenarios. To achieve promising 3D reconstruction from sparse views with limited overlap, our method first recovers the depth map and semantic information for each view, and then fuses the depth maps into a 3D scene. To this end, we design an iterative deep architecture, named IterNet, to estimate the depth map and semantic segmentation alternately. To obtain accurate alignment between views with limited overlap, we further propose a joint global and local registration method to reconstruct a 3D scene with semantic information. We also make available a new indoor synthetic dataset, containing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts. Experimental results on public datasets and our dataset demonstrate that our method achieves more accurate depth estimation, smaller semantic segmentation errors, and better 3D reconstruction results over state-of-the-art methods
DeepDR: Deep Structure-Aware RGB-D Inpainting for Diminished Reality
Diminished reality (DR) refers to the removal of real objects from the
environment by virtually replacing them with their background. Modern DR
frameworks use inpainting to hallucinate unobserved regions. While recent deep
learning-based inpainting is promising, the DR use case is complicated by the
need to generate coherent structure and 3D geometry (i.e., depth), in
particular for advanced applications, such as 3D scene editing. In this paper,
we propose DeepDR, a first RGB-D inpainting framework fulfilling all
requirements of DR: Plausible image and geometry inpainting with coherent
structure, running at real-time frame rates, with minimal temporal artifacts.
Our structure-aware generative network allows us to explicitly condition color
and depth outputs on the scene semantics, overcoming the difficulty of
reconstructing sharp and consistent boundaries in regions with complex
backgrounds. Experimental results show that the proposed framework can
outperform related work qualitatively and quantitatively.Comment: 11 pages, 8 figures + 13 pages, 10 figures supplementary. Accepted at
3DV 202
- …