Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.
S4Net: Single Stage Salient-Instance Segmentation
We consider an interesting problem in this paper: salient instance
segmentation. In addition to producing bounding boxes, our network also outputs
high-quality instance-level segments. Taking into account the
category-independent property of each target, we design a single stage salient
instance segmentation framework, with a novel segmentation branch. Our new
branch regards not only local context inside each detection window but also its
surrounding context, enabling it to distinguish instances in the same scope
even under occlusion. Our network is end-to-end trainable and runs at a fast
speed (40 fps when processing an image with resolution 320x320). We evaluate
our approach on a publicly available benchmark and show that it outperforms
other alternative solutions. We also provide a thorough analysis of the design
choices to help readers better understand the functions of each part of our
network. The source code can be found at
https://github.com/RuochenFan/S4Net
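The abstract does not say how the surrounding context of a detection window is gathered. As a rough illustration only, one simple way to take in context is to enlarge the box about its center before pooling features. The function name `expand_box` and the `scale` value below are assumptions made for this sketch and are not taken from the S4Net code.

```python
def expand_box(box, scale, img_w, img_h):
    """Grow an (x1, y1, x2, y2) box about its center by `scale`, clipped to the image.

    Hypothetical helper for illustration; not part of the S4Net repository.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )


# A 100x100 detection inside a 320x320 image, enlarged by an assumed factor of 1.5.
detection = (100, 120, 200, 220)
context_window = expand_box(detection, scale=1.5, img_w=320, img_h=320)
print(context_window)
```

Features pooled from the enlarged window would cover both the instance and its immediate neighborhood, which is the kind of information the abstract credits with separating nearby instances.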
Two Stream Scene Understanding on Graph Embedding
The paper presents a novel two-stream network architecture for enhancing
scene understanding in computer vision. This architecture utilizes a graph
feature stream and an image feature stream, aiming to merge the strengths of
both modalities for improved performance in image classification and scene
graph generation tasks. The graph feature stream network comprises a
segmentation structure, scene graph generation, and a graph representation
module. The segmentation structure employs the UPSNet architecture with a
backbone that can be a residual network, ViT, or Swin Transformer. The scene
graph generation component focuses on extracting object labels and neighborhood
relationships from the semantic map to create a scene graph. Graph
Convolutional Networks (GCN), GraphSAGE, and Graph Attention Networks (GAT) are
employed for graph representation, with an emphasis on capturing node features
and their interconnections. The image feature stream network, on the other
hand, focuses on image classification through the use of Vision Transformer and
Swin Transformer models. The two streams are fused using various data fusion
methods. This fusion is designed to leverage the complementary strengths of
graph-based and image-based features. Experiments conducted on the ADE20K
dataset demonstrate the effectiveness of the proposed two-stream network in
improving image classification accuracy compared to conventional methods. This
research provides a significant contribution to the field of computer vision,
particularly in the areas of scene understanding and image classification, by
effectively combining graph-based and image-based approaches.
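The graph stream described above can be pictured with a small sketch, under several assumptions: the semantic map is a 2D grid of labels, "neighborhood relationships" means two labels touch in the 4-neighborhood, the graph representation is a single weight-free neighbor-averaging step (a simplified stand-in for GCN, GraphSAGE, or GAT, not any of them), and fusion is plain concatenation (one of many possible fusion methods). All function names and the toy data are hypothetical.

```python
def scene_graph_from_semantic_map(semantic_map):
    """Return (nodes, edges): labels present, and label pairs that touch (4-neighborhood)."""
    rows, cols = len(semantic_map), len(semantic_map[0])
    nodes = sorted({label for row in semantic_map for label in row})
    edges = set()
    for r in range(rows):
        for c in range(cols):
            here = semantic_map[r][c]
            # Checking right and down neighbors covers every adjacent pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and semantic_map[rr][cc] != here:
                    edges.add(tuple(sorted((here, semantic_map[rr][cc]))))
    return nodes, sorted(edges)


def mean_aggregate(nodes, edges, features):
    """Average each node's features with its neighbors' (self included); no learned weights."""
    neighbors = {n: [n] for n in nodes}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dim = len(next(iter(features.values())))
    return {
        n: [sum(features[m][d] for m in neighbors[n]) / len(neighbors[n]) for d in range(dim)]
        for n in nodes
    }


def fuse_by_concatenation(graph_vec, image_vec):
    """Late fusion by concatenating the two streams' feature vectors (assumed fusion method)."""
    return list(graph_vec) + list(image_vec)


# Toy semantic map; real maps would come from a segmentation network.
semantic_map = [
    ["sky", "sky", "sky"],
    ["tree", "sky", "building"],
    ["grass", "grass", "building"],
]
nodes, edges = scene_graph_from_semantic_map(semantic_map)

# One-hot node features, then one aggregation step and a mean readout over nodes.
features = {n: [1.0 if i == j else 0.0 for j in range(len(nodes))] for i, n in enumerate(nodes)}
node_embeddings = mean_aggregate(nodes, edges, features)
graph_vec = [sum(node_embeddings[n][d] for n in nodes) / len(nodes) for d in range(len(nodes))]

image_vec = [0.2, 0.7]  # placeholder for an image-stream embedding
fused = fuse_by_concatenation(graph_vec, image_vec)
```

The fused vector would then feed a classifier head. In a trained system each step would carry learned parameters, so this sketch only shows how data flows between the two streams.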