Part2Word: Learning Joint Embedding of Point Clouds and Text by Matching Parts to Words
Learning a joint embedding of 3D shapes and text is important for shape
understanding tasks such as shape-text matching, retrieval, and shape
captioning. Current multi-view based methods learn a mapping from multiple
rendered views to text. However, these methods cannot analyze 3D shapes well
due to self-occlusion and the limitations of the learned manifolds. To resolve
this issue, we propose a method that learns a joint embedding of point clouds
and text by matching parts from shapes to words from sentences in a common
space. Specifically, we first learn a segmentation prior to segment point
clouds into parts. Then, we map parts and words into an optimized space where
they can be matched with each other. In this space, a part is represented by
aggregating the features of all points within it, while each word is
represented with its context information, and the network is trained to
minimize a triplet ranking loss. Moreover, we introduce cross-modal attention
to capture part-word relationships during matching, which further enhances
joint embedding learning. Our method outperforms the state of the art in
multi-modal retrieval on a widely used benchmark.
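To make the ranking objective concrete, the following is a minimal sketch, in PyTorch, of a bidirectional triplet ranking loss over shape and sentence embeddings in a shared space, of the kind the abstract describes. The tensor names (`shape_emb`, `text_emb`) and the all-in-batch-negatives formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(shape_emb, text_emb, margin=0.2):
    """Hinge-based ranking loss over all in-batch negatives.

    shape_emb, text_emb: (B, D) embeddings where row i of each tensor
    corresponds to the same matched shape-text pair.
    """
    shape_emb = F.normalize(shape_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = shape_emb @ text_emb.t()            # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)             # similarity of matched pairs

    # Penalize negatives that come within `margin` of the positive pair,
    # in both retrieval directions (shape -> text and text -> shape).
    cost_s2t = (margin + sim - pos).clamp(min=0)
    cost_t2s = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_s2t.masked_fill(mask, 0).mean()
            + cost_t2s.masked_fill(mask, 0).mean())
```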
UniDA3D: Unified Domain Adaptive 3D Semantic Segmentation Pipeline
State-of-the-art 3D semantic segmentation models are trained on off-the-shelf
public benchmarks, but they inevitably suffer a drop in recognition accuracy
when deployed to a new domain. In this paper, we introduce a Unified Domain
Adaptive 3D semantic segmentation pipeline (UniDA3D) to strengthen this weak
generalization ability and bridge the point distribution gap between domains.
Unlike previous studies that focus on a single adaptation task, UniDA3D can
tackle several adaptation tasks in the 3D segmentation field by designing a
unified source-and-target active sampling strategy, which selects a maximally
informative subset from both the source and target domains for effective model
adaptation. Moreover, benefiting from the rise of multi-modal 2D-3D datasets,
UniDA3D investigates a multi-modal sampling strategy by developing a
cross-modality feature interaction module that extracts representative pairs
of image and point features and performs bi-directional image-point feature
interaction for safe model adaptation. Experimentally, UniDA3D is verified to
be effective in many adaptation tasks, including: 1) unsupervised domain
adaptation; 2) unsupervised few-shot domain adaptation; and 3) active domain
adaptation. The results demonstrate that, by easily coupling UniDA3D with
off-the-shelf 3D segmentation baselines, the domain generalization ability of
these baselines can be enhanced.
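The abstract does not specify the informativeness criterion used by the active sampling strategy; the sketch below uses predictive entropy purely as a stand-in to illustrate the idea of selecting a budgeted, maximally informative subset from each domain. Function and variable names are hypothetical, not the UniDA3D code.

```python
import numpy as np

def select_informative_subset(probs, budget):
    """Pick the `budget` samples with the highest predictive entropy.

    probs: (N, C) per-sample class probabilities from the current model.
    Returns the indices of the selected samples.
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src_probs = rng.dirichlet(np.ones(13), size=1000)  # stand-in source predictions
    tgt_probs = rng.dirichlet(np.ones(13), size=1000)  # stand-in target predictions
    src_idx = select_informative_subset(src_probs, budget=64)
    tgt_idx = select_informative_subset(tgt_probs, budget=64)
    # The union of the two subsets would then drive adaptation of the model.
```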
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation
In 3D Referring Expression Segmentation (3D-RES), earlier approaches adopt a
two-stage paradigm, extracting segmentation proposals and then matching them
with referring expressions. However, this conventional paradigm encounters
significant challenges, most notably in terms of the generation of lackluster
initial proposals and a pronounced deceleration in inference speed. Recognizing
these limitations, we introduce an innovative end-to-end Superpoint-Text
Matching Network (3D-STMN) that is enriched by dependency-driven insights. One
of the keystones of our model is the Superpoint-Text Matching (STM) mechanism.
Unlike traditional methods that navigate through instance proposals, STM
directly correlates linguistic indications with their respective superpoints,
clusters of semantically related points. This architectural decision empowers
our model to efficiently harness cross-modal semantic relationships, primarily
leveraging densely annotated superpoint-text pairs rather than sparser
instance-text pairs. To strengthen the role of text in guiding
the segmentation process, we further incorporate the Dependency-Driven
Interaction (DDI) module to deepen the network's semantic comprehension of
referring expressions. Using the dependency trees as a beacon, this module
discerns the intricate relationships between primary terms and their associated
descriptors in expressions, thereby elevating both the localization and
segmentation capacities of our model. Comprehensive experiments on the
ScanRefer benchmark reveal that our model not only sets a new performance
standard, registering an mIoU gain of 11.7 points, but also achieves a
substantial improvement in inference speed, surpassing traditional methods by
95.7 times. The code and models are available at
https://github.com/sosppxo/3D-STMN
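As a rough illustration of the Superpoint-Text Matching idea, the sketch below scores each superpoint against the tokens of a referring expression via a scaled dot-product affinity. The shapes, the max-over-words aggregation, and the function name are assumptions for exposition; this is not the 3D-STMN code.

```python
import torch

def superpoint_text_matching(word_feats, superpoint_feats):
    """word_feats: (W, D) token features of the referring expression.
    superpoint_feats: (S, D) pooled features of S superpoints.
    Returns an (S,) score per superpoint for belonging to the referred object.
    """
    d = word_feats.size(-1)
    # Cross-modal affinity between every word and every superpoint.
    affinity = word_feats @ superpoint_feats.t() / d ** 0.5   # (W, S)
    # Aggregate over words (max as a simple choice) and squash to [0, 1];
    # thresholding these scores would yield the predicted segment.
    return torch.sigmoid(affinity.max(dim=0).values)
```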
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss unresolved issues and outline
several potential avenues for future research.
RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization
We study an important, yet largely unexplored problem of large-scale
cross-modal visual localization by matching ground RGB images to a
geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior
works were demonstrated on small datasets and did not lend themselves to
scaling up for large-scale applications. To enable large-scale evaluation, we
introduce a new dataset containing over 550K pairs (covering 143 km^2 area) of
RGB and aerial LIDAR depth images. We propose a novel joint embedding based
method that effectively combines the appearance and semantic cues from both
modalities to handle drastic cross-modal variations. Experiments on the
proposed dataset show that our model achieves a strong result of a median rank
of 5 in matching across a large test set of 50K location pairs collected from a
14 km^2 area. This represents a significant advancement over prior works in
performance and scale. We conclude with qualitative results to highlight the
challenging nature of this task and the benefits of the proposed model. Our
work provides a foundation for further research in cross-modal visual localization.
Comment: ACM Multimedia 202
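For clarity on the reported metric, here is a small, illustrative computation of median rank for cross-modal retrieval: each ground RGB query is ranked against all geo-referenced LIDAR depth embeddings by cosine similarity and the rank of its true location is recorded. Variable names and the cosine-similarity choice are assumptions, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def median_rank(rgb_emb, depth_emb):
    """rgb_emb, depth_emb: (N, D) embeddings; row i of each is the same location."""
    rgb_emb = F.normalize(rgb_emb, dim=-1)
    depth_emb = F.normalize(depth_emb, dim=-1)
    sim = rgb_emb @ depth_emb.t()                     # (N, N) query-by-candidate
    true_sim = sim.diag().unsqueeze(1)
    # Rank of the true match = number of candidates scoring at least as high
    # (rank 1 is a perfect retrieval).
    ranks = (sim >= true_sim).sum(dim=1)
    return ranks.float().median().item()
```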
PST: Plant segmentation transformer for 3D point clouds of rapeseed plants at the podding stage
Segmentation of plant point clouds to obtain high-precision morphological
traits is essential for plant phenotyping. Although the fast development of
deep learning has boosted much research on the segmentation of plant point
clouds, previous studies mainly focus on hard voxelization-based or
down-sampling-based methods, which are limited to segmenting simple plant
organs. Segmentation of complex plant point clouds at high spatial resolution
remains challenging. In this study, we propose a deep learning network, the
plant segmentation transformer (PST), to achieve semantic and instance
segmentation of rapeseed plant point clouds acquired by handheld laser
scanning (HLS) at high spatial resolution, which can characterize the tiny
siliques that are the main targeted traits. PST is composed of: (i) a
dynamic voxel feature encoder (DVFE) to aggregate the point features with the
raw spatial resolution; (ii) the dual window sets attention blocks to capture
the contextual information; and (iii) a dense feature propagation module to
obtain the final dense point feature map. The results showed that PST and
PST-PointGroup (PG) achieved superior performance in semantic and instance
segmentation tasks. For semantic segmentation, the mean IoU, mean
Precision, mean Recall, mean F1-score, and overall accuracy of PST were 93.96%,
97.29%, 96.52%, 96.88%, and 97.07%, achieving an improvement of 7.62%, 3.28%,
4.8%, 4.25%, and 3.88% compared to the second-best state-of-the-art network
PAConv. For instance segmentation, PST-PG reached 89.51%, 89.85%, 88.83% and
82.53% in mCov, mWCov, mPerc90, and mRec90, achieving an improvement of 2.93%,
2.21%, 1.99%, and 5.9% compared to the original PG. This study demonstrates
that deep-learning-based point cloud segmentation methods have great potential
for resolving dense plant point clouds with complex morphological traits.
Comment: 46 pages, 10 figures
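As a reference for the semantic segmentation numbers quoted above, the following sketch computes mean IoU, mean precision, mean recall, mean F1-score, and overall accuracy from a per-class confusion matrix. It mirrors the standard definitions of these metrics and is not the authors' evaluation code.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf: (C, C) confusion matrix; conf[i, j] = points of class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return {"mIoU": iou.mean(), "mPrecision": precision.mean(),
            "mRecall": recall.mean(), "mF1": f1.mean(),
            "OverallAcc": tp.sum() / conf.sum()}
```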