PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
Masked Image Modeling (MIM) has achieved promising progress with the advent
of Masked Autoencoders (MAE) and BEiT. However, subsequent works have
complicated the framework with new auxiliary tasks or extra pre-trained models,
inevitably increasing computational overhead. This paper undertakes a
fundamental analysis of MIM from the perspective of pixel reconstruction, which
examines the input image patches and reconstruction target, and highlights two
critical but previously overlooked bottlenecks. Based on this analysis, we
propose a remarkably simple and effective method, PixMIM, that entails
two strategies: 1) filtering the high-frequency components from the
reconstruction target to de-emphasize the network's focus on texture-rich
details and 2) adopting a conservative data transform strategy to alleviate the
problem of missing foreground in MIM training. PixMIM can be easily
integrated into most existing pixel-based MIM approaches (i.e., using raw images
as the reconstruction target) with negligible additional computation. Without bells
and whistles, our method consistently improves three MIM approaches, MAE,
ConvMAE, and LSMAE, across various downstream tasks. We believe this effective
plug-and-play method will serve as a strong baseline for self-supervised
learning and provide insights for future improvements of the MIM framework.
Code and models are available at
https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim.
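To make strategy 1) concrete, here is a minimal sketch (not the authors' released code) of producing a low-frequency reconstruction target with an ideal frequency-domain filter; the cutoff parameter `radius` is a hypothetical knob, not PixMIM's documented setting:

```python
# Hypothetical sketch: low-pass-filter an image batch to use as the MIM
# reconstruction target, suppressing texture-rich high frequencies.
import torch

def lowpass_target(img: torch.Tensor, radius: float = 0.25) -> torch.Tensor:
    """img: (B, C, H, W). Returns a low-frequency reconstruction target."""
    _, _, H, W = img.shape
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    # Ideal radial low-pass mask centred on the zero frequency.
    yy = torch.arange(H, device=img.device).view(-1, 1) - H / 2
    xx = torch.arange(W, device=img.device).view(1, -1) - W / 2
    mask = (torch.sqrt(yy ** 2 + xx ** 2) <= radius * min(H, W) / 2).to(img.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return low.real  # the loss would regress masked patches of this target
```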
Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation with Single-point Supervision
Instance segmentation on 3D point clouds has been attracting increasing
attention due to its wide applications, especially in scene understanding
areas. However, most existing methods operate on fully annotated data while
manually preparing ground-truth labels at point-level is very cumbersome and
labor-intensive. To address this issue, we propose a novel weakly supervised
method RWSeg that only requires labeling one object with one point. With these
sparse weak labels, we introduce a unified framework with two branches to
propagate semantic and instance information respectively to unknown regions
using self-attention and a cross-graph random walk method. Specifically, we
propose a Cross-graph Competing Random Walks (CRW) algorithm that encourages
competition among different instance graphs to resolve ambiguities in closely
placed objects, improving instance assignment accuracy. RWSeg generates
high-quality instance-level pseudo labels. Experimental results on ScanNet-v2
and S3DIS datasets show that our approach achieves comparable performance with
fully-supervised methods and outperforms previous weakly-supervised methods by
a substantial margin.
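As a rough illustration of the propagation idea (not RWSeg's actual algorithm), the sketch below diffuses sparse one-point instance seeds over a point-affinity graph and lets the instance graphs compete by renormalizing scores at every step; the names `A`, `seeds`, and the clamping scheme are assumptions:

```python
# Illustrative competing-random-walk propagation over a point cloud.
import numpy as np

def competing_random_walks(A, seeds, steps=50):
    """A: (N, N) nonnegative point affinities; seeds: (N, K) one-hot,
    one column per instance seeded from a single annotated point."""
    P = A / (A.sum(axis=1, keepdims=True) + 1e-8)   # row-stochastic transitions
    seeded = seeds.any(axis=1)
    scores = seeds.astype(float)
    for _ in range(steps):
        scores = P @ scores                          # diffuse every instance graph
        scores[seeded] = seeds[seeded]               # keep annotated points fixed
        scores /= scores.sum(axis=1, keepdims=True) + 1e-8  # instances compete
    return scores.argmax(axis=1)                     # per-point pseudo instance id
```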
Learning-Based Biharmonic Augmentation for Point Cloud Classification
Point cloud datasets often suffer from inadequate sample sizes in comparison
to image datasets, making data augmentation challenging. Traditional methods,
such as rigid transformations and scaling, have limited potential to increase
dataset diversity because they cannot alter the shapes of individual samples.
To address this, we introduce the Biharmonic Augmentation (BA) method. BA is a
novel and efficient data augmentation technique that diversifies point cloud
data by imposing smooth non-rigid deformations on existing 3D structures. This
approach calculates biharmonic coordinates for the deformation function and
learns diverse deformation prototypes. Utilizing a CoefNet, our method predicts
coefficients to amalgamate these prototypes, ensuring comprehensive
deformation. Moreover, we present AdvTune, an advanced online augmentation
system that integrates adversarial training. This system synergistically
refines the CoefNet and the classification network, facilitating the automated
creation of adaptive shape deformations contingent on the learner status.
Comprehensive experimental analysis validates the superiority of Biharmonic
Augmentation, showcasing notable performance improvements over prevailing point
cloud augmentation techniques across varied network designs.
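The core blending step can be pictured as follows; `B` (the biharmonic coordinates), the prototype tensor, and the CoefNet output are stand-ins with assumed shapes, not the paper's exact interfaces:

```python
# Hedged sketch: blend learned deformation prototypes with predicted
# coefficients to produce a smooth non-rigid warp of a point cloud.
import torch

def biharmonic_augment(points, B, protos, coefs):
    """points: (N, 3) input cloud; B: (N, H) biharmonic coordinates over H
    control handles; protos: (K, H, 3) handle-displacement prototypes;
    coefs: (K,) mixing weights predicted by a CoefNet-like module."""
    handle_disp = torch.einsum("k,khc->hc", coefs, protos)  # blended handle moves
    return points + B @ handle_disp                          # smooth, non-rigid warp
```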
Multi-Path Region Mining For Weakly Supervised 3D Semantic Segmentation on Point Clouds
Point clouds provide intrinsic geometric information and surface context for
scene understanding. Existing methods for point cloud segmentation require a
large amount of fully labeled data. With advanced depth sensors, collecting
large-scale 3D datasets is no longer cumbersome. However, manually
producing point-level labels for such large-scale datasets is time-consuming
and labor-intensive. In this paper, we propose a weakly supervised approach to
predict point-level results using weak labels on 3D point clouds. We introduce
our multi-path region mining module to generate pseudo point-level labels from a
classification network trained with weak labels. It mines the localization cues
for each class from various aspects of the network feature using different
attention modules. Then, we use the point-level pseudo labels to train a point
cloud segmentation network in a fully supervised manner. To the best of our
knowledge, this is the first method that uses cloud-level weak labels on raw 3D
space to train a point cloud semantic segmentation network. In our setting, the
3D weak labels only indicate the classes that appear in the input sample. We
discuss both scene- and subcloud-level weak labels on raw 3D point cloud data
and perform in-depth experiments on them. On the ScanNet dataset, our results
trained with subcloud-level labels are comparable with some fully supervised
methods.
Comment: Accepted by CVPR 2020
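The pseudo-labeling step can be pictured with the hedged sketch below: per-point class-activation scores from the classification network are restricted to the classes named in the weak label and then confidence-thresholded; the threshold `tau` and all tensor names are illustrative, not the paper's interface:

```python
# Illustrative conversion of per-point class scores into pseudo labels.
import torch

def point_pseudo_labels(point_cam, present, tau=0.6):
    """point_cam: (N, C) per-point class scores mined from the network;
    present: (C,) bool mask of classes in the cloud-level weak label."""
    point_cam = point_cam.masked_fill(~present, float("-inf"))  # obey weak label
    conf, labels = point_cam.softmax(dim=1).max(dim=1)
    labels[conf < tau] = -1   # low-confidence points ignored by the seg loss
    return labels
```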
Bi-Mapper: Holistic BEV Semantic Mapping for Autonomous Driving
A semantic map of the road scene, covering fundamental road elements, is an
essential ingredient in autonomous driving systems. It provides important
perception foundations for positioning and planning when rendered in the
Bird's-Eye-View (BEV). Currently, prior knowledge of hypothetical depth can
directly guide the learning of translating front perspective views into BEV
with the help of calibration parameters. However, this approach suffers from geometric
distortions in the representation of distant objects. In addition, another
stream of methods without prior knowledge can learn the transformation between
front perspective views and BEV implicitly with a global view. Considering that
the fusion of different learning methods may bring surprising beneficial
effects, we propose a Bi-Mapper framework for top-down road-scene semantic
understanding, which incorporates a global view and local prior knowledge. To
enhance reliable interaction between them, an asynchronous mutual learning
strategy is proposed. At the same time, an Across-Space Loss (ASL) is designed
to mitigate the negative impact of geometric distortions. Extensive experiments
on the nuScenes and Cam2BEV datasets verify the consistent effectiveness of each
module in the proposed Bi-Mapper framework. Compared with existing road mapping
networks, the proposed Bi-Mapper achieves 2.1% higher IoU on the nuScenes
dataset. Moreover, we verify the generalization performance of Bi-Mapper in a
real-world driving scenario. The source code is publicly available at
https://github.com/lynn-yu/Bi-Mapper.
Comment: Accepted to IEEE Robotics and Automation Letters (RA-L).
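For intuition only, a generic mutual-learning objective between the two branches could look like the following; the warm-up gate standing in for the asynchronous schedule is our assumption about the general idea, not the paper's actual strategy, and the Across-Space Loss is not shown:

```python
# Rough sketch of asynchronous mutual learning between two BEV branches.
import torch.nn.functional as F

def mutual_learning_loss(logits_global, logits_prior, step, warmup=1000):
    """logits_*: (B, C, H, W) BEV segmentation logits from the two branches."""
    log_g = F.log_softmax(logits_global, dim=1)
    log_p = F.log_softmax(logits_prior, dim=1)
    # The global branch always learns from the prior branch's frozen output.
    loss = F.kl_div(log_g, log_p.detach().exp(), reduction="batchmean")
    if step > warmup:  # the prior branch joins later, hence "asynchronous"
        loss = loss + F.kl_div(log_p, log_g.detach().exp(), reduction="batchmean")
    return loss
```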
EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object
Segmentation (R-VOS) are two highly related tasks, both aiming to segment
specific objects from video sequences according to user-provided expression
prompts. However, due to the challenges in modeling representations for
different modalities, contemporary methods struggle to strike a balance between
interaction flexibility and high-precision localization and segmentation. In
this paper, we address this problem from two perspectives: the aligned
representation of audio and text, and the deep interaction among audio, text,
and visual features. First, we propose a universal architecture, the Expression
Prompt Collaboration Transformer, termed EPCFormer. Next, we propose an
Expression Alignment (EA) mechanism for audio and text expressions. By
introducing contrastive learning for audio and text expressions, the proposed
EPCFormer realizes comprehension of the semantic equivalence between audio and
text expressions denoting the same objects. Then, to facilitate deep
interactions among audio, text, and video features, we introduce an
Expression-Visual Attention (EVA) mechanism. The knowledge of video object
segmentation in terms of the expression prompts can seamlessly transfer between
the two tasks by deeply exploring complementary cues between text and audio.
Experiments on well-recognized benchmarks demonstrate that our universal
EPCFormer attains state-of-the-art results on both tasks. The source code of
EPCFormer will be made publicly available at
https://github.com/lab206/EPCFormer.
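A minimal InfoNCE-style sketch of such audio-text alignment appears below; this is the generic symmetric contrastive formulation, not necessarily EPCFormer's exact Expression Alignment loss:

```python
# Generic symmetric contrastive alignment of audio and text embeddings.
import torch
import torch.nn.functional as F

def expression_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (B, D); row i of both describes the same object."""
    a = F.normalize(audio_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = a @ t.T / temperature                    # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy pulls matched audio-text pairs together.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```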
MoDA: Modeling Deformable 3D Objects from Casual Videos
In this paper, we focus on the challenges of modeling deformable 3D objects
from casual videos. With the popularity of neural radiance fields (NeRF), many
works extend it to dynamic scenes with a canonical NeRF and a deformation model
that achieves 3D point transformation between the observation space and the
canonical space. Recent works rely on linear blend skinning (LBS) to achieve
the canonical-observation transformation. However, the linearly weighted
combination of rigid transformation matrices is not guaranteed to be rigid. As
a matter of fact, unexpected scale and shear factors often appear. In practice,
using LBS as the deformation model often leads to skin-collapsing artifacts
under bending or twisting motions. To solve this problem, we propose neural dual
quaternion blend skinning (NeuDBS) to achieve 3D point deformation, which can
perform rigid transformation without skin-collapsing artifacts. In the endeavor
to register 2D pixels across different frames, we establish a correspondence
between canonical feature embeddings that encode 3D points within the
canonical space and 2D image features by solving an optimal transport problem.
Besides, we introduce a texture filtering approach for texture rendering that
effectively minimizes the impact of noisy colors outside target deformable
objects. Extensive experiments on real and synthetic datasets show that our
approach can reconstruct 3D models for humans and animals with better
qualitative and quantitative performance than state-of-the-art methods.
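For background, classical dual quaternion blend skinning, the construction NeuDBS builds on, blends per-bone dual quaternions and renormalizes so the result remains a rigid transform; the sketch below shows that classical step only, with NeuDBS's neural components omitted:

```python
# Classical dual quaternion blend skinning (DQB) for a single point.
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def dqb_transform(point, dual_quats, weights):
    """point: (3,); dual_quats: list of (real, dual) quaternion pairs, one
    per bone; weights: per-bone skinning weights for this point."""
    real = sum(w * dq[0] for w, dq in zip(weights, dual_quats))
    dual = sum(w * dq[1] for w, dq in zip(weights, dual_quats))
    n = np.linalg.norm(real)
    real, dual = real / n, dual / n      # normalizing keeps the blend rigid
    conj = real * np.array([1.0, -1.0, -1.0, -1.0])
    rotated = qmul(qmul(real, np.array([0.0, *point])), conj)[1:]
    translation = 2.0 * qmul(dual, conj)[1:]  # t = 2 * dual * conj(real)
    return rotated + translation
```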
SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection
Transformer-based methods have recently demonstrated superior performance for
monocular 3D object detection, which aims at predicting 3D attributes
from a single 2D image. Most existing transformer-based methods leverage both
visual and depth representations to explore valuable query points on objects,
and the quality of the learned query points has a great impact on detection
accuracy. Unfortunately, existing unsupervised attention mechanisms in
transformers are prone to generate low-quality query features due to inaccurate
receptive fields, especially on hard objects. To tackle this problem, this
paper proposes a novel Supervised Scale-aware Deformable Attention (SSDA) for
monocular 3D object detection. Specifically, SSDA presets several masks with
different scales and utilizes depth and visual features to adaptively learn a
scale-aware filter for object query augmentation. By imposing scale awareness,
SSDA can accurately predict the receptive field of an object query to
support robust query feature generation. In addition, SSDA is equipped with
a Weighted Scale Matching (WSM) loss to supervise scale prediction, which
presents more confident results as compared to the unsupervised attention
mechanisms. Extensive experiments on the KITTI benchmark demonstrate that SSDA
significantly improves the detection accuracy, especially on moderate and hard
objects, yielding state-of-the-art performance as compared to the existing
approaches. Our code will be made publicly available at
https://github.com/mikasa3lili/SSD-MonoDETR.
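A loose sketch of the scale-aware idea follows: attention responses computed under several preset mask scales are mixed by per-query weights predicted from depth and visual features; every name and shape here is an illustrative assumption rather than the paper's module:

```python
# Illustrative mixing of multi-scale attention maps by predicted weights.
import torch
import torch.nn as nn

class ScaleAwareMixer(nn.Module):
    def __init__(self, dim: int, num_scales: int = 3):
        super().__init__()
        self.scale_head = nn.Linear(2 * dim, num_scales)  # depth + visual feats

    def forward(self, attn_per_scale, depth_feat, visual_feat):
        """attn_per_scale: (S, Q, L) attention maps, one per preset mask scale;
        depth_feat, visual_feat: (Q, dim) per-query features."""
        w = self.scale_head(torch.cat([depth_feat, visual_feat], dim=-1))
        w = w.softmax(dim=-1)                        # (Q, S) per-query scale weights
        return torch.einsum("qs,sql->ql", w, attn_per_scale)
```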