LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment
3D panoptic segmentation is a challenging perception task that requires both
semantic segmentation and instance segmentation. In this task, we notice that
images could provide rich texture, color, and discriminative information, which
can complement LiDAR data for evident performance improvement, but their fusion
remains a challenging problem. To this end, we propose LCPS, the first
LiDAR-Camera Panoptic Segmentation network. In our approach, we conduct
LiDAR-Camera fusion in three stages: 1) an Asynchronous Compensation Pixel
Alignment (ACPA) module that calibrates the coordinate misalignment caused by
asynchronous problems between sensors; 2) a Semantic-Aware Region Alignment
(SARA) module that extends the one-to-one point-pixel mapping to one-to-many
semantic relations; 3) a Point-to-Voxel feature Propagation (PVP) module that
integrates both geometric and semantic fusion information for the entire point
cloud. Our fusion strategy improves PQ by about 6.9% over the LiDAR-only
baseline on the NuScenes dataset. Extensive quantitative and qualitative
experiments further demonstrate the effectiveness of our novel framework. The
code will be released at https://github.com/zhangzw12319/lcps.git.
Comment: Accepted as an ICCV 2023 paper.
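As a rough illustration of the one-to-one point-pixel mapping that the ACPA and SARA modules build on, the sketch below projects LiDAR points into a camera image using the extrinsic and intrinsic calibration; the function name and interface are illustrative assumptions, not the released LCPS code.

```python
import numpy as np

def project_points_to_image(points_lidar, T_cam_from_lidar, K, image_hw):
    """Project LiDAR points into a camera image (illustrative sketch only).

    points_lidar: (N, 3) xyz in the LiDAR frame.
    T_cam_from_lidar: (4, 4) extrinsic transform LiDAR -> camera.
    K: (3, 3) camera intrinsics. image_hw: (height, width).
    Returns pixel coords (M, 2) and the indices of points that land in the image.
    """
    # Homogeneous coordinates, then transform into the camera frame.
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 1e-3
    pts_cam = pts_cam[in_front]

    # Perspective projection with the intrinsics.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    h, w = image_hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    valid_idx = np.flatnonzero(in_front)[in_image]
    return uv[in_image], valid_idx
```

In LCPS this basic mapping is first corrected for sensor asynchrony (ACPA) and then widened into one-to-many semantic regions (SARA) before point-to-voxel propagation.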
Farewell to Mutual Information: Variational Distillation for Cross-Modal Person Re-Identification
The Information Bottleneck (IB) provides an information-theoretic principle
for representation learning: retain all information relevant for predicting
the label while minimizing redundancy. Although the IB principle has been
applied to a wide range of applications, its optimization remains challenging
because it relies heavily on accurate estimation of mutual information.
In this paper, we present a new strategy, Variational Self-Distillation (VSD),
which provides a scalable, flexible, and analytical solution that essentially
fits the mutual information without explicitly estimating it. Under a rigorous
theoretical guarantee, VSD enables the IB to capture the intrinsic
correlation between representation and label for supervised training.
Furthermore, by extending VSD to multi-view learning, we introduce two other
strategies, Variational Cross-Distillation (VCD) and Variational
Mutual-Learning (VML), which significantly improve the robustness of
representations to view changes by eliminating view-specific and task-irrelevant
information. To verify our theoretically grounded strategies, we apply our
approaches to cross-modal person Re-ID, and conduct extensive experiments,
where superior performance over state-of-the-art methods is demonstrated. Our
intriguing findings highlight the need to rethink the way mutual information
is estimated.
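For reference, the standard IB objective the abstract builds on can be written as below; the notation is generic (X input, Y label, Z learned representation, beta a trade-off weight) and this is the textbook formulation, not the paper's VSD objective.

```latex
% Information Bottleneck: compress the input X while preserving
% information about the label Y; \beta trades compression for prediction.
\min_{p(z \mid x)} \; I(Z; X) \;-\; \beta \, I(Z; Y)
```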
Image Understands Point Cloud: Weakly Supervised 3D Semantic Segmentation via Association Learning
Weakly supervised point cloud semantic segmentation methods, which require 1%
or fewer labels while aiming to match the performance of fully supervised
approaches, have recently attracted extensive research attention. A typical
solution in this framework is to use self-training or pseudo-labeling to mine
supervision from the point cloud itself, while ignoring the critical
information from images. In fact, cameras are widely present in LiDAR
scenarios, and this complementary information seems to be of great importance for
3D applications. In this paper, we propose a novel cross-modality weakly
supervised method for 3D segmentation, incorporating complementary information
from unlabeled images. Basically, we design a dual-branch network equipped with
an active labeling strategy to maximize the value of a tiny fraction of labels and
directly realize 2D-to-3D knowledge transfer. Afterwards, we establish a
cross-modal self-training framework from an Expectation-Maximization (EM)
perspective, which iterates between pseudo-label estimation and parameter
updating. In the M-step, we propose cross-modal association learning to mine complementary
supervision from images by reinforcing the cycle-consistency between 3D points
and 2D superpixels. In the E-step, a pseudo-label self-rectification mechanism
is derived to filter noisy labels, thus providing more accurate labels so the
networks can be fully trained. Extensive experimental results demonstrate
that our method even outperforms state-of-the-art fully supervised
competitors with less than 1% actively selected annotations.
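A minimal, runnable stand-in for the E-step's pseudo-label self-rectification might look like the sketch below: fuse per-point class probabilities from the 3D branch and the projected 2D branch, then keep only confident, cross-modally consistent labels. The averaging rule, threshold, and function name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rectify_pseudo_labels(probs_3d, probs_2d, conf_thresh=0.9):
    """Fuse per-point class probabilities from the 3D and (projected) 2D branches
    and keep only confident, cross-modally consistent pseudo labels (sketch).

    probs_3d, probs_2d: (N, C) softmax outputs for the same N points.
    Returns labels (N,) with -1 marking points filtered out as noisy.
    """
    fused = 0.5 * (probs_3d + probs_2d)               # simple average fusion (assumption)
    labels = fused.argmax(axis=1)
    confident = fused.max(axis=1) >= conf_thresh       # confidence filtering
    consistent = probs_3d.argmax(axis=1) == probs_2d.argmax(axis=1)  # 2D/3D agreement
    labels[~(confident & consistent)] = -1             # -1 = ignore during training
    return labels
```

Points marked -1 would simply be ignored in the M-step loss, so only the rectified pseudo labels supervise the next training round.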
Class-Imbalanced Semi-Supervised Learning for Large-Scale Point Cloud Semantic Segmentation via Decoupling Optimization
Semi-supervised learning (SSL), thanks to the significant reduction of data
annotation costs, has been an active research topic for large-scale 3D scene
understanding. However, the existing SSL-based methods suffer from severe
training bias, mainly due to class imbalance and long-tail distributions of the
point cloud data. As a result, they produce biased predictions when segmenting
tail classes. In this paper, we introduce a new decoupling optimization
framework, which disentangles feature representation learning and classifier
training in an alternating optimization manner to effectively shift the biased
decision boundary. In particular, we first employ two-round pseudo-label generation
to select unlabeled points across head-to-tail classes. We further introduce
multi-class imbalanced focus loss to adaptively pay more attention to feature
learning across head-to-tail classes. We fix the backbone parameters after
feature learning and retrain the classifier using ground-truth points to update
its parameters. Extensive experiments demonstrate the effectiveness of our
method, which outperforms previous state-of-the-art methods on both indoor and
outdoor 3D point cloud datasets (i.e., S3DIS, ScanNet-V2, Semantic3D, and
SemanticKITTI) under the 1% and 1pt evaluation settings.
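One plausible reading of the "multi-class imbalanced focus loss" is a focal loss with inverse-frequency class weights; the PyTorch sketch below follows that reading as an assumption, and is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def imbalanced_focal_loss(logits, targets, class_freq, gamma=2.0, ignore_index=-1):
    """Class-weighted focal loss for imbalanced point labels (illustrative sketch).

    logits: (N, C) raw scores per point; targets: (N,) class ids, ignore_index = unlabeled;
    class_freq: (C,) labeled-point counts per class.
    """
    # Rarer classes get larger weights; sqrt softens the imbalance (assumption).
    weights = 1.0 / torch.sqrt(class_freq.float().clamp(min=1.0))
    weights = weights / weights.sum() * len(weights)

    log_probs = F.log_softmax(logits, dim=1)
    mask = targets != ignore_index
    safe_targets = targets.clamp(min=0)                                # avoid negative indexing
    log_pt = log_probs.gather(1, safe_targets.unsqueeze(1)).squeeze(1)  # log p(true class)
    focal = (1.0 - log_pt.exp()) ** gamma                               # down-weight easy points
    ce = -weights[safe_targets] * log_pt                                # class-weighted CE
    return (focal * ce)[mask].mean()
```

In the decoupled second stage described above, a loss like this would drive feature learning, after which the backbone is frozen and only the classifier is retrained on ground-truth points.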
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
The autonomous driving community has shown significant interest in 3D
occupancy prediction, driven by its exceptional geometric perception and
general object recognition capabilities. To achieve this, current works try to
construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation
extending from Bird's-Eye-View perception. However, compressed views like the
TPV representation lose 3D geometry information, while the raw and sparse OCC
representation incurs heavy yet redundant computational costs. To address the
above limitations, we propose Compact Occupancy TRansformer (COTR), with a
geometry-aware occupancy encoder and a semantic-aware group decoder to
reconstruct a compact 3D OCC representation. The occupancy encoder first
generates a compact geometrical OCC feature through efficient explicit-implicit
view transformation. Then, the occupancy decoder further enhances the semantic
discriminability of the compact OCC representation by a coarse-to-fine semantic
grouping strategy. Empirical experiments show evident performance gains across
multiple baselines; e.g., COTR outperforms them with a relative improvement of
8%-15%, demonstrating the superiority of our method.
Comment: CVPR 2024. Code is available at https://github.com/NotACracker/COT
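To make the notion of a "compact OCC representation" concrete, the toy sketch below coarsens a dense voxel feature volume and keeps only likely-occupied voxels; the pooling scheme, threshold, and function name are assumptions for illustration, not COTR's geometry-aware encoder.

```python
import torch
import torch.nn.functional as F

def compact_occupancy(voxel_feats, occ_prob, scale=2, keep_thresh=0.3):
    """Compress a dense voxel feature volume into a sparse, compact set (sketch).

    voxel_feats: (C, X, Y, Z) features; occ_prob: (X, Y, Z) occupancy probability.
    Returns coordinates (M, 3) and features (M, C) of the kept coarse voxels.
    """
    # Coarsen the grid: average-pool features, max-pool occupancy probability.
    feats = F.avg_pool3d(voxel_feats.unsqueeze(0), scale).squeeze(0)
    prob = F.max_pool3d(occ_prob[None, None], scale)[0, 0]

    # Keep only voxels that are likely occupied -> a compact representation.
    keep = prob > keep_thresh
    coords = keep.nonzero(as_tuple=False)      # (M, 3) coarse voxel indices
    kept_feats = feats[:, keep].T              # (M, C) features of kept voxels
    return coords, kept_feats
```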
Instance-Aware Domain Generalization for Face Anti-Spoofing
Face anti-spoofing (FAS) based on domain generalization (DG) has recently been
studied to improve generalization to unseen scenarios. Previous
methods typically rely on domain labels to align the distribution of each
domain for learning domain-invariant representations. However, artificial
domain labels are coarse-grained and subjective, which cannot reflect real
domain distributions accurately. Besides, such domain-aware methods focus on
domain-level alignment, which is not fine-grained enough to ensure that learned
representations are insensitive to domain styles. To address these issues, we
propose a novel perspective for DG FAS that aligns features on the instance
level without the need for domain labels. Specifically, an Instance-Aware Domain
Generalization (IADG) framework is proposed to learn generalizable features by
weakening their sensitivity to instance-specific styles. Concretely, we
propose Asymmetric Instance Adaptive Whitening to adaptively eliminate the
style-sensitive feature correlation, boosting the generalization. Moreover,
Dynamic Kernel Generator and Categorical Style Assembly are proposed to first
extract the instance-specific features and then generate the style-diversified
features with large style shifts, respectively, further facilitating the
learning of style-insensitive features. Extensive experiments and analysis
demonstrate the superiority of our method over state-of-the-art competitors.
Code will be publicly available at https://github.com/qianyuzqy/IADG.
Comment: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2023.
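A generic, simplified version of instance-level whitening (without the asymmetric and adaptive parts the paper adds) can be sketched as a loss that pushes each sample's channel correlation matrix toward the identity; the code below is such an assumption-laden sketch, not the IADG implementation.

```python
import torch

def instance_whitening_loss(feat):
    """Penalize off-diagonal channel correlations per instance (generic sketch).

    feat: (B, C, H, W) feature maps. For each sample, build the channel
    covariance over spatial positions and push it toward the identity.
    """
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h * w)
    x = x - x.mean(dim=2, keepdim=True)                    # center per channel
    cov = torch.bmm(x, x.transpose(1, 2)) / (h * w - 1)     # (B, C, C) covariance
    # Normalize to correlations so the loss is scale-invariant.
    std = torch.sqrt(torch.diagonal(cov, dim1=1, dim2=2).clamp(min=1e-6))
    corr = cov / (std.unsqueeze(2) * std.unsqueeze(1))
    off_diag = corr - torch.eye(c, device=feat.device)      # style-sensitive correlations
    return (off_diag ** 2).mean()
```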
MotionMaster: Training-free Camera Motion Transfer For Video Generation
The emergence of diffusion models has greatly propelled the progress in image
and video generation. Recently, some efforts have been made in controllable
video generation, including text-to-video generation and video motion control,
among which camera motion control is an important topic. However, existing
camera motion control methods rely on training a temporal camera module, and
necessitate substantial computational resources due to the large number of
parameters in video generation models. Moreover, existing methods pre-define
camera motion types during training, which limits their flexibility in camera
control. Therefore, to reduce training costs and achieve flexible camera
control, we propose COMD, a novel training-free video motion transfer model,
which disentangles camera motions and object motions in source videos and
transfers the extracted camera motions to new videos. We first propose a
one-shot camera motion disentanglement method to extract camera motion from a
single source video, which separates the moving objects from the background and
estimates the camera motion in the moving objects region based on the motion in
the background by solving a Poisson equation. Furthermore, we propose a
few-shot camera motion disentanglement method to extract the common camera
motion from multiple videos with similar camera motions, which employs a
window-based clustering technique to extract the common features in temporal
attention maps of multiple videos. Finally, we propose a motion combination
method to combine different types of camera motion, giving our model more
controllable and flexible camera control. Extensive experiments
demonstrate that our training-free approach can effectively decouple
camera-object motion and apply the decoupled camera motion to a wide range of
controllable video generation tasks, achieving flexible and diverse camera
motion control.
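To illustrate the background-to-foreground propagation idea, the toy sketch below fills the unknown (moving-object) region of a motion field by harmonic interpolation from the surrounding background, i.e., a crude Jacobi-style solve of the Laplace/Poisson equation over pixels; the paper instead operates on temporal attention maps, so the data structures and names here are assumptions.

```python
import numpy as np

def inpaint_motion_field(motion, mask, n_iters=500):
    """Fill camera motion inside a masked (moving-object) region by harmonic
    interpolation from the surrounding background (simple Jacobi solver).

    motion: (H, W, 2) per-pixel motion estimated on the background;
    mask: (H, W) bool, True where the motion is unknown (moving objects).
    """
    filled = motion.copy()
    filled[mask] = 0.0                                      # initial guess inside the hole
    for _ in range(n_iters):
        # Average of the four neighbours (Laplace equation, Dirichlet boundary).
        # np.roll wraps at the image border, a simplification for this sketch.
        up    = np.roll(filled,  1, axis=0)
        down  = np.roll(filled, -1, axis=0)
        left  = np.roll(filled,  1, axis=1)
        right = np.roll(filled, -1, axis=1)
        avg = 0.25 * (up + down + left + right)
        filled[mask] = avg[mask]                             # update only unknown pixels
    return filled
```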
