Introspective Deep Metric Learning for Image Retrieval
This paper proposes an introspective deep metric learning (IDML) framework
for uncertainty-aware comparisons of images. Conventional deep metric learning
methods produce confident semantic distances between images regardless of the
uncertainty level. However, we argue that a good similarity model should
consider the semantic discrepancies with caution to better deal with ambiguous
images for more robust training. To achieve this, we propose to represent an
image using not only a semantic embedding but also an accompanying uncertainty
embedding, which describes the semantic characteristics and ambiguity of an
image, respectively. We further propose an introspective similarity metric to
make similarity judgments between images considering both their semantic
differences and ambiguities. The proposed IDML framework improves the
performance of deep metric learning through uncertainty modeling and attains
state-of-the-art results on the widely used CUB-200-2011, Cars196, and Stanford
Online Products datasets for image retrieval and clustering. We further provide
an in-depth analysis of our framework to demonstrate the effectiveness and
reliability of IDML. Code is available at: https://github.com/wzzheng/IDML.
Comment: The extended version of this paper is accepted to T-PAMI.
LiDAR-HMR: 3D Human Mesh Recovery from LiDAR
In recent years, point cloud perception tasks have been garnering increasing
attention. This paper presents the first attempt to estimate 3D human body mesh
from sparse LiDAR point clouds. We found that the major challenge in estimating
human pose and mesh from point clouds lies in the sparsity, noise, and
incompleteness of LiDAR point clouds. Facing these challenges, we propose an
effective sparse-to-dense reconstruction scheme to reconstruct 3D human mesh.
This involves estimating a sparse representation of a human (3D human pose) and
gradually reconstructing the body mesh. To better leverage the 3D structural
information of point clouds, we employ a cascaded graph transformer
(graphormer) to introduce point cloud features during sparse-to-dense
reconstruction. Experimental results on three publicly available databases
demonstrate the effectiveness of the proposed approach. Code:
https://github.com/soullessrobot/LiDAR-HMR/
SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
3D occupancy prediction is an important task for the robustness of
vision-centric autonomous driving, which aims to predict whether each point is
occupied in the surrounding 3D space. Existing methods usually require 3D
occupancy labels to produce meaningful results. However, it is very laborious
to annotate the occupancy status of each voxel. In this paper, we propose
SelfOcc to explore a self-supervised way to learn 3D occupancy using only video
sequences. We first transform the images into the 3D space (e.g., bird's eye
view) to obtain 3D representation of the scene. We directly impose constraints
on the 3D representations by treating them as signed distance fields. We can
then render 2D images of previous and future frames as self-supervision signals
to learn the 3D representations. We propose an MVS-embedded strategy to
directly optimize the SDF-induced weights with multiple depth proposals. Our
SelfOcc outperforms the previous best method SceneRF by 58.7% using a single
frame as input on SemanticKITTI and is the first self-supervised work that
produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc
produces high-quality depth and achieves state-of-the-art results on novel
depth synthesis, monocular depth estimation, and surround-view depth estimation
on SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code:
https://github.com/huang-yh/SelfOcc.
Modified Instantaneous Power Control with Phase Compensation and Current-limited Function under Unbalanced Grid Faults
OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions
The pretrain-finetune paradigm in modern computer vision facilitates the
success of self-supervised learning, which tends to achieve better
transferability than supervised learning. However, with the availability of
massive labeled data, a natural question emerges: how to train a better model
with both self and full supervision signals? In this paper, we propose
Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA)
as a solution. We provide a unified perspective of supervisions from labeled
and unlabeled data and propose a unified framework of fully supervised and
self-supervised learning. We extract a set of hierarchical proxy
representations for each image and impose self and full supervisions on the
corresponding proxy representations. Extensive experiments on both
convolutional neural networks and vision transformers demonstrate the
superiority of OPERA in image classification, segmentation, and object
detection. Code is available at: https://github.com/wangck20/OPERA.
Exploring Unified Perspective For Fast Shapley Value Estimation
Shapley values have emerged as a widely accepted and trustworthy tool,
grounded in theoretical axioms, for addressing challenges posed by black-box
models like deep neural networks. However, computing Shapley values encounters
exponential complexity in the number of features. Various approaches, including
ApproSemivalue, KernelSHAP, and FastSHAP, have been explored to expedite the
computation. We analyze the consistency of existing works and conclude that
stochastic estimators can be unified as the linear transformation of importance
sampling of feature subsets. Based on this, we investigate the possibility of
designing simple amortized estimators and propose a straightforward and
efficient one, SimSHAP, by eliminating redundant techniques. Extensive
experiments conducted on tabular and image datasets validate the effectiveness
of our SimSHAP, which significantly accelerates the computation of accurate
Shapley values.
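To make the exponential cost mentioned in this abstract concrete, the following is a minimal sketch that computes exact Shapley values by enumerating all feature subsets and compares them with a plain permutation-sampling estimator. It is not SimSHAP or any of the estimators named in the abstract; the function names and the toy value function are assumptions made only for illustration.

```python
import itertools
import math
import random


def exact_shapley(value, n):
    """Exact Shapley values by subset enumeration (cost grows as 2^n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight of a coalition of size k that excludes feature i.
            weight = (math.factorial(k) * math.factorial(n - k - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def sampled_shapley(value, n, num_permutations, seed=0):
    """Monte Carlo estimate: average marginal gains over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for i in order:
            coalition = coalition | {i}
            current = value(coalition)
            phi[i] += current - previous
            previous = current
    return [p / num_permutations for p in phi]


def toy_value(subset):
    """Hypothetical game: additive weights plus a bonus when 0 and 1 co-occur."""
    weights = [1.0, 2.0, 3.0]
    bonus = 4.0 if {0, 1} <= subset else 0.0
    return sum(weights[i] for i in subset) + bonus


if __name__ == "__main__":
    print(exact_shapley(toy_value, 3))
    print(sampled_shapley(toy_value, 3, num_permutations=200))
```

In this toy game the interaction bonus is split evenly between features 0 and 1, and both estimators attribute the full value of the complete feature set across the three features. The exact routine evaluates the value function on every subset, which is the cost that sampling-based and amortized estimators aim to avoid.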
Token-Label Alignment for Vision Transformers
Data mixing strategies (e.g., CutMix) have shown the ability to greatly
improve the performance of convolutional neural networks (CNNs). They mix two
images as inputs for training and assign them a mixed label with the same
ratio. Although they have been shown to be effective for vision transformers
(ViTs), we identify a token fluctuation phenomenon that has suppressed the
potential of data mixing strategies. We empirically observe that the
contributions of input tokens fluctuate during forward propagation, which
might induce a different mixing
ratio in the output tokens. The training target computed by the original data
mixing strategy can thus be inaccurate, resulting in less effective training.
To address this, we propose a token-label alignment (TL-Align) method to trace
the correspondence between transformed tokens and the original tokens to
maintain a label for each token. We reuse the computed attention at each layer
for efficient token-label alignment, introducing only negligible additional
training costs. Extensive experiments demonstrate that our method improves the
performance of ViTs on image classification, semantic segmentation, object
detection, and transfer learning tasks. Code is available at:
https://github.com/Euphoria16/TL-Align.
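As a rough illustration of the input-level mixing this abstract starts from, the sketch below pastes a rectangular region of one image onto another and mixes the one-hot labels with the same area ratio. It shows only the baseline CutMix-style target, not the TL-Align procedure; the function name, the nested-list image format, and the box convention are assumptions.

```python
def cutmix(image_a, image_b, label_a, label_b, box):
    """Paste the box region of image_b onto image_a and mix labels by area.

    Images are equally sized 2D lists; box is (top, left, bottom, right)
    with exclusive bottom/right bounds (an assumed convention).
    """
    top, left, bottom, right = box
    height, width = len(image_a), len(image_a[0])

    # Copy image_a so the inputs are not modified in place.
    mixed = [row[:] for row in image_a]
    for r in range(top, bottom):
        for c in range(left, right):
            mixed[r][c] = image_b[r][c]

    # lam is the fraction of pixels that still come from image_a.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    mixed_label = [lam * a + (1.0 - lam) * b
                   for a, b in zip(label_a, label_b)]
    return mixed, mixed_label, lam


if __name__ == "__main__":
    a = [[0] * 4 for _ in range(4)]
    b = [[1] * 4 for _ in range(4)]
    _, label, lam = cutmix(a, b, [1.0, 0.0], [0.0, 1.0], (0, 0, 2, 2))
    print(lam, label)
```

The mixed label here depends only on pixel area in the input. The abstract's observation is that the effective ratio among output tokens may differ from this value, which is why a per-token label is tracked through the network instead.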
PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction
Semantic segmentation in autonomous driving has been undergoing an evolution
from sparse point segmentation to dense voxel segmentation, where the objective
is to predict the semantic occupancy of each voxel in the concerned 3D space.
The dense nature of the prediction space has rendered existing efficient
2D-projection-based methods (e.g., bird's eye view, range view, etc.)
ineffective, as they can only describe a subspace of the 3D scene. To address
this, we propose a cylindrical tri-perspective view to represent point clouds
effectively and comprehensively and a PointOcc model to process them
efficiently. Considering the distance distribution of LiDAR point clouds, we
construct the tri-perspective view in the cylindrical coordinate system for
more fine-grained modeling of nearer areas. We employ spatial group pooling to
maintain structural details during projection and adopt 2D backbones to
efficiently process each TPV plane. Finally, we obtain the features of each
point by aggregating its projected features on each of the processed TPV planes
without the need for any post-processing. Extensive experiments on both 3D
occupancy prediction and LiDAR segmentation benchmarks demonstrate that the
proposed PointOcc achieves state-of-the-art performance with much faster speed.
Specifically, despite only using LiDAR, PointOcc significantly outperforms all
other methods, including multi-modal methods, by a large margin on the
OpenOccupancy benchmark. Code: https://github.com/wzzheng/PointOcc.
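The cylindrical tri-perspective idea in this abstract can be sketched as a coordinate conversion followed by binning onto three planes. The code below is only an illustration under assumed grid ranges and plane names; it performs plain index binning and does not implement the spatial group pooling or the 2D backbones of PointOcc.

```python
import math


def to_cylindrical(x, y, z):
    """Convert Cartesian (x, y, z) to cylindrical (rho, theta, z)."""
    return math.hypot(x, y), math.atan2(y, x), z


def tpv_cells(point, rho_max, z_min, z_max, bins):
    """Return the cell index of a point on each of three cylindrical planes.

    bins is (n_rho, n_theta, n_z); plane names are hypothetical labels.
    """
    rho, theta, z = to_cylindrical(*point)
    n_rho, n_theta, n_z = bins

    def clamp(index, size):
        # Keep out-of-range points in the border cells.
        return max(0, min(index, size - 1))

    i = clamp(int(rho / rho_max * n_rho), n_rho)
    j = clamp(int((theta + math.pi) / (2 * math.pi) * n_theta), n_theta)
    k = clamp(int((z - z_min) / (z_max - z_min) * n_z), n_z)
    return {"rho_theta": (i, j), "theta_z": (j, k), "rho_z": (i, k)}


def arc_width(rho, n_theta):
    """Arc length covered by one angular bin at radius rho."""
    return rho * 2 * math.pi / n_theta


if __name__ == "__main__":
    print(tpv_cells((3.0, 4.0, 0.0), rho_max=10.0, z_min=-2.0, z_max=2.0,
                    bins=(10, 8, 4)))
    print(arc_width(1.0, 8), arc_width(10.0, 8))
```

Because an angular bin spans an arc proportional to the radius, cells close to the sensor cover less area than distant ones, which matches the abstract's motivation of finer modeling for nearer regions.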