Memory Consistency Guided Divide-and-Conquer Learning for Generalized Category Discovery
Generalized category discovery (GCD) addresses a more realistic and
challenging setting of semi-supervised learning, where category labels are
assigned to only a subset of the training samples. Previous methods
generally apply a naive contrastive learning or unsupervised clustering scheme
to all samples. However, they typically ignore the critical
information inherent in the historical predictions of the model being trained.
Specifically, we empirically reveal that a significant number of salient
unlabeled samples yield consistent historical predictions corresponding to
their ground truth category. From this observation, we propose a Memory
Consistency guided Divide-and-conquer Learning framework (MCDL). In this
framework, we introduce two memory banks to record the historical predictions of
unlabeled data, which are used to measure the credibility of each sample
in terms of its prediction consistency. Guided by this credibility, we
design a divide-and-conquer learning strategy to fully utilize the
discriminative information of unlabeled data while alleviating the negative
influence of noisy labels. Extensive experimental results on multiple
benchmarks demonstrate the generality and superiority of our method, where our
method outperforms state-of-the-art models by a large margin on both seen and
unseen classes in the generic image recognition and challenging semantic-shift
settings (e.g., a +8.4% gain on CUB and +8.1% on Stanford Cars).
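To make the memory-bank idea concrete, here is a minimal PyTorch sketch that records each unlabeled sample's per-epoch argmax prediction in a ring buffer and scores how consistent those historical predictions are; the buffer length, update rule, and all names are our assumptions, not the authors' code.

```python
import torch

class PredictionMemoryBank:
    """Illustrative sketch: track the last K predictions per sample and
    score prediction consistency as a credibility measure."""

    def __init__(self, num_samples: int, history_len: int = 5):
        # -1 marks "no prediction recorded yet".
        self.history = torch.full((num_samples, history_len), -1, dtype=torch.long)
        self.ptr = 0
        self.history_len = history_len

    def update(self, sample_ids: torch.Tensor, logits: torch.Tensor):
        # Record the current argmax prediction for these samples.
        self.history[sample_ids, self.ptr] = logits.argmax(dim=1)

    def step_epoch(self):
        # Advance the ring-buffer pointer once per epoch.
        self.ptr = (self.ptr + 1) % self.history_len

    def credibility(self) -> torch.Tensor:
        # Fraction of recorded epochs agreeing with the latest prediction.
        latest = self.history[:, (self.ptr - 1) % self.history_len].unsqueeze(1)
        valid = self.history >= 0
        agree = (self.history == latest) & valid
        return agree.sum(1).float() / valid.sum(1).clamp(min=1).float()
```

Under this sketch, samples with high credibility could be treated as clean pseudo-labeled data in the "divide" step, while inconsistent samples fall back to unsupervised objectives.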
Generalized Few-shot Semantic Segmentation
Training semantic segmentation models requires a large amount of finely
annotated data, making it hard to quickly adapt to novel classes for which
such data is unavailable. Few-Shot Segmentation (FS-Seg) tackles this problem, but under many
constraints. In this paper, we introduce a new benchmark, called Generalized
Few-Shot Semantic Segmentation (GFS-Seg), to analyze the generalization ability
of simultaneously segmenting the novel categories with very few examples and
the base categories with sufficient examples. This is the first study to show
that previous representative state-of-the-art FS-Seg methods fall short in
GFS-Seg, and that the performance discrepancy mainly comes from the constrained
setting of FS-Seg. To make GFS-Seg tractable, we set up a GFS-Seg baseline that
achieves decent performance without structural changes to the original model.
Then, since context is essential for semantic segmentation, we propose the
Context-Aware Prototype Learning (CAPL) that significantly improves performance
by 1) leveraging the co-occurrence prior knowledge from support samples, and 2)
dynamically enriching the classifier with contextual information, conditioned on
the content of each query image. Both contributions are experimentally
shown to have substantial practical merit. Extensive experiments on Pascal-VOC
and COCO manifest the effectiveness of CAPL, and CAPL generalizes well to
FS-Seg by achieving competitive performance. Code will be made publicly
available.
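As a rough illustration of context-aware prototype learning, the sketch below blends base-class prototypes with context pooled from support samples before cosine-similarity scoring of query pixels; the fusion rule, `alpha`, and all shapes are illustrative assumptions rather than CAPL's exact formulation.

```python
import torch
import torch.nn.functional as F

def capl_style_logits(feat, base_protos, novel_protos, support_ctx, alpha=0.5):
    """Illustrative prototype classifier in the spirit of CAPL.

    feat:         (B, C, H, W) query features
    base_protos:  (Nb, C) classifier weights for base classes
    novel_protos: (Nn, C) prototypes averaged from few-shot support masks
    support_ctx:  (Nb, C) co-occurring context pooled from support images
    """
    # Enrich base prototypes with support-derived context.
    enriched_base = alpha * base_protos + (1 - alpha) * support_ctx
    protos = torch.cat([enriched_base, novel_protos], dim=0)   # (Nb+Nn, C)
    # Cosine-similarity scoring per pixel.
    feat = F.normalize(feat, dim=1)
    protos = F.normalize(protos, dim=1)
    return torch.einsum("bchw,nc->bnhw", feat, protos)
```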
Region Refinement Network for Salient Object Detection
Although intensively studied, salient object detection still suffers from
false predictions and unclear boundaries. In this paper, we propose a Region
Refinement Network (RRN), which recurrently filters redundant information and
explicitly models boundary information for saliency detection. Different from
existing refinement methods, we propose a Region Refinement Module (RRM) that
optimizes salient region prediction by incorporating supervised attention masks
in the intermediate refinement stages. The module only brings a minor increase
in model size and yet significantly reduces false predictions from the
background. To further refine boundary areas, we propose a Boundary Refinement
Loss (BRL) that adds extra supervision for better distinguishing foreground
from background. BRL is parameter-free and easy to train. We further observe
that BRL helps retain the integrity in prediction by refining the boundary.
Extensive experiments on saliency detection datasets show that our refinement
module and loss bring significant improvement to the baseline and can be easily
applied to different frameworks. We also demonstrate that our proposed model
generalizes well to portrait segmentation and shadow detection tasks.
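To illustrate how a parameter-free boundary loss of this kind can be implemented, here is a minimal sketch that derives a thin band around the ground-truth boundary via morphological max-pooling and applies extra cross-entropy only there; the band width and weighting are our assumptions, not the paper's exact definition of BRL.

```python
import torch
import torch.nn.functional as F

def boundary_refinement_loss(pred, gt, band: int = 3, weight: float = 1.0):
    """Sketch of a parameter-free boundary supervision term.

    pred: (B, 1, H, W) predicted saliency logits
    gt:   (B, 1, H, W) float binary ground-truth masks in {0, 1}
    """
    # Morphological gradient via max-pooling: dilation minus erosion marks
    # a band roughly `band` pixels wide around the object boundary.
    k = 2 * band + 1
    dilated = F.max_pool2d(gt, k, stride=1, padding=band)
    eroded = 1.0 - F.max_pool2d(1.0 - gt, k, stride=1, padding=band)
    boundary = (dilated - eroded).clamp(0, 1)

    bce = F.binary_cross_entropy_with_logits(pred, gt, reduction="none")
    # Supervise only boundary pixels; normalize by their count.
    return weight * (bce * boundary).sum() / boundary.sum().clamp(min=1.0)
```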
GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding
Self-supervised 3D representation learning aims to learn effective
representations from large-scale unlabeled point clouds. Most existing
approaches adopt point discrimination as the pretext task, which treats
matched points in two distinct views as positive pairs and unmatched points as
negative pairs. However, this approach often results in semantically identical
points having dissimilar representations, leading to a high number of false
negatives and introducing a "semantic conflict" problem. To address this issue,
we propose GroupContrast, a novel approach that combines segment grouping and
semantic-aware contrastive learning. Segment grouping partitions points into
semantically meaningful regions, which enhances semantic coherence and provides
semantic guidance for the subsequent contrastive representation learning.
Semantic-aware contrastive learning augments the semantic information extracted
from segment grouping and helps to alleviate the issue of "semantic conflict".
We conducted extensive experiments on multiple 3D scene understanding tasks.
The results demonstrate that GroupContrast learns semantically meaningful
representations and achieves promising transfer learning performance.
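As a minimal sketch of how segment grouping can guide contrastive learning, the loss below keeps matched points across views as positives but removes same-segment pairs from the negative set, which is one way to avoid pushing apart semantically identical points; it is our illustration, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def group_contrastive_loss(z1, z2, seg_ids, tau: float = 0.07):
    """Segment-aware InfoNCE sketch (names/temperature are assumptions).

    z1, z2:  (N, D) point features from two augmented views, row-aligned
    seg_ids: (N,) segment assignment of each point from segment grouping
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                                   # (N, N)
    same_seg = seg_ids.unsqueeze(0) == seg_ids.unsqueeze(1)
    eye = torch.eye(len(z1), dtype=torch.bool, device=z1.device)
    # Exclude same-segment pairs (except the matched point itself) from the
    # negatives, so semantically identical points are never pushed apart.
    sim = sim.masked_fill(same_seg & ~eye, float("-inf"))
    return F.cross_entropy(sim, torch.arange(len(z1), device=z1.device))
```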
VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection
In recent years, transformer-based detectors have demonstrated remarkable
performance in 2D visual perception tasks. However, their performance in
multi-view 3D object detection remains inferior to that of state-of-the-art
(SOTA) convolutional neural network based detectors. In this work, we investigate
this issue from the perspective of bird's-eye-view (BEV) feature generation.
Specifically, we examine the BEV feature generation method employed by the
transformer-based SOTA, BEVFormer, and identify its two limitations: (i) it
only generates attention weights from BEV, which precludes the use of lidar
points for supervision, and (ii) it aggregates camera view features to the BEV
through deformable sampling, which only selects a small subset of features and
fails to exploit all information. To overcome these limitations, we propose a
novel BEV feature generation method, dual-view attention, which generates
attention weights from both the BEV and camera view. This method encodes all
camera features into the BEV feature. By combining dual-view attention with the
BEVFormer architecture, we build a new detector named VoxelFormer. Extensive
experiments are conducted on the nuScenes benchmark to verify the superiority
of dual-view attention and VoxelFormer. We observe that even when adopting only 3
encoders and 1 historical frame during training, VoxelFormer still outperforms
BEVFormer significantly. When trained in the same setting, VoxelFormer
surpasses BEVFormer by 4.9 NDS points. Code is available at:
https://github.com/Lizhuoling/VoxelFormer-public.git
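A minimal sketch of the dual-view idea follows: attention logits come from both the BEV query and the camera-view features, and every gathered camera feature contributes to the output rather than a sparsely sampled subset. The shapes, the pre-gathering of camera features per query, and the additive logit fusion are our simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualViewAttention(nn.Module):
    """Sketch: attention weights generated from BOTH the BEV query and the
    camera-view features; all projected camera features are aggregated."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_bev = nn.Linear(dim, 1)   # weight branch from the BEV query
        self.w_cam = nn.Linear(dim, 1)   # weight branch from camera features

    def forward(self, bev_query, cam_feats):
        # bev_query: (Q, C) BEV queries; cam_feats: (Q, M, C) camera features
        # pre-gathered for each query by the camera-to-BEV projection.
        logits = self.w_bev(bev_query).unsqueeze(1) + self.w_cam(cam_feats)
        attn = logits.softmax(dim=1)                  # (Q, M, 1)
        return (attn * cam_feats).sum(dim=1)          # (Q, C) BEV feature
```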
BT²: Backward-compatible Training with Basis Transformation
Modern retrieval systems often require recomputing the representation of
every piece of data in the gallery when updating to a better representation
model. This process is known as backfilling and can be especially costly in the
real world where the gallery often contains billions of samples. Recently,
researchers have proposed the idea of Backward Compatible Training (BCT) where
the new representation model can be trained with an auxiliary loss to make it
backward compatible with the old representation. In this way, the new
representation can be directly compared with the old representation, in
principle avoiding the need for any backfilling. However, follow-up work shows
that there is an inherent trade-off whereby a backward-compatible representation
model cannot simultaneously match the performance of the new model itself.
This paper reports our "not-so-surprising" finding that adding extra
dimensions to the representation can help here. However, we also found that
naively increasing the dimension of the representation did not work. To deal
with this, we propose Backward-compatible Training with a novel Basis
Transformation (BT²). A basis transformation (BT) is a learnable
set of parameters that applies an orthonormal transformation. Such a
transformation possesses an important property whereby the original information
contained in its input is retained in its output. We show in this paper how a
BT can be utilized to add only the necessary amount of additional dimensions.
We empirically verify the advantage of BT² over other state-of-the-art
methods in a wide range of settings. We then further extend BT² to other
challenging yet more practical settings, including significant change in model
architecture (CNN to Transformers), modality change, and even a series of
updates in the model architecture mimicking the evolution of deep learning
models.
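To make the basis-transformation idea concrete, here is a sketch using PyTorch's orthogonal parametrization: an orthonormal matrix is invertible, so applying it loses no information, and its transpose maps a transformed embedding back to the original coordinates. The zero-padding scheme and sizes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import orthogonal

class BasisTransformation(nn.Module):
    """Sketch: a learnable orthonormal map over an embedding whose
    dimensionality is extended by zero-padding before the rotation."""

    def __init__(self, old_dim: int, extra_dim: int):
        super().__init__()
        self.full = old_dim + extra_dim
        # Parametrize a square linear layer so its weight stays orthogonal
        # throughout training.
        self.rotate = orthogonal(nn.Linear(self.full, self.full, bias=False))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Zero-pad an old_dim embedding up to the full size, then rotate.
        if emb.shape[-1] < self.full:
            emb = F.pad(emb, (0, self.full - emb.shape[-1]))
        return self.rotate(emb)

    def invert(self, transformed: torch.Tensor) -> torch.Tensor:
        # The transpose of an orthonormal matrix is its inverse, so the
        # original coordinates can be recovered exactly (illustrative of
        # why no information is lost).
        return transformed @ self.rotate.weight
```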
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
Multimodal large language models (MLLMs) have emerged as a prominent area of
interest within the research community, given their proficiency in handling and
reasoning with non-textual data, including images and videos. This study seeks
to extend the application of MLLMs to the realm of autonomous driving by
introducing DriveGPT4, a novel interpretable end-to-end autonomous driving
system based on LLMs. Capable of processing multi-frame video inputs and
textual queries, DriveGPT4 facilitates the interpretation of vehicle actions,
offers pertinent reasoning, and effectively addresses a diverse range of
questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle
control signals in an end-to-end fashion. These advanced capabilities are
achieved through the utilization of a bespoke visual instruction tuning
dataset, specifically tailored for autonomous driving applications, in
conjunction with a mix-finetuning training strategy. DriveGPT4 represents the
pioneering effort to leverage LLMs for the development of an interpretable
end-to-end autonomous driving solution. Evaluations conducted on the BDD-X
dataset showcase the superior qualitative and quantitative performance of
DriveGPT4. Additionally, fine-tuning on domain-specific data enables
DriveGPT4 to yield comparable or even better results on autonomous driving
grounding than GPT4-V. The code and dataset will be publicly available.
The project page is available at
https://tonyxuqaq.github.io/projects/DriveGPT4
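As a purely hypothetical sketch of the interface such a system implies (multi-frame visual tokens plus a text query in, low-level control signals out), the snippet below projects per-frame features into an LLM's embedding space and regresses control values from the final hidden state. The module names, shapes, the HuggingFace-style `inputs_embeds` interface, and the regression head (DriveGPT4 may instead decode signals as text) are all our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VideoQADrivingHead(nn.Module):
    """Hypothetical end-to-end sketch: fuse multi-frame visual features
    with a text query for a causal LLM, then predict control signals."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096,
                 num_controls: int = 2):
        super().__init__()
        self.visual_proj = nn.Linear(vis_dim, llm_dim)        # frames -> LLM space
        self.control_head = nn.Linear(llm_dim, num_controls)  # e.g., speed, angle

    def forward(self, frame_feats, text_embeds, llm):
        # frame_feats: (B, T, vis_dim) features from a frozen video encoder
        # text_embeds: (B, L, llm_dim) embedded user-query tokens
        vis = self.visual_proj(frame_feats)            # (B, T, llm_dim)
        seq = torch.cat([vis, text_embeds], dim=1)     # multimodal sequence
        # Assumes a HuggingFace-style model accepting `inputs_embeds`.
        hidden = llm(inputs_embeds=seq).last_hidden_state
        return self.control_head(hidden[:, -1])        # (B, num_controls)
```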