MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds
The sensing process of large-scale LiDAR point clouds inevitably causes large
blind spots, i.e. regions not visible to the sensor. We demonstrate how these
inherent sampling properties can be effectively utilized for self-supervised
representation learning by designing a highly effective pre-training framework
that considerably reduces the need for tedious 3D annotations to train
state-of-the-art object detectors. Our Masked AutoEncoder for LiDAR point
clouds (MAELi) intuitively leverages the sparsity of LiDAR point clouds in both
the encoder and decoder during reconstruction. This results in more expressive
and useful initialization, which can be directly applied to downstream
perception tasks, such as 3D object detection or semantic segmentation for
autonomous driving. In a novel reconstruction approach, MAELi distinguishes
between empty and occluded space and employs a new masking strategy that
targets the LiDAR's inherent spherical projection. Thereby, without any ground
truth whatsoever and trained on single frames only, MAELi obtains an
understanding of the underlying 3D scene geometry and semantics. To demonstrate
the potential of MAELi, we pre-train backbones in an end-to-end manner and show
the effectiveness of our unsupervised pre-trained weights on the tasks of 3D
object detection and semantic segmentation.
Comment: 16 page
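To make the range-view masking idea concrete, the following sketch (illustrative only, not the authors' implementation; the (N, 3) point format, bin counts and mask ratio are assumptions) drops random cells of a spherical grid computed from the raw LiDAR returns:

import numpy as np

def spherical_mask(points, num_azimuth=360, num_elevation=32, mask_ratio=0.7, seed=0):
    # points: (N, 3) array of x, y, z coordinates in the sensor frame.
    # Returns (visible_points, masked_points).
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    azimuth = np.arctan2(y, x)                        # in [-pi, pi]
    elevation = np.arctan2(z, np.sqrt(x**2 + y**2))   # in [-pi/2, pi/2]

    # Discretize into a spherical (range-image-like) grid.
    az_bin = ((azimuth + np.pi) / (2 * np.pi) * num_azimuth).astype(int) % num_azimuth
    el_bin = np.clip(((elevation + np.pi / 2) / np.pi * num_elevation).astype(int),
                     0, num_elevation - 1)
    cell = az_bin * num_elevation + el_bin

    # Mask whole cells rather than individual points.
    rng = np.random.default_rng(seed)
    masked_cells = rng.random(num_azimuth * num_elevation) < mask_ratio
    is_masked = masked_cells[cell]
    return points[~is_masked], points[is_masked]

visible, hidden = spherical_mask(np.random.rand(1000, 3) * 50 - 25)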
GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data
Widely used LiDAR-based 3D object detectors often neglect fundamental
geometric information readily available from the object proposals in their
confidence estimation. This is mostly due to architectural design choices,
which were often adopted from the 2D image domain, where geometric context is
rarely available. In 3D, however, considering the object properties and its
surroundings in a holistic way is important to distinguish between true and
false positive detections, e.g. occluded pedestrians in a group. To address
this, we present GACE, an intuitive and highly efficient method to improve the
confidence estimation of a given black-box 3D object detector. We aggregate
geometric cues of detections and their spatial relationships, which enables us
to properly assess their plausibility and consequently, improve the confidence
estimation. This leads to consistent performance gains over a variety of
state-of-the-art detectors. Across all evaluated detectors, GACE proves to be
especially beneficial for the vulnerable road user classes, i.e. pedestrians
and cyclists.
Comment: ICCV 2023, code is available at https://github.com/dschinagl/gac
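As a rough illustration of the idea (a minimal sketch, not the GACE architecture; feature dimensions and the pooling scheme are assumptions), a small network can refine a black-box detector's confidence from per-detection geometric cues plus a pooled summary of the surrounding detections:

import torch
import torch.nn as nn

class ConfidenceRefiner(nn.Module):
    # Refines detector scores from geometric cues and neighborhood context.
    def __init__(self, geo_dim=8, hidden=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(geo_dim, hidden), nn.ReLU())
        self.refine = nn.Sequential(
            nn.Linear(2 * hidden + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, geo_feats, det_scores):
        # geo_feats:  (N, geo_dim) box size, distance to sensor, point count, etc.
        # det_scores: (N, 1) original confidences of the black-box detector.
        per_det = self.encode(geo_feats)              # (N, hidden)
        context = per_det.mean(dim=0, keepdim=True)   # crude pooling over all detections
        context = context.expand_as(per_det)
        return self.refine(torch.cat([per_det, context, det_scores], dim=-1))

refiner = ConfidenceRefiner()
new_scores = refiner(torch.randn(5, 8), torch.rand(5, 1))   # (5, 1) refined confidences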
TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification
Vision and Language Models (VLMs), such as CLIP, have enabled visual
recognition of a potentially unlimited set of categories described by text
prompts. However, for the best visual recognition performance, these models
still require tuning to better fit the data distributions of the downstream
tasks, in order to overcome the domain shift from the web-based pre-training
data. Recently, it has been shown that it is possible to effectively tune VLMs
without any paired data, and in particular to improve VLMs' visual recognition
performance using text-only training data generated by Large Language Models
(LLMs). In this paper, we dive deeper into this exciting text-only VLM training
approach and explore how it can be significantly improved by taking the
specifics of the downstream task into account when sampling text data from
LLMs. Compared to the SOTA text-only VLM training approach, we demonstrate up
to 8.4% improvement in (cross-)domain-specific adaptation, up to 8.7%
improvement in fine-grained recognition, and a 3.1% overall average improvement
in zero-shot classification over strong baselines.
Comment: Code is available at: https://github.com/jmiemirza/TA
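To give a flavor of what task-targeted sampling could look like (purely hypothetical helper names and templates, not the prompts used in the paper), the queries sent to the LLM can be conditioned on the downstream domain and on per-class attributes:

def build_targeted_queries(class_names, domain_hint, attributes_per_class=None):
    # Assemble task-adaptive queries for an LLM that will generate the
    # textual training instances for each category.
    attributes_per_class = attributes_per_class or {}
    queries = []
    for name in class_names:
        for attr in attributes_per_class.get(name, ["typical appearance"]):
            queries.append(
                f"Describe the {attr} of a {name} as it would appear in {domain_hint}."
            )
    return queries

queries = build_targeted_queries(
    class_names=["sparrow", "goldfinch"],
    domain_hint="close-up photos taken by birdwatchers",
    attributes_per_class={"sparrow": ["plumage", "beak shape"]},
)
# The generated descriptions are then embedded by the VLM's text encoder and
# used as text-only training data for the downstream classifier.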
LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
Recently, large-scale pre-trained Vision and Language (VL) models have set a
new state-of-the-art (SOTA) in zero-shot visual classification, enabling
open-vocabulary recognition of a potentially unlimited set of categories defined
as simple language prompts. However, despite these great advances, the
performance of these zero-shot classifiers still falls short of the results of
dedicated (closed category set) classifiers trained with supervised fine-tuning.
In this paper, we show, for the first time, how to reduce this gap
without any labels and without any paired VL data, using an unlabeled image
collection and a set of texts auto-generated using a Large Language Model (LLM)
describing the categories of interest and effectively substituting labeled
visual instances of those categories. Using our label-free approach, we are
able to attain significant performance improvements over the zero-shot
performance of the base VL model and other contemporary methods and baselines
on a wide variety of datasets, demonstrating absolute improvement of up to
11.7% (3.8% on average) in the label-free setting. Moreover, despite our
approach being label-free, we observe 1.3% average gains over leading few-shot
prompting baselines that do use 5-shot supervision.
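A minimal sketch of the text-only training idea (illustrative only; the random tensors below merely stand in for frozen VL-model text embeddings of the LLM-generated sentences, and all dimensions are assumptions):

import torch
import torch.nn as nn

num_classes, embed_dim = 10, 512
text_embeds = torch.randn(1000, embed_dim)        # stand-in for frozen text features
labels = torch.randint(0, num_classes, (1000,))   # class each generated sentence describes

classifier = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    logits = classifier(text_embeds)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Because the VL model aligns text and image features in a shared space, the same
# classifier can then be applied to embeddings of the unlabeled images (e.g. to
# pseudo-label them), without ever using labeled visual data.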
Sit Back and Relax: Learning to Drive Incrementally in All Weather Conditions
In autonomous driving scenarios, current object detection models show strong
performance when tested in clear weather. However, their performance
deteriorates significantly when tested in degrading weather conditions. In
addition, even when adapted to perform robustly in a sequence of different
weather conditions, they are often unable to perform well in all of them and
suffer from catastrophic forgetting. To efficiently mitigate forgetting, we
propose Domain-Incremental Learning through Activation Matching (DILAM), which
employs unsupervised feature alignment to adapt only the affine parameters of a
clear weather pre-trained network to different weather conditions. We propose
to store these affine parameters as a memory bank for each weather condition
and plug in their weather-specific parameters during driving (i.e. test time)
when the respective weather conditions are encountered. Our memory bank is
extremely lightweight, since affine parameters account for less than 2% of a
typical object detector's parameters. Furthermore, contrary to previous domain-incremental
learning approaches, we do not require the weather label when testing and
propose to automatically infer the weather condition by a majority voting
linear classifier.
Comment: Intelligent Vehicle Conference (oral presentation)
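A simple way to picture the memory bank (a sketch under assumed module types; only the scale and shift parameters of normalization layers are stored and swapped):

import torch.nn as nn

def extract_affine_params(model):
    # Snapshot only the affine (scale/shift) parameters of normalization layers.
    bank = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)) \
                and module.weight is not None:
            bank[name] = (module.weight.detach().clone(), module.bias.detach().clone())
    return bank

def load_affine_params(model, bank):
    # Plug the condition-specific affine parameters back in at test time.
    for name, module in model.named_modules():
        if name in bank:
            weight, bias = bank[name]
            module.weight.data.copy_(weight)
            module.bias.data.copy_(bias)

# memory_bank = {"fog": extract_affine_params(fog_adapted_detector), ...}
# load_affine_params(detector, memory_bank[predicted_weather])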
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Large-scale Vision-Language (VL) models have shown tremendous success in
aligning representations between visual and text modalities. This enables
remarkable progress in zero-shot recognition, image generation & editing, and
many other exciting tasks. However, VL models tend to over-represent objects
while paying much less attention to verbs, and require additional tuning on
video data for best zero-shot action recognition performance. While previous
work relied on large-scale, fully-annotated data, in this work we propose an
unsupervised approach. We adapt a VL model for zero-shot and few-shot action
recognition using a collection of unlabeled videos and an unpaired action
dictionary. Based on that, we leverage Large Language Models and VL models to
build a text bag for each unlabeled video via matching, text expansion and
captioning. We use those bags in a Multiple Instance Learning setup to adapt an
image-text backbone to video data. Although finetuned on unlabeled video data,
our resulting models demonstrate high transferability to numerous unseen
zero-shot downstream tasks, improving the base VL model performance by up to
14%, and even comparing favorably to fully supervised baselines in both
zero-shot and few-shot video recognition transfer. The code will be released
later at https://github.com/wlin-at/MAXI.
Comment: Accepted at ICCV 202
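A rough sketch of a Multiple Instance Learning objective over such text bags (illustrative; the MIL-NCE-style formulation and the tensor shapes are assumptions, not the paper's exact loss):

import torch

def mil_text_bag_loss(video_embeds, bag_text_embeds, temperature=0.07):
    # video_embeds:    (B, D)    L2-normalized video features
    # bag_text_embeds: (B, K, D) L2-normalized text features, K candidate texts per video
    B = video_embeds.shape[0]
    sims = torch.einsum("bd,ckd->bck", video_embeds, bag_text_embeds) / temperature
    # Positive mass: similarities of each video to its own bag of texts
    # (any member of the bag may be the true description).
    pos = torch.logsumexp(sims[torch.arange(B), torch.arange(B)], dim=-1)   # (B,)
    # Normalizer: similarities to all texts of all bags in the batch.
    denom = torch.logsumexp(sims.reshape(B, -1), dim=-1)                    # (B,)
    return (denom - pos).mean()

video = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
texts = torch.nn.functional.normalize(torch.randn(4, 6, 512), dim=-1)
loss = mil_text_bag_loss(video, texts)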
MATE: Masked Autoencoders are Online 3D Test-Time Learners
We propose MATE, the first Test-Time-Training (TTT) method designed for 3D
data. It makes deep networks trained for point cloud classification robust to
distribution shifts occurring in test data, which could not be anticipated
during training. Like existing TTT methods, which focused on classifying 2D
images in the presence of distribution shifts at test time, MATE also leverages
test data for adaptation. Its test-time objective is that of a Masked
Autoencoder: Each test point cloud has a large portion of its points removed
before it is fed to the network, which is tasked with reconstructing the full point
cloud. Once the network is updated, it is used to classify the point cloud. We
test MATE on several 3D object classification datasets and show that it
significantly improves robustness of deep networks to several types of
corruptions commonly occurring in 3D point clouds. Further, we show that MATE
is very efficient in terms of the fraction of points it needs for the
adaptation. It can effectively adapt given as little as 5% of the tokens of each test
sample, which reduces its memory footprint and makes it lightweight. We also
highlight that MATE achieves competitive performance by adapting sparingly on
the test data, which further reduces its computational overhead, making it
ideal for real-time applications.
Comment: Minor fix in citation
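The test-time loop can be pictured roughly as follows (a sketch only; encoder, decoder and classifier are placeholders for the pre-trained modules, and the plain MSE reconstruction term simplifies what would be a proper point-cloud reconstruction loss in practice):

import torch
import torch.nn.functional as F

def mask_points(points, keep_ratio=0.05):
    # Keep only a small random fraction of the points; the rest must be reconstructed.
    n = points.shape[0]
    idx = torch.randperm(n)[: max(1, int(n * keep_ratio))]
    return points[idx]

def adapt_and_classify(encoder, decoder, classifier, test_cloud, steps=1, lr=1e-4):
    optimizer = torch.optim.SGD(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(steps):
        visible = mask_points(test_cloud)
        reconstruction = decoder(encoder(visible))
        # Simplified objective: match the reconstruction to the full cloud.
        loss = F.mse_loss(reconstruction, test_cloud[: reconstruction.shape[0]])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Classify with the freshly adapted encoder.
    with torch.no_grad():
        return classifier(encoder(test_cloud)).argmax(dim=-1)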