TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification
Vision and Language Models (VLMs), such as CLIP, have enabled visual
recognition of a potentially unlimited set of categories described by text
prompts. However, for the best visual recognition performance, these models
still require tuning to better fit the data distributions of the downstream
tasks, in order to overcome the domain shift from the web-based pre-training
data. Recently, it has been shown that it is possible to effectively tune VLMs
without any paired data, and in particular to improve VLMs' visual recognition
performance using text-only training data generated by Large Language Models
(LLMs). In this paper, we dive deeper into this text-only VLM training approach
and explore ways it can be further improved by taking the specifics of the
downstream task into account when sampling text data from LLMs. Compared to the
SOTA text-only VLM training approach and other strong baselines, we demonstrate
up to 8.4% improvement in (cross-)domain-specific adaptation, up to 8.7%
improvement in fine-grained recognition, and a 3.1% overall average improvement
in zero-shot classification.
Comment: Code is available at: https://github.com/jmiemirza/TA
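The core mechanism behind this family of text-only training methods can be illustrated with a short sketch: encode LLM-generated class descriptions with CLIP's text encoder and train a linear classifier on those text embeddings alone; because CLIP aligns images and text in a shared embedding space, the same classifier can then be applied to image embeddings at test time. This is a minimal sketch assuming the OpenAI `clip` package; the `llm_texts` dictionary and the task-targeted prompt in the comment are illustrative, not taken from the TAP code.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# llm_texts[c]: LLM-generated sentences describing class c, e.g. sampled with
# a task-targeted prompt such as
# "Describe how a {class name} looks in an aerial photograph."
llm_texts = {
    0: ["an aerial photo of a long paved runway with painted markings",
        "a straight airstrip crossing green fields, seen from above"],
    1: ["an aerial photo of a baseball diamond with a dirt infield",
        "a fan-shaped sports field with white base lines, seen from above"],
}

# Encode every sentence once; each text inherits the label of the class
# it describes.
feats, labels = [], []
with torch.no_grad():
    for c, sents in llm_texts.items():
        tokens = clip.tokenize(sents).to(device)
        f = model.encode_text(tokens).float()
        feats.append(F.normalize(f, dim=-1))
        labels += [c] * len(sents)
feats = torch.cat(feats)
labels = torch.tensor(labels, device=device)

# Train a linear head on text embeddings only: no images, no image labels.
head = torch.nn.Linear(feats.shape[1], len(llm_texts)).to(device)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    F.cross_entropy(head(feats), labels).backward()
    opt.step()

# Because CLIP maps both modalities into one space, the same head classifies
# image embeddings (model.encode_image) at test time.
```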
LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
Recently, large-scale pre-trained Vision and Language (VL) models have set a
new state-of-the-art (SOTA) in zero-shot visual classification, enabling
open-vocabulary recognition of a potentially unlimited set of categories
defined as simple language prompts. However, despite these great advances, the
performance of these zero-shot classifiers still falls short of the results of
dedicated (closed category set) classifiers trained with supervised
fine-tuning. In this paper we show, for the first time, how to reduce this gap
without any labels and without any paired VL data, using only an unlabeled
image collection and a set of texts auto-generated by a Large Language Model
(LLM) to describe the categories of interest, effectively substituting for
labeled visual instances of those categories. Using our label-free approach, we
attain significant performance improvements over the zero-shot performance of
the base VL model and over other contemporary methods and baselines on a wide
variety of datasets, demonstrating an absolute improvement of up to 11.7%
(3.8% on average) in the label-free setting. Moreover, despite our approach
being label-free, we observe 1.3% average gains over leading few-shot
prompting baselines that do use 5-shot supervision.
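The label-free adaptation step can be sketched as pseudo-labeling the unlabeled image collection with the text-pretrained classifier and self-training on confident predictions. This is a minimal sketch assuming CLIP image embeddings of the collection are precomputed; the confidence threshold and the choice to fine-tune only the linear head are illustrative simplifications, not the exact LaFTer recipe (which adapts the vision side via prompts).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(head, img_feats, threshold=0.8):
    """Keep only images whose top softmax score exceeds the threshold."""
    probs = F.softmax(head(img_feats), dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf > threshold
    return img_feats[keep], labels[keep]

def self_train(head, img_feats, epochs=10, lr=1e-4):
    """Adapt the text-pretrained head on its own confident predictions."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        x, y = pseudo_label(head, img_feats)  # re-label every epoch
        if len(y) == 0:
            break
        opt.zero_grad()
        F.cross_entropy(head(x), y).backward()
        opt.step()
    return head

# img_feats: L2-normalized CLIP image embeddings of the unlabeled collection,
# e.g. F.normalize(model.encode_image(images).float(), dim=-1)
# head = self_train(head, img_feats)
```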
Sit Back and Relax: Learning to Drive Incrementally in All Weather Conditions
In autonomous driving scenarios, current object detection models show strong
performance when tested in clear weather. However, their performance
deteriorates significantly when tested in adverse weather conditions. In
addition, even when adapted to perform robustly in a sequence of different
weather conditions, they are often unable to perform well in all of them and
suffer from catastrophic forgetting. To efficiently mitigate forgetting, we
propose Domain-Incremental Learning through Activation Matching (DILAM), which
employs unsupervised feature alignment to adapt only the affine parameters of a
clear-weather pre-trained network to different weather conditions. We propose
to store these affine parameters in a memory bank, one entry per weather
condition, and to plug in the weather-specific parameters during driving
(i.e., test time) when the respective weather condition is encountered. Our
memory bank is extremely lightweight, since affine parameters account for less
than 2% of a typical object detector's parameters. Furthermore, contrary to
previous domain-incremental learning approaches, we do not require the weather
label at test time and propose to automatically infer the weather condition
with a majority-voting linear classifier.
Comment: Intelligent Vehicle Conference (oral presentation)
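The memory-bank idea can be sketched in a few lines, assuming the detector's normalization layers are standard `nn.BatchNorm2d` modules whose affine parameters were adapted per condition beforehand; the helper functions and the `detector`/`voted_condition` names below are illustrative, not DILAM's actual implementation.

```python
import torch
import torch.nn as nn

def save_affine(model):
    """Snapshot the affine (weight/bias) parameters of every norm layer."""
    return {name: (m.weight.detach().clone(), m.bias.detach().clone())
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}

@torch.no_grad()
def load_affine(model, snapshot):
    """Plug stored affine parameters back into the matching layers."""
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d) and name in snapshot:
            w, b = snapshot[name]
            m.weight.copy_(w)
            m.bias.copy_(b)

# One snapshot per condition; the rest of the detector stays frozen, so the
# bank stores only a small fraction of the model's parameters.
memory_bank = {}
# memory_bank["fog"] = save_affine(detector_adapted_to_fog)

# At test time, a lightweight classifier votes on the current condition over
# recent frames, and the winning snapshot is plugged in:
# load_affine(detector, memory_bank[voted_condition])
```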