VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention
recently for its transferable visual representation learning. However, due to
the semantic gap within datasets, CLIP's pre-trained image-text alignment
becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modality embedding space, we propose to
enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide
textual features of different categories to adaptively explore informative
regions on the image and aggregate visual features by attention mechanisms. In
this way, the texts become visual-guided, namely, more semantically correlated
with downstream images, which greatly benefits the category-wise matching
process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known
classification datasets to demonstrate its effectiveness.
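The core mechanism lends itself to a short sketch. Below is a minimal, illustrative PyTorch version of the visual-guided text idea, assuming CLIP-style category text embeddings and spatial patch features; the class name, shapes, and residual connection are assumptions, not the authors' implementation.

```python
# A minimal sketch of visual-guided texts: category text embeddings attend
# over image patches and absorb visual context. Illustrative only.
import torch
import torch.nn as nn

class VisualGuidedText(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Text embeddings act as queries over image regions.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, num_classes, dim); patch_feats: (B, num_patches, dim)
        guided, _ = self.attn(query=text_feats, key=patch_feats, value=patch_feats)
        # A residual keeps the original CLIP text semantics.
        return text_feats + guided

# The guided texts then enter the usual category-wise matching with the image.
text = torch.randn(2, 11, 512)     # 11 categories, as in the paper's benchmarks
patches = torch.randn(2, 49, 512)  # a 7x7 patch grid
guided = VisualGuidedText()(text, patches)  # (2, 11, 512)
```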
Union-net: A deep neural network model adapted to small data sets
In real applications, often only small data sets are available. Most practical machine-learning systems currently address small data sets with classic models designed for big data. However, deep neural network models have complex structures and huge numbers of parameters, and training them requires advanced equipment, which makes them difficult to apply. This paper therefore proposes the concept of union convolution and designs Union-net, a light deep network model with a shallow structure adapted to small data sets. The model combines convolutional units that take different combinations of the same input to form a union module, where each union module is equivalent to a convolutional layer. Three modules connected serially constitute a "3-layer" neural network, and the outputs of all union modules are fused and added as the input of a final convolutional layer, forming a complex network with a 4-layer structure. This addresses the problem that deep networks with long transmission paths lose low-level information. Because the model has few parameters and few channels, it adapts well to small data sets and is less prone to the overfitting that deep models exhibit when trained on them. We conduct multi-class classification experiments on the public data sets CIFAR-10 and 17flowers. The experiments show that Union-net performs well on both large and small data sets, giving it high practical value in everyday application scenarios. The model code is published at https://github.com/yeaso/union-net. Comment: 13 pages, 6 figures.
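A rough sketch of the union-module idea follows, under the assumption that "union convolution" means parallel convolutional units over the same input whose outputs are fused; the kernel sizes, channel counts, and fusion rule are illustrative guesses, not the published configuration.

```python
# Illustrative Union-net-style model: three serial union modules whose
# outputs are fused and added before a final conv layer.
import torch
import torch.nn as nn

class UnionModule(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Parallel convolutional units applied to the same input.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Summing the branches makes the module act like one conv layer.
        return self.act(sum(branch(x) for branch in self.branches))

class UnionNet(nn.Module):
    def __init__(self, num_classes: int = 10, ch: int = 32):
        super().__init__()
        self.m1 = UnionModule(3, ch)
        self.m2 = UnionModule(ch, ch)
        self.m3 = UnionModule(ch, ch)
        # All module outputs feed the final conv, shortening the path
        # from shallow layers to the classifier.
        self.final = nn.Conv2d(ch, num_classes, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        y1 = self.m1(x)
        y2 = self.m2(y1)
        y3 = self.m3(y2)
        fused = y1 + y2 + y3  # fused-and-added skip connections
        return self.pool(self.final(fused)).flatten(1)

logits = UnionNet()(torch.randn(2, 3, 32, 32))  # (2, 10)
```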
Deeply Coupled Cross-Modal Prompt Learning
Recent advancements in multimodal foundation models (e.g., CLIP) have
excelled in zero-shot generalization. Prompt tuning, which supports knowledge transfer from foundation models to downstream tasks, has recently gained significant attention. Existing prompt-tuning methods in cross-modal learning, however, either focus solely on the language branch or learn vision-language
interaction in a shallow mechanism. In this context, we propose a Deeply
coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly
accommodates the interplay between vision and language with a Cross-Modal
Prompt Attention (CMPA) mechanism, which enables the two branches to progressively exchange their representations through a well-connected multi-head attention module. We then conduct comprehensive few-shot learning
experiments on 11 image classification datasets and analyze the robustness to
domain shift as well. Thorough experimental analysis evidently demonstrates the
superb few-shot generalization and compelling domain adaptation capacity of a well-executed DCP. The code can be found at https://github.com/GingL/CMPA. Comment: Accepted by ACL 2023 Findings.
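The prompt exchange can be sketched as follows, assuming CMPA amounts to mutual multi-head attention between the two branches' prompts applied layer by layer; the exact wiring in DCP may differ, and all names and shapes here are illustrative.

```python
# Illustrative cross-modal prompt exchange: vision prompts attend to
# language prompts and vice versa, with residual connections.
import torch
import torch.nn as nn

class CrossModalPromptAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.vis_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_prompts, txt_prompts):
        # Each branch's prompts query the other's, so representations are
        # exchanged progressively at every depth where this is applied.
        new_vis, _ = self.vis_from_txt(vis_prompts, txt_prompts, txt_prompts)
        new_txt, _ = self.txt_from_vis(txt_prompts, vis_prompts, vis_prompts)
        return vis_prompts + new_vis, txt_prompts + new_txt

vis = torch.randn(2, 4, 512)  # (batch, num_prompts, dim)
txt = torch.randn(2, 4, 512)
vis, txt = CrossModalPromptAttention()(vis, txt)
```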
DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable Kendall's Rank Correlation
Few-shot learning aims to adapt models trained on the base dataset to novel
tasks where the categories are not seen by the model before. This often leads
to a relatively uniform distribution of feature values across channels on novel
classes, posing challenges in determining channel importance for novel tasks.
Standard few-shot learning methods employ geometric similarity metrics such as
cosine similarity and negative Euclidean distance to gauge the semantic
relatedness between two features. However, features with high geometric
similarities may carry distinct semantics, especially in the context of
few-shot learning. In this paper, we demonstrate that the importance ranking of
feature channels is a more reliable indicator for few-shot learning than
geometric similarity metrics. We observe that replacing the geometric
similarity metric with Kendall's rank correlation only during inference improves the performance of few-shot learning across a wide range of datasets
with different domains. Furthermore, we propose a carefully designed
differentiable loss for meta-training to address the non-differentiability
issue of Kendall's rank correlation. Extensive experiments demonstrate that the
proposed rank-correlation-based approach substantially enhances few-shot
learning performance.
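The measure itself is easy to sketch. The snippet below computes a Kendall-style rank correlation over feature channels and softens the non-differentiable sign comparison with a tanh, one common smoothing choice; the paper's actual differentiable loss for meta-training may differ.

```python
# Soft Kendall rank correlation between two feature vectors: compare the
# ordering of every channel pair, with tanh in place of sign().
import torch

def kendall_similarity(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.01):
    # a, b: (dim,) feature vectors
    da = a.unsqueeze(0) - a.unsqueeze(1)  # pairwise channel differences
    db = b.unsqueeze(0) - b.unsqueeze(1)
    # Soft concordance: ~+1 for concordant channel pairs, ~-1 for discordant.
    concord = torch.tanh(da / temperature) * torch.tanh(db / temperature)
    n = a.numel()
    off_diag = 1.0 - torch.eye(n, device=a.device)
    return (concord * off_diag).sum() / (n * (n - 1))

x = torch.randn(64, requires_grad=True)
y = torch.randn(64)
sim = kendall_similarity(x, y)
sim.backward()  # differentiable, so it can drive meta-training
```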
When hard negative sampling meets supervised contrastive learning
State-of-the-art image models predominantly follow a two-stage strategy:
pre-training on large datasets and fine-tuning with cross-entropy loss. Many
studies have shown that using cross-entropy can result in sub-optimal
generalisation and stability. While the supervised contrastive loss addresses
some limitations of cross-entropy loss by focusing on intra-class similarities
and inter-class differences, it neglects the importance of hard negative
mining. We propose that models benefit from weighting negative samples based on their dissimilarity to positive counterparts. In this paper, we introduce a new supervised contrastive learning
objective, SCHaNe, which incorporates hard negative sampling during the
fine-tuning phase. Without requiring specialized architectures, additional
data, or extra computational resources, experimental results indicate that
SCHaNe outperforms the strong baseline BEiT-3 in Top-1 accuracy across various
benchmarks, with significant gains in both few-shot learning settings and full-dataset fine-tuning. Importantly, our proposed objective sets a new state of the art for base models on ImageNet-1k, achieving 86.14% accuracy. Furthermore, we demonstrate that the proposed objective yields better embeddings, which explains the improved effectiveness observed in our experiments.
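One plausible instantiation of a supervised contrastive loss with hard-negative weighting is sketched below; the weighting scheme (negatives up-weighted in proportion to their similarity to the anchor) and all hyperparameters are assumptions for illustration, not the published SCHaNe objective.

```python
# Supervised contrastive loss where harder negatives (more similar to the
# anchor) receive larger weight in the denominator. Illustrative sketch.
import torch
import torch.nn.functional as F

def sup_con_hard_neg(feats, labels, tau: float = 0.1, beta: float = 0.5):
    feats = F.normalize(feats, dim=1)
    n = feats.size(0)
    sim = feats @ feats.t() / tau                      # (n, n) scaled similarities
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    pos = (labels[:, None] == labels[None, :]) & ~eye  # same-class pairs
    neg = labels[:, None] != labels[None, :]           # different-class pairs

    exp_sim = sim.exp()
    # Reweight negatives by similarity, normalized so each anchor's weights
    # sum to its number of negatives (weight ~1 on average).
    w = neg * (beta * sim).exp()
    w = w * neg.sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-12)

    denom = (exp_sim * pos).sum(1, keepdim=True) + (exp_sim * w).sum(1, keepdim=True)
    log_prob = sim - denom.log()
    # Average positive-pair log-likelihoods per anchor, then over the batch.
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp_min(1)
    return loss.mean()

feats = torch.randn(8, 128, requires_grad=True)
labels = torch.randint(0, 3, (8,))
sup_con_hard_neg(feats, labels).backward()
```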
PVP: Pre-trained Visual Parameter-Efficient Tuning
Large-scale pre-trained transformers have demonstrated remarkable success in
various computer vision tasks. However, it is still highly challenging to fully
fine-tune these models for downstream tasks due to their high computational and
storage costs. Recently, Parameter-Efficient Tuning (PETuning) techniques,
e.g., Visual Prompt Tuning (VPT) and Low-Rank Adaptation (LoRA), have
significantly reduced the computation and storage cost by inserting lightweight
prompt modules into the pre-trained models and tuning these prompt modules with
a small number of trainable parameters, while keeping the transformer backbone
frozen. Although only a few parameters need to be adjusted, most PETuning
methods still require a significant amount of downstream task training data to
achieve good results; performance is inadequate in low-data regimes, especially when there are only one or two examples per class. To this end, we first empirically identify that the poor performance is mainly due to the inappropriate initialization of prompt modules, an issue that has also been verified in pre-trained language models. Next, we propose a Pre-trained Visual
Parameter-efficient (PVP) Tuning framework, which pre-trains the
parameter-efficient tuning modules first and then leverages the pre-trained
modules along with the pre-trained transformer backbone to perform
parameter-efficient tuning on downstream tasks. Experimental results on five
Fine-Grained Visual Classification (FGVC) and VTAB-1k datasets demonstrate that
our proposed method significantly outperforms state-of-the-art PETuning
methods
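The two-stage recipe can be illustrated schematically. The sketch below uses a bottleneck adapter as a stand-in parameter-efficient module; PVP covers prompt modules such as VPT and LoRA, so the module design and the file name are hypothetical.

```python
# Illustrative two-stage PVP-style workflow: a small module is pre-trained
# first, then its weights initialize downstream parameter-efficient tuning
# while the backbone stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

backbone = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
adapter = Adapter(384)

# Stage 1 would train `adapter` on upstream data, then save its weights:
#   torch.save(adapter.state_dict(), "pvp_adapter.pt")
# Stage 2 initializes from the pre-trained module instead of from scratch:
#   adapter.load_state_dict(torch.load("pvp_adapter.pt"))
for p in backbone.parameters():
    p.requires_grad = False  # backbone frozen; only the adapter is tuned

out = adapter(backbone(torch.randn(2, 10, 384)))
print(sum(p.numel() for p in adapter.parameters()))  # few trainable parameters
```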
CrAFT: Compression-Aware Fine-Tuning for Efficient Visual Task Adaptation
Transfer learning has become a popular task adaptation method in the era of
foundation models. However, many foundation models require large storage and
computing resources, which makes off-the-shelf deployment impractical.
Post-training compression techniques such as pruning and quantization can help
lower deployment costs. Unfortunately, the resulting performance degradation
limits the usability and benefits of such techniques. To close this performance
gap, we propose CrAFT, a simple fine-tuning framework that enables effective
post-training network compression. In CrAFT, users simply employ the default fine-tuning schedule along with a sharpness minimization objective,
simultaneously facilitating task adaptation and compression-friendliness.
Contrary to conventional sharpness minimization techniques, which are applied during pretraining, the CrAFT approach adds negligible training overhead, as fine-tuning completes within minutes to a few hours on a
single GPU. The effectiveness of CrAFT, which is a general-purpose tool that
can significantly boost one-shot pruning and post-training quantization, is
demonstrated on both convolution-based and attention-based vision foundation
models on a variety of target tasks. The code will be made publicly available. Comment: Preprint.
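The sharpness-minimization step that such a recipe builds on can be sketched with the generic SAM update (perturb the weights toward higher loss, then descend using the gradient at the perturbed point); the rho value and parameter handling below are illustrative, not CrAFT's exact procedure.

```python
# Generic sharpness-aware minimization (SAM) step: probe the loss landscape
# at a perturbed point, then update the original weights with that gradient.
import torch

def sam_step(model, loss_fn, batch, optimizer, rho: float = 0.05):
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: gradient at the current weights gives the ascent direction.
    loss_fn(model(x), y).backward()
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)  # climb toward higher loss
    optimizer.zero_grad()

    # Second pass: the gradient at the perturbed point reflects sharpness.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)  # restore the original weights
    optimizer.step()   # descend using the sharpness-aware gradient
    optimizer.zero_grad()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batch = (torch.randn(8, 10), torch.randint(0, 2, (8,)))
sam_step(model, torch.nn.functional.cross_entropy, batch, opt)
```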
Elucidating and Overcoming the Challenges of Label Noise in Supervised Contrastive Learning
Image classification datasets exhibit a non-negligible fraction of mislabeled
examples, often due to human error when one class superficially resembles
another. This issue poses challenges in supervised contrastive learning (SCL),
where the goal is to cluster together data points of the same class in the
embedding space while distancing those of disparate classes. While such methods
outperform those based on cross-entropy, they are not immune to labeling
errors. However, while the detrimental effects of noisy labels in supervised
learning are well-researched, their influence on SCL remains largely
unexplored. Hence, we analyse the effect of label errors and examine how they
disrupt the SCL algorithm's ability to distinguish between positive and
negative sample pairs. Our analysis reveals that human labeling errors manifest
as easy positive samples in around 99% of cases. We, therefore, propose D-SCL,
a novel Debiased Supervised Contrastive Learning objective designed to mitigate
the bias introduced by labeling errors. We demonstrate that D-SCL consistently
outperforms state-of-the-art techniques for representation learning across
diverse vision benchmarks, offering improved robustness to label errors.
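The failure mode analyzed here is easy to reproduce numerically. The toy snippet below illustrates the problem rather than the D-SCL objective: it flips a fraction of labels and measures how many contrastive pairs become falsely positive (wrongly pulled together) or falsely negative (wrongly pushed apart).

```python
# How label noise corrupts the positive/negative pair masks that supervised
# contrastive learning relies on. Illustration only, not the D-SCL method.
import torch

torch.manual_seed(0)
true_labels = torch.randint(0, 3, (1000,))
noisy_labels = true_labels.clone()
flip = torch.rand(1000) < 0.1  # 10% human-style mislabels (assumed rate)
noisy_labels[flip] = torch.randint(0, 3, (int(flip.sum()),))

true_pos = true_labels[:, None] == true_labels[None, :]
noisy_pos = noisy_labels[:, None] == noisy_labels[None, :]
false_pos = (noisy_pos & ~true_pos).float().mean()  # pulled together wrongly
false_neg = (~noisy_pos & true_pos).float().mean()  # pushed apart wrongly
print(f"false positive pairs: {false_pos.item():.3%}, "
      f"false negative pairs: {false_neg.item():.3%}")
```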
Meta Co-Training: Two Views are Better than One
In many practical computer vision scenarios unlabeled data is plentiful, but
labels are scarce and difficult to obtain. As a result, semi-supervised learning, which leverages unlabeled data to boost the performance of supervised classifiers, has received significant attention in the recent literature. One major
class of semi-supervised algorithms is co-training. In co-training, two different models leverage different independent and sufficient "views" of the data to jointly make better predictions. During co-training, each model creates pseudo-labels on unlabeled points, which are used to improve the other model. We
show that in the common case when independent views are not available we can
construct such views inexpensively using pre-trained models. Co-training on the
constructed views yields a performance improvement over any of the individual
views we construct and performance comparable with recent approaches in
semi-supervised learning, but has some undesirable properties. To alleviate the
issues present with co-training we present Meta Co-Training which is an
extension of the successful Meta Pseudo Labels approach to two views. Our
method achieves new state-of-the-art performance on ImageNet-10% with very few
training resources, as well as outperforming prior semi-supervised work on
several other fine-grained image classification datasets. Comment: 16 pages, 14 figures, 10 tables; for the implementation see https://github.com/JayRothenberger/Meta-Co-Training.
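The basic exchange can be sketched as follows, assuming the two "views" are embeddings from two different frozen pre-trained encoders feeding lightweight classification heads; Meta Co-Training's meta-learned update is richer than this confidence-thresholded swap, and all names and thresholds here are illustrative.

```python
# Illustrative co-training round: each head pseudo-labels the unlabeled
# batch from its own view, and confident labels train the *other* head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cotrain_round(head_a, head_b, opt_a, opt_b, view_a, view_b, threshold=0.9):
    with torch.no_grad():
        conf_a, pl_a = head_a(view_a).softmax(dim=1).max(dim=1)
        conf_b, pl_b = head_b(view_b).softmax(dim=1).max(dim=1)
    mask_a, mask_b = conf_a > threshold, conf_b > threshold
    if mask_a.any():  # A's confident pseudo-labels supervise B
        opt_b.zero_grad()
        F.cross_entropy(head_b(view_b[mask_a]), pl_a[mask_a]).backward()
        opt_b.step()
    if mask_b.any():  # and vice versa
        opt_a.zero_grad()
        F.cross_entropy(head_a(view_a[mask_b]), pl_b[mask_b]).backward()
        opt_a.step()

# Two views, e.g. embeddings from two different frozen pre-trained encoders.
view_a, view_b = torch.randn(64, 512), torch.randn(64, 768)
head_a, head_b = nn.Linear(512, 10), nn.Linear(768, 10)
opt_a = torch.optim.SGD(head_a.parameters(), lr=0.01)
opt_b = torch.optim.SGD(head_b.parameters(), lr=0.01)
cotrain_round(head_a, head_b, opt_a, opt_b, view_a, view_b)
```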