Ranking-aware Uncertainty for Text-guided Image Retrieval
Text-guided image retrieval incorporates conditional text to better capture
users' intent. Existing methods typically minimize the embedding distances
between the source inputs and the target image using the provided (source
image, source text, target image) triplets. However, such triplet optimization
limits the learned retrieval model's ability to capture finer-grained ranking
information: the triplets encode only one-to-one correspondences and fail to
account for the many-to-many correspondences that arise from semantic diversity
in the feedback text and images. To capture more ranking information, we propose a novel
ranking-aware uncertainty approach to model many-to-many correspondences by
only using the provided triplets. We introduce uncertainty learning to learn
the stochastic ranking list of features. Specifically, our approach mainly
comprises three components: (1) In-sample uncertainty, which aims to capture
semantic diversity using a Gaussian distribution derived from both combined and
target features; (2) Cross-sample uncertainty, which further mines the ranking
information from other samples' distributions; and (3) Distribution
regularization, which aligns the distributional representations of source
inputs and the target image. Compared to existing state-of-the-art methods,
our approach achieves significant improvements on two public datasets for
composed image retrieval.
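A minimal numpy sketch of the in-sample uncertainty idea: derive a Gaussian from the combined and target features, then draw reparameterized samples to form a stochastic ranking list. The variance construction (squared feature gap) and the fixed sample count are illustrative assumptions; in the paper these quantities are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_from_features(combined, target):
    """In-sample uncertainty sketch: mean is the combined (image+text)
    feature; variance grows with the gap to the target feature
    (an illustrative choice, not the paper's learned parameterization)."""
    mean = combined
    var = (combined - target) ** 2 + 1e-6  # larger gap -> more uncertainty
    return mean, var

def sample_features(mean, var, n_samples=8):
    """Reparameterized draws: a stochastic ranking list of features."""
    eps = rng.standard_normal((n_samples, mean.shape[0]))
    return mean + np.sqrt(var) * eps

combined = np.array([0.2, 0.9, -0.3])   # toy combined feature
target = np.array([0.25, 0.8, -0.2])    # toy target-image feature
mean, var = gaussian_from_features(combined, target)
samples = sample_features(mean, var)
# rank the samples against the target by Euclidean distance
dists = np.linalg.norm(samples - target, axis=1)
print(dists.shape)  # (8,)
```

Sampling many plausible features per triplet is what turns one-to-one supervision into a distribution over rankings.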
Simultaneous Feature Learning and Hash Coding with Deep Neural Networks
Similarity-preserving hashing is a widely-used method for nearest neighbour
search in large-scale image retrieval tasks. For most existing hashing methods,
an image is first encoded as a vector of hand-engineered visual features,
followed by a separate projection or quantization step that generates
binary codes. However, such visual feature vectors may not be optimally
compatible with the coding process, thus producing sub-optimal hash codes.
In this paper, we propose a deep architecture for supervised hashing, in which
images are mapped into binary codes via carefully designed deep neural
networks. The pipeline of the proposed deep architecture consists of three
building blocks: 1) a sub-network with a stack of convolution layers to produce
the effective intermediate image features; 2) a divide-and-encode module to
divide the intermediate image features into multiple branches, each encoded
into one hash bit; and 3) a triplet ranking loss designed to characterize that
one image is more similar to the second image than to the third one. Extensive
evaluations on several benchmark image datasets show that the proposed
simultaneous feature learning and hash coding pipeline brings substantial
improvements over other state-of-the-art supervised or unsupervised hashing
methods.

Comment: This paper has been accepted to the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 201
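The divide-and-encode module and the triplet ranking loss can be sketched as follows. The mean-threshold bit encoder and the squared-distance margin form are simplifying assumptions for illustration; in the paper both the features and the encoding are learned end-to-end by convolutional sub-networks.

```python
import numpy as np

def triplet_ranking_loss(q, p, n, margin=1.0):
    """Triplet ranking loss: the query code q should be closer to the
    more-similar image p than to the less-similar image n, by a margin."""
    d_pos = np.sum((q - p) ** 2)
    d_neg = np.sum((q - n) ** 2)
    return max(0.0, margin + d_pos - d_neg)

def divide_and_encode(features, n_bits):
    """Divide-and-encode sketch: split the intermediate feature vector
    into n_bits slices and map each slice to one hash bit (here by
    thresholding the slice mean, an illustrative stand-in for the
    learned per-branch encoder)."""
    slices = np.array_split(features, n_bits)
    return np.array([1.0 if s.mean() > 0 else 0.0 for s in slices])

feat = np.array([0.4, 0.1, -0.2, -0.5, 0.7, 0.9])  # toy intermediate feature
code = divide_and_encode(feat, 3)
print(code)  # [1. 0. 1.]
```

Encoding each branch independently, rather than projecting the whole vector at once, is what reduces redundancy between hash bits.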
MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model
Deep neural networks (DNNs) are vulnerable to adversarial perturbation: an
imperceptible perturbation added to an image can fool the network.
Diffusion-based adversarial purification uses the diffusion model to generate
a clean image that defeats such adversarial attacks. Unfortunately, the
generative process of the diffusion model is itself inevitably affected by the
adversarial perturbation, since the diffusion model is also a deep network
whose input carries that perturbation. In this work, we propose
MimicDiffusion, a new diffusion-based adversarial purification technique that
directly approximates the generative process of the diffusion model with the
clean image as input. Concretely, we analyze the differences between the
guidance terms obtained with the clean image and with the adversarial sample.
Based on this analysis, we implement MimicDiffusion using the Manhattan
distance and propose two guidance terms to purify the adversarial
perturbation and approximate the clean
diffusion model. Extensive experiments on three image datasets including
CIFAR-10, CIFAR-100, and ImageNet with three classifier backbones including
WideResNet-70-16, WideResNet-28-10, and ResNet50 demonstrate that
MimicDiffusion performs significantly better than the state-of-the-art
baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%,
and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\%
higher, respectively. The code is available in the supplementary material.
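The role of the Manhattan distance can be sketched with a toy guidance term: the gradient of the L1 distance between the current estimate and the adversarial input is simply the elementwise sign of their difference. This single guided step is a heavy simplification of the paper's two guidance terms and full sampling loop; the function names and step size are illustrative assumptions.

```python
import numpy as np

def manhattan_guidance(x, x_adv, scale=1.0):
    """Sketch of L1-based guidance: the gradient of ||x - x_adv||_1
    w.r.t. x is the elementwise sign of the difference, so the guidance
    steers the sample toward the adversarial input's content without
    following its fine-grained perturbation magnitudes."""
    return -scale * np.sign(x - x_adv)

def guided_denoise_step(denoised, x_adv, step=0.1):
    """One hypothetical purification step: nudge the denoised estimate
    with the L1 guidance (stand-in for the paper's guided sampling)."""
    return denoised + step * manhattan_guidance(denoised, x_adv)

x_adv = np.array([0.5, -0.2, 0.8])     # toy adversarial input
denoised = np.array([0.9, -0.5, 0.8])  # toy denoised estimate
purified = guided_denoise_step(denoised, x_adv)
print(purified)  # [ 0.8 -0.4  0.8]
```

Because the sign gradient is bounded, the guidance moves every coordinate by the same fixed amount, which is less sensitive to the perturbation's scale than an L2 gradient would be.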
Improving Entropy-Based Test-Time Adaptation from a Clustering View
Domain shift is a common problem in the real world, where training data
and test data follow different distributions. To deal with this problem,
fully test-time adaptation (TTA) leverages the unlabeled data encountered
during test time to adapt the model. In particular, entropy-based TTA (EBTTA)
methods, which minimize the prediction's entropy on test samples, have shown
great success. In this paper, we introduce a new clustering perspective on
EBTTA. From this view, EBTTA is an iterative algorithm: 1) in the assignment
step, the forward pass of the model assigns labels to the test samples; and
2) in the updating step, the backward pass updates the model using the
assigned samples. This perspective lets us examine how entropy minimization
influences test-time adaptation and guides our improvements to EBTTA. We
improve both the assignment step and the updating step, proposing robust
label assignment, a similarity-preserving constraint, sample selection, and
gradient accumulation to explicitly exploit more information. Experimental results
demonstrate that our method can achieve consistent improvements on various
datasets. Code is provided in the supplementary material.
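The clustering view of entropy minimization can be sketched in a few lines: the forward pass assigns the sample its most-likely class (the cluster label), and one entropy-descent step on the logits sharpens the prediction toward that cluster. The hand-derived gradient dH/dz_j = -p_j (log p_j + H) for a softmax output, and the single-sample setup, are illustrative simplifications of updating full model weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def entropy_grad_wrt_logits(p):
    """Gradient of H(softmax(z)) w.r.t. the logits z:
    dH/dz_j = -p_j * (log p_j + H)."""
    return -p * (np.log(p + 1e-12) + entropy(p))

# Assignment step: the forward pass gives the test sample a cluster
# label -- the class with the highest predicted probability.
logits = np.array([2.0, 0.5, -1.0])  # toy test-sample logits
p = softmax(logits)
assigned = int(np.argmax(p))

# Updating step: one gradient-descent step on the entropy sharpens
# the prediction toward the assigned cluster (here we update the
# logits directly; in practice the model weights are updated).
logits = logits - 0.5 * entropy_grad_wrt_logits(p)
p_new = softmax(logits)
```

The update can only reinforce whichever class the forward pass already favored, which is exactly why the assignment step's robustness matters: a wrong initial assignment gets amplified rather than corrected.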