CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval
Text-based Person Retrieval (TPR) aims to retrieve target person images given a textual query. The primary challenge lies in bridging the substantial gap between the vision and language modalities, especially given the scarcity of large-scale training datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for TPR. Specifically, to exploit CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed from text-to-image and image-to-text bidirectional prompts and coupling projections. Secondly, Dual Adapters Transferring (DAT) is designed to transfer knowledge on the output side of the Multi-Head Attention (MHA) blocks in both the vision and language branches. This synergistic two-way collaborative mechanism promotes early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms state-of-the-art approaches across three benchmark datasets while its trainable parameters account for merely 7.4% of the entire model, demonstrating its remarkable efficiency, effectiveness, and generalization.
Comment: ICASSP 2024 (accepted). Minor typo revisions compared to version 1 on arXiv.
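As a rough illustration of the adapter half of this design (DAT), the following minimal PyTorch sketch shows a bottleneck adapter applied to the output of a frozen MHA block; the class name, bottleneck width, and exact placement are assumptions, since the abstract does not specify them.

import torch.nn as nn

class MHAOutputAdapter(nn.Module):
    """Hypothetical bottleneck adapter on the output side of a frozen MHA block."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up to model width

    def forward(self, x):
        # Residual adapter: only these few parameters are trained while the
        # CLIP backbone stays frozen, which is how a method of this kind can
        # keep the trainable fraction of the full model small.
        return x + self.up(self.act(self.down(x)))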
ODN: Opening the Deep Network for Open-set Action Recognition
In recent years, the performance of action recognition has been significantly
improved with the help of deep neural networks. Most of the existing action
recognition works hold the \textit{closed-set} assumption that all action
categories are known beforehand while deep networks can be well trained for
these categories. However, action recognition in the real world is essentially
an \textit{open-set} problem, namely, it is impossible to know all action
categories beforehand and consequently infeasible to prepare sufficient
training samples for those emerging categories. In this case, applying closed-set recognition methods will inevitably lead to errors on unseen categories. To address this challenge, we propose the Open Deep Network (ODN) for the open-set action recognition task. Technically, ODN detects new categories
by applying a multi-class triplet thresholding method, and then dynamically
reconstructs the classification layer and "opens" the deep network by adding
predictors for new categories continually. In order to transfer the learned
knowledge to the new category, two novel methods, Emphasis Initialization and
Allometry Training, are adopted to initialize and incrementally train the new
predictor, so that only a few samples are needed to fine-tune the model. Extensive experiments show that ODN can effectively detect and recognize new categories with little human intervention, making it applicable to open-set action recognition tasks in the real world. Moreover, ODN can even achieve performance comparable to that of some closed-set methods.
Comment: 6 pages, 3 figures, ICME 201
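To make the "opening" step concrete, here is a minimal PyTorch sketch of appending one predictor to a linear classification layer. The initialization only loosely follows the spirit of Emphasis Initialization described above (emphasizing the weights of known classes the novel samples were confused with); the function name, the emphasis_idx argument, and the alpha weighting are all illustrative assumptions.

import torch
import torch.nn as nn

def expand_classifier(fc, emphasis_idx, alpha=1.0):
    # Append one predictor for a newly detected category to a linear
    # classifier. The new weight vector starts from the mean of existing
    # class weights, with extra emphasis on the classes in emphasis_idx,
    # so only a few samples are needed to fine-tune it afterwards.
    old_w, old_b = fc.weight.data, fc.bias.data
    new_w = old_w.mean(dim=0, keepdim=True) \
          + alpha * old_w[emphasis_idx].mean(dim=0, keepdim=True)
    new_fc = nn.Linear(fc.in_features, fc.out_features + 1)
    new_fc.weight.data = torch.cat([old_w, new_w], dim=0)
    new_fc.bias.data = torch.cat([old_b, old_b.new_zeros(1)])
    return new_fc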
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following a natural language instruction. Most previous approaches utilize entire-view features or object-centric features to represent navigable candidates. However, these representations are not informative enough for an agent to select the actions that reach the target location. As knowledge provides crucial information that is
complementary to visible content, in this paper, we propose a Knowledge
Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent
navigation ability. Specifically, we first retrieve facts (i.e., knowledge
described by language descriptions) for the navigation views based on local
regions from the constructed knowledge base. The retrieved facts range from
properties of a single object (e.g., color, shape) to relationships between
objects (e.g., action, spatial position), providing crucial information for
VLN. KERM contains purification, fact-aware interaction, and instruction-guided aggregation modules to integrate visual, history, instruction, and fact features. It can automatically select and gather crucial and relevant cues, yielding more accurate action prediction. Experimental results on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the proposed method.
Comment: Accepted by CVPR 2023. The code is available at https://github.com/XiangyangLi20/KER
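A minimal sketch of the fact-retrieval step, assuming region and fact embeddings live in a shared space (as in CLIP-style models) and retrieval is plain top-k cosine similarity; the function name and signature are illustrative, not the paper's API.

import torch.nn.functional as F

def retrieve_facts(region_feats, fact_feats, facts, top_k=5):
    # region_feats: (R, D) embeddings of local regions in a navigation view
    # fact_feats:   (N, D) embeddings of facts in the knowledge base
    # facts:        list of N fact strings (knowledge as language descriptions)
    sims = F.normalize(region_feats, dim=1) @ F.normalize(fact_feats, dim=1).T
    scores, idx = sims.topk(top_k, dim=1)               # top-k facts per region
    return [[(facts[j], scores[r, c].item()) for c, j in enumerate(row)]
            for r, row in enumerate(idx.tolist())]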
Benign Shortcut for Debiasing: Fair Visual Recognition via Intervention with Shortcut Features
Machine learning models often learn to make predictions that rely on
sensitive social attributes like gender and race, which poses significant
fairness risks, especially in societal applications, such as hiring, banking,
and criminal justice. Existing work tackles this issue by minimizing the information about social attributes employed in models for debiasing. However, the high correlation between the target task and these social attributes makes learning the target task incompatible with debiasing. Given that model bias arises from the learning of bias features (\emph{e.g.}, gender) that help target task optimization, we explore the following research question: \emph{Can we leverage shortcut features to replace the role of bias features in target task optimization for debiasing?} To this end, we propose \emph{Shortcut Debiasing}, which first transfers the target task's reliance on bias features to shortcut features, and then employs causal intervention to eliminate the shortcut features during inference. The key idea of \emph{Shortcut Debiasing} is to design controllable shortcut features that, on the one hand, replace bias features in contributing to the target task during training and, on the other hand, can be easily removed by intervention during inference. This guarantees that learning the target task does not hinder the elimination of bias features. We apply \emph{Shortcut Debiasing} to several benchmark datasets and achieve significant improvements over state-of-the-art debiasing methods in both accuracy and fairness.
Comment: arXiv admin note: text overlap with arXiv:2211.0125
MixBCT: Towards Self-Adapting Backward-Compatible Training
The exponential growth of data, alongside advancements in model structures
and loss functions, has necessitated the enhancement of image retrieval systems
through the utilization of new models with superior feature embeddings.
However, updating the old retrieval database by re-extracting all of its embeddings is expensive. As a solution, backward-compatible training can be employed to avoid the need to update the old retrieval database. While previous methods achieved backward compatibility by aligning
prototypes of the old model, they often overlooked the distribution of the old
features, thus limiting their effectiveness when the old model's low quality
leads to a weakly discriminative feature distribution. On the other hand,
instance-based methods like L2 regression take into account the distribution of
old features but impose strong constraints on the performance of the new model
itself. In this paper, we propose MixBCT, a simple yet highly effective
backward-compatible training method that serves as a unified framework for old
models of varying qualities. Specifically, we summarize four constraints that
are essential for ensuring backward compatibility in an ideal scenario, and we
construct a single loss function to facilitate backward-compatible training.
Our approach adaptively adjusts the constraint domain for new features based on
the distribution of the old embeddings. We conducted extensive experiments on
the large-scale face recognition datasets MS1Mv3 and IJB-C to verify the
effectiveness of our method. The experimental results clearly demonstrate its
superiority over previous methods. Code is available at
https://github.com/yuleung/MixBC
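The abstract does not give the loss, so the following PyTorch sketch is only one plausible reading: instead of plain L2 regression onto old features, the compatibility target is a random interpolation of old and new embeddings, which loosens the constraint domain when the old features are weak. Treat every detail here as an assumption rather than the paper's actual formulation.

import torch
import torch.nn.functional as F

def compatibility_loss(new_feat, old_feat):
    # Hypothetical mixing-based constraint: the target for each new
    # embedding is a random interpolation of the (normalized) old and new
    # embeddings, rather than the old embedding itself as in L2 regression,
    # so low-quality old features constrain the new model less rigidly.
    lam = torch.rand(new_feat.size(0), 1, device=new_feat.device)
    n = F.normalize(new_feat, dim=1)
    o = F.normalize(old_feat, dim=1)
    target = lam * o + (1.0 - lam) * n
    return F.mse_loss(n, target.detach())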
Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions
The study of causal relationships between emotions and causes in texts has
recently received much attention. Most works focus on extracting causally
related clauses from documents. However, none of these works has considered that the causal relationship between the extracted emotion and cause clauses may only be valid under specific context clauses. To highlight the context in such conditional causal relationships, we propose a new task: determine whether an input pair of emotion and cause has a valid causal relationship under different contexts, and extract the specific context clauses that participate in the causal relationship. Since the task is new and no existing dataset is available, we manually annotate a benchmark dataset to obtain labels for our task, along with type annotations for each context clause that can also be used in other applications. We adopt negative sampling to
construct the final dataset to balance the number of documents with and without
causal relationships. Based on the constructed dataset, we propose an
end-to-end multi-task framework, where we design two novel and general modules
to handle the two goals of our task. Specifically, we propose a context masking
module to extract the context clauses participating in the causal
relationships. We also propose a prediction aggregation module to adjust the prediction results according to whether the input emotion and cause depend on specific context clauses. Results of extensive comparative experiments and ablation studies demonstrate the effectiveness and generality of our proposed framework.
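A minimal sketch of what a context-masking module of this kind might look like in PyTorch: each context clause is scored against the emotion-cause pair and softly gated before pooling. The bilinear scorer and all names are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ContextMasking(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Bilinear(dim, dim, 1)  # clause vs. emotion-cause pair

    def forward(self, clause_embs, pair_emb):
        # clause_embs: (L, D) context clause embeddings
        # pair_emb:    (D,)   embedding of the emotion-cause pair
        scores = self.scorer(clause_embs, pair_emb.expand_as(clause_embs))
        gate = torch.sigmoid(scores.squeeze(-1))          # soft participation mask
        pooled = (gate.unsqueeze(-1) * clause_embs).sum(0) / gate.sum().clamp_min(1e-6)
        return gate, pooled                               # mask + masked context summary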
ShuffleMix: Improving Representations via Channel-Wise Shuffle of Interpolated Hidden States
Mixup-style data augmentation algorithms have been widely adopted in various tasks as an implicit regularization on representation learning to improve model generalization; they work by linearly interpolating labeled samples in the input or feature space as well as in the target space. Inspired by the robustness of dropout strategies against over-fitting to limited patterns in training samples, this paper introduces ShuffleMix -- a Shuffle of Mixed hidden features -- which can be interpreted as a kind of dropout operation in feature space. Specifically, our ShuffleMix method applies a simple linear shuffle of randomly selected feature channels between training samples to mix features and leverage interpolated supervision signals, and it can be extended to a generalized shuffle operation by additionally combining linear interpolations of intra-channel features. Compared to its direct competitor in feature augmentation, Manifold Mixup, the proposed ShuffleMix gains superior generalization, owing to imposing more flexible and smooth constraints on the generated samples and achieving the regularization effect of channel-wise feature dropout. Experimental results on several public benchmark datasets for single-label and multi-label visual classification tasks confirm the effectiveness of our method in consistently improving representations over state-of-the-art mixup augmentation.
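A minimal PyTorch sketch of the basic channel-shuffle variant, assuming channels are swapped with a randomly paired sample in the batch and labels are mixed in proportion to the swapped channel fraction; the hyperparameter p and the label-mixing rule are assumptions.

import torch
import torch.nn.functional as F

def shufflemix(h, y, num_classes, p=0.5):
    # h: (B, C, H, W) hidden features; y: (B,) integer labels.
    # Each channel is swapped, with probability p, for the corresponding
    # channel of a randomly paired sample; labels are mixed in proportion
    # to the fraction of channels that were swapped.
    B, C = h.shape[:2]
    perm = torch.randperm(B, device=h.device)
    mask = (torch.rand(C, device=h.device) < p).float().view(1, C, 1, 1)
    h_mix = (1.0 - mask) * h + mask * h[perm]             # channel-wise shuffle
    lam = 1.0 - mask.mean()                               # fraction of kept channels
    y1 = F.one_hot(y, num_classes).float()
    y_mix = lam * y1 + (1.0 - lam) * y1[perm]
    return h_mix, y_mix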