Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search
Text-based person search (TBPS) is a challenging task that aims to search
pedestrian images with the same identity from an image gallery given a query
text. In recent years, TBPS has made remarkable progress and state-of-the-art
methods achieve superior performance by learning local fine-grained
correspondence between images and texts. However, most existing methods rely on
explicitly generated local parts to model fine-grained correspondence between
modalities, which is unreliable due to the lack of contextual information or
the potential introduction of noise. Moreover, existing methods seldom consider
the information inequality problem between modalities caused by image-specific
information. To address these limitations, we propose an efficient joint
Multi-level Alignment Network (MANet) for TBPS, which can learn aligned
image/text feature representations between modalities at multiple levels, and
realize fast and effective person search. Specifically, we first design an
image-specific information suppression module, which suppresses image
background and environmental factors by relation-guided localization and channel attention filtration, respectively. This module effectively alleviates
the information inequality problem and realizes the alignment of information
volume between images and texts. Secondly, we propose an implicit local
alignment module to adaptively aggregate all pixel/word features of image/text
to a set of modality-shared semantic topic centers and implicitly learn the
local fine-grained correspondence between modalities without additional
supervision or cross-modal interactions. In addition, a global alignment module is introduced as a supplement to the local perspective; the cooperation of the global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.
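As a reading aid, here is a minimal PyTorch sketch of the implicit local alignment idea described above: a set of learnable, modality-shared topic centers softly aggregates pixel or word features via attention, so local correspondence can emerge without explicit part detection. Module names, dimensions, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of implicit local alignment via shared topic centers (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitLocalAlignment(nn.Module):
    def __init__(self, dim=512, num_topics=8):
        super().__init__()
        # Topic centers are shared between the image and text branches.
        self.topics = nn.Parameter(torch.randn(num_topics, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, feats):
        # feats: (B, N, dim) flattened pixel features or word features.
        attn = torch.einsum("kd,bnd->bkn", self.topics, feats) * self.scale
        attn = attn.softmax(dim=-1)                          # each topic attends over locations
        local = torch.einsum("bkn,bnd->bkd", attn, feats)    # (B, K, dim) topic-level features
        return F.normalize(local, dim=-1)

# Both modalities reuse the same module, so their K local features live in a
# shared topic space and can be matched topic-by-topic.
align = ImplicitLocalAlignment()
img_local = align(torch.randn(4, 48 * 16, 512))   # pixel features (illustrative shape)
txt_local = align(torch.randn(4, 64, 512))        # word features (illustrative shape)
sim = (img_local * txt_local).sum(-1).mean(-1)    # simple topic-wise similarity
```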
Learning Granularity-Unified Representations for Text-to-Image Person Re-identification
Text-to-image person re-identification (ReID) aims to search for pedestrian
images of an interested identity via textual descriptions. It is challenging
due to both rich intra-modal variations and significant inter-modal gaps.
Existing works usually ignore the difference in feature granularity between the
two modalities, i.e., the visual features are usually fine-grained while
textual features are coarse, which is mainly responsible for the large
inter-modal gaps. In this paper, we propose an end-to-end framework based on
transformers to learn granularity-unified representations for both modalities,
denoted as LGUR. The LGUR framework contains two modules: a Dictionary-based
Granularity Alignment (DGA) module and a Prototype-based Granularity
Unification (PGU) module. In DGA, in order to align the granularities of two
modalities, we introduce a Multi-modality Shared Dictionary (MSD) to
reconstruct both visual and textual features. Besides, DGA has two important
factors, i.e., the cross-modality guidance and the foreground-centric
reconstruction, to facilitate the optimization of MSD. In PGU, we adopt a set
of shared and learnable prototypes as the queries to extract diverse and
semantically aligned features for both modalities in the granularity-unified
feature space, which further promotes the ReID performance. Comprehensive
experiments show that our LGUR consistently outperforms state-of-the-art methods by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released at https://github.com/ZhiyinShao-H/LGUR.
Comment: Accepted by ACM Multimedia 2022
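The following is a hedged sketch of the shared-dictionary reconstruction used in DGA, as described above: features from either modality are re-expressed as attention-weighted combinations of a Multi-modality Shared Dictionary so that their granularities match. The dictionary size, dimensions, and projection are assumptions, not the released LGUR code.

```python
# Sketch: reconstruct visual or textual tokens from a shared dictionary (assumed design).
import torch
import torch.nn as nn

class SharedDictionaryReconstruction(nn.Module):
    def __init__(self, dim=384, dict_size=400):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(dict_size, dim) * 0.02)  # MSD atoms
        self.q = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats):
        # feats: (B, N, dim) visual patch tokens or textual word tokens.
        attn = (self.q(feats) @ self.dictionary.t()) * self.scale  # (B, N, dict_size)
        attn = attn.softmax(dim=-1)
        # Both modalities are expressed over the same "vocabulary" of atoms,
        # which is what unifies their granularity.
        return attn @ self.dictionary

msd = SharedDictionaryReconstruction()
vis_rec = msd(torch.randn(2, 96, 384))   # visual tokens (illustrative shapes)
txt_rec = msd(torch.randn(2, 64, 384))   # textual tokens
```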
TVPR: Text-to-Video Person Retrieval and a New Benchmark
Most existing methods for text-based person retrieval focus on text-to-image
person retrieval. Nevertheless, due to the lack of dynamic information provided
by isolated frames, the performance is hampered when the person is obscured in
isolated frames or variable motion details are given in the textual
description. In this paper, we propose a new task called Text-to-Video Person Retrieval (TVPR), which aims to effectively overcome the limitations of isolated
frames. Since there is no dataset or benchmark that describes person videos
with natural language, we construct a large-scale cross-modal person video
dataset containing detailed natural language annotations of, for example, a person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. To tackle this task, a Text-to-Video Person Retrieval Network
(TVPRN) is proposed. Specifically, TVPRN acquires video representations by
fusing visual and motion representations of person videos, which can deal with
temporal occlusion and the absence of variable motion details in isolated
frames. Meanwhile, we employ pre-trained BERT to obtain caption representations and model the relationship between caption and video representations to reveal the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, extensive experiments have been conducted on the TVPReid dataset.
To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.
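Below is an illustrative sketch, under stated assumptions, of the retrieval pipeline the abstract describes: fuse per-frame appearance and motion features into a video embedding, encode the caption with pre-trained BERT, and rank gallery videos by cosine similarity. Feature dimensions, pooling, and the fusion layer are placeholders rather than TVPRN's actual design.

```python
# Sketch of a text-to-video person retrieval pipeline (assumed shapes and fusion).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class VideoTextRetrieval(nn.Module):
    def __init__(self, vis_dim=2048, mot_dim=1024, embed_dim=512):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + mot_dim, embed_dim)      # appearance + motion fusion
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def encode_video(self, vis_feats, mot_feats):
        # vis_feats: (B, T, vis_dim) per-frame appearance; mot_feats: (B, T, mot_dim) motion.
        fused = self.fuse(torch.cat([vis_feats, mot_feats], dim=-1))
        return F.normalize(fused.mean(dim=1), dim=-1)            # temporal average pooling

    def encode_text(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return F.normalize(self.text_proj(out.pooler_output), dim=-1)

# Ranking: higher cosine similarity means a more relevant person video.
model = VideoTextRetrieval()
tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["a woman in a red coat walks a dog"], return_tensors="pt")
t = model.encode_text(batch["input_ids"], batch["attention_mask"])
v = model.encode_video(torch.randn(10, 8, 2048), torch.randn(10, 8, 1024))
scores = v @ t.t()   # (10, 1) similarity of each gallery video to the query
```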
RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search
Text-based person search aims to retrieve the specified person images given a
textual description. The key to tackling such a challenging task is to learn
powerful multi-modal representations. Towards this, we propose a Relation and
Sensitivity aware representation learning method (RaSa), including two novel
tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). For
one thing, existing methods cluster representations of all positive pairs
without distinction and overlook the noise problem caused by the weak positive
pairs, where the text and the paired image have noisy correspondences, thus leading to overfitting. RA offsets the overfitting risk by introducing
a novel positive relation detection task (i.e., learning to distinguish strong
and weak positive pairs). For another thing, learning invariant representations under data augmentation (i.e., being insensitive to certain transformations) is a common practice in existing methods for improving representation robustness.
Beyond that, SA encourages the representation to perceive sensitive transformations (i.e., by learning to detect replaced words), thus further promoting the representation's robustness. Experiments demonstrate that RaSa
outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in
terms of Rank@1 on CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively.
Code is available at: https://github.com/Flame-Chasers/RaSa.
Comment: Accepted by IJCAI 2023.
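A minimal, assumption-laden sketch of the two auxiliary objectives described above, assuming a fused image-text pair representation and contextual word features are already available: a pair-level head classifies a positive pair as strong or weak (RA), and a token-level head detects which words were replaced in an augmented caption (SA). Head shapes and losses are illustrative, not the released RaSa code.

```python
# Sketch of RA (pair-level) and SA (token-level) prediction heads on top of an
# existing cross-modal encoder (hypothetical dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RaSaHeads(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.relation_head = nn.Linear(dim, 2)   # strong vs. weak positive pair
        self.replace_head = nn.Linear(dim, 2)    # per-token: original vs. replaced word

    def forward(self, pair_feat, token_feats):
        # pair_feat: (B, dim) fused representation of an image-text positive pair.
        # token_feats: (B, L, dim) contextual word features of the (possibly corrupted) caption.
        return self.relation_head(pair_feat), self.replace_head(token_feats)

heads = RaSaHeads()
rel_logits, rep_logits = heads(torch.randn(8, 768), torch.randn(8, 32, 768))
rel_loss = F.cross_entropy(rel_logits, torch.randint(0, 2, (8,)))                    # RA objective
rep_loss = F.cross_entropy(rep_logits.flatten(0, 1), torch.randint(0, 2, (8 * 32,))) # SA objective
```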
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
The pre-training task is indispensable for the text-to-image person
re-identification (T2I-ReID) task. However, there are two underlying inconsistencies between these two tasks that may impact performance: i)
Data inconsistency. A large domain gap exists between the generic images/texts
used in public pre-trained models and the specific person data in the T2I-ReID
task. This gap is especially severe for texts, as general textual data are
usually unable to describe specific people in fine-grained detail. ii) Training inconsistency. The pre-training processes for images and texts are independent, despite cross-modality learning being critical to T2I-ReID. To
address the above issues, we present a new unified pre-training pipeline
(UniPT) designed specifically for the T2I-ReID task. We first build a
large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual
descriptions of images are automatically generated by the CLIP paradigm using a
divide-conquer-combine strategy. Benefiting from this dataset, we then utilize
a simple vision-and-language pre-training framework to explicitly align the
feature space of the image and text modalities during pre-training. In this
way, the pre-training task and the T2I-ReID task are made consistent with each
other at both the data and training levels. Without any bells and whistles, our UniPT achieves competitive Rank-1 accuracies of 68.50%, 60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Both the LUPerson-T dataset and code are available at https://github.com/ZhiyinShao-H/UniPT.
Comment: Accepted by ICCV 2023
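The following sketches one plausible reading of the divide-conquer-combine pseudo-text generation described above, using off-the-shelf CLIP from Hugging Face: score a bank of attribute phrases against the image per attribute group, keep the best phrase in each group, and combine the winners into a caption. The phrase bank and template are invented for illustration and are not LUPerson-T's actual vocabulary.

```python
# Sketch: pseudo-caption generation with CLIP in a divide-conquer-combine style
# (attribute groups and template are hypothetical).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

attribute_bank = {                     # "divide": one candidate list per attribute
    "upper body": ["a red jacket", "a white t-shirt", "a black coat"],
    "lower body": ["blue jeans", "a long skirt", "black trousers"],
    "carrying":   ["a backpack", "a handbag", "nothing"],
}

def pseudo_caption(image: Image.Image) -> str:
    picked = []
    for attr, phrases in attribute_bank.items():          # "conquer": score each group with CLIP
        inputs = processor(text=phrases, images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image     # (1, num_phrases)
        picked.append(phrases[logits.argmax().item()])
    # "combine": merge the winning phrases into one description.
    return f"A person wearing {picked[0]} and {picked[1]}, carrying {picked[2]}."
```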
VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search
Text-based Person Search (TBPS) aims to retrieve images of a target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods
utilize external tools or heavy cross-modal interaction to achieve explicit
alignment of cross-modal fine-grained features, which is inefficient and
time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network
(VGSG) for text-based person search to extract well-aligned fine-grained visual
and textual features. In the proposed VGSG, we develop a Semantic-Group Textual
Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to
extract textual local features under the guidance of visual local clues. In
SGTL, in order to obtain local textual representations, we group textual features along the channel dimension based on the semantic cues of language
expression, which encourages similar semantic patterns to be grouped implicitly
without external tools. In VGKT, a vision-guided attention is employed to
extract visual-related textual features, which are inherently aligned with
visual cues and termed vision-guided textual features. Furthermore, we design a
relational knowledge transfer, including a vision-language similarity transfer
and a class probability transfer, to adaptively propagate information of the
vision-guided textual features to semantic-group textual features. With the
help of relational knowledge transfer, VGKT is capable of aligning
semantic-group textual features with corresponding visual features without
external tools and complex pairwise interaction. Experimental results on two
challenging benchmarks demonstrate its superiority over state-of-the-art
methods.
Comment: Accepted to IEEE TI
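As an illustration only, the sketch below captures the two mechanisms described above under assumed shapes: splitting pooled textual channels into semantic groups (SGTL) and using visual local features as queries over word features to obtain vision-guided textual features, with a simplistic stand-in for the relational knowledge transfer of VGKT.

```python
# Sketch of channel-wise semantic grouping and vision-guided textual features
# (assumed shapes and a simplified transfer loss, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionGuidedGrouping(nn.Module):
    def __init__(self, dim=512, num_groups=4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.proj = nn.Linear(dim // num_groups, dim)   # lift each channel group back to dim
        self.scale = dim ** -0.5

    def semantic_groups(self, word_feats):
        # word_feats: (B, L, dim) -> pool words, then split channels into groups (SGTL).
        pooled = word_feats.mean(dim=1)                             # (B, dim)
        groups = pooled.view(pooled.size(0), self.num_groups, -1)   # (B, G, dim/G)
        return self.proj(groups)                                    # (B, G, dim)

    def vision_guided(self, vis_local, word_feats):
        # vis_local: (B, G, dim) visual part features act as queries over words (VGKT).
        attn = torch.bmm(vis_local, word_feats.transpose(1, 2)) * self.scale
        return torch.bmm(attn.softmax(dim=-1), word_feats)          # (B, G, dim)

m = VisionGuidedGrouping()
words = torch.randn(2, 40, 512)
vis_parts = torch.randn(2, 4, 512)
sg = m.semantic_groups(words)                 # semantic-group textual features
vg = m.vision_guided(vis_parts, words)        # vision-guided textual features (teacher)
transfer_loss = F.mse_loss(sg, vg.detach())   # simplistic stand-in for relational transfer
```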
Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation
We introduce caption-guided face recognition (CGFR) as a new framework to
improve the performance of commercial-off-the-shelf (COTS) face recognition
(FR) systems. In contrast to combining soft biometrics (e.g., facial marks, gender, and age) with face images, in this work we use facial descriptions
provided by face examiners as a piece of auxiliary information. However, due to
the heterogeneity of the modalities, improving the performance by directly
fusing the textual and facial features is very challenging, as both lie in
different embedding spaces. In this paper, we propose a contextual feature
aggregation module (CFAM) that addresses this issue by effectively exploiting
the fine-grained word-region interaction and global image-caption association.
Specifically, CFAM adopts a self-attention and a cross-attention scheme for
improving the intra-modality and inter-modality relationship between the image
and textual features, respectively. Additionally, we design a textual feature
refinement module (TFRM) that refines the textual features of the pre-trained
BERT encoder by updating the contextual embeddings. This module enhances the
discriminative power of textual features with a cross-modal projection loss and
realigns the word and caption embeddings with visual features by incorporating
a visual-semantic alignment loss. We implemented the proposed CGFR framework on
two face recognition models (ArcFace and AdaFace) and evaluated its performance
on the Multi-Modal CelebA-HQ dataset. Our framework significantly improves the
performance of ArcFace in both the 1:1 verification and 1:N identification protocols.
Comment: This article has been accepted for publication in the IEEE International Joint Conference on Biometrics (IJCB), 2023
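A short, assumption-laden sketch of the contextual feature aggregation idea described above: self-attention within each modality, followed by cross-attention from face-region features to caption words, then pooling into a single joint descriptor. The dimensions, number of heads, and pooling are illustrative and not the paper's exact CFAM.

```python
# Sketch of self-attention plus cross-attention aggregation over face regions and
# caption words (hypothetical shapes).
import torch
import torch.nn as nn

class ContextualFeatureAggregation(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, region_feats, word_feats):
        # region_feats: (B, R, dim) face-region features; word_feats: (B, L, dim) caption words.
        r, _ = self.self_img(region_feats, region_feats, region_feats)   # intra-modality context
        w, _ = self.self_txt(word_feats, word_feats, word_feats)
        fused, _ = self.cross(r, w, w)                                   # word-region interaction
        return self.out(fused.mean(dim=1))                               # aggregated joint feature

cfam = ContextualFeatureAggregation()
joint = cfam(torch.randn(4, 49, 512), torch.randn(4, 32, 512))           # (4, 512)
```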