Text-based person search aims to retrieve images of a specified person given a
textual description. The key to tackling this challenging task is to learn
powerful multi-modal representations. To this end, we propose a Relation and
Sensitivity aware representation learning method (RaSa), which comprises two novel
tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). First,
existing methods cluster the representations of all positive pairs
without distinction and overlook the noise introduced by weak positive
pairs, whose text and paired image correspond only noisily; this
leads to overfitting. RA offsets the overfitting risk by introducing
a novel positive relation detection task (i.e., learning to distinguish strong
and weak positive pairs). Second, learning representations that are invariant
under data augmentation (i.e., insensitive to some transformations) is a
common practice in existing methods for improving the robustness of representations.
Beyond that, we use SA to encourage the representation to perceive sensitive
transformations (i.e., learning to detect replaced words), thereby further
promoting its robustness. Experiments demonstrate that RaSa
outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in
terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively.
Code is available at: https://github.com/Flame-Chasers/RaSa. Accepted by IJCAI 2023.
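
The replaced-word detection idea behind SA can be illustrated with a minimal sketch of how training examples for such a task might be built. This is not the authors' implementation. The function name `make_replaced_token_example`, the toy vocabulary, the uniform random choice of substitute words and the `replace_prob` value are all assumptions made for illustration; the abstract does not say how replacement words are chosen.

```python
import random


def make_replaced_token_example(tokens, vocab, replace_prob=0.15, rng=None):
    """Corrupt a caption and return (corrupted_tokens, labels).

    labels[i] is 1 if token i was replaced and 0 otherwise. A model trained
    on such pairs would have to stay sensitive to word-level changes.
    Hypothetical helper: random substitution is an assumption, not the
    paper's stated procedure.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Only substitutes that differ from the original count as a replacement.
            candidates = [w for w in vocab if w != tok]
            if candidates:
                corrupted.append(rng.choice(candidates))
                labels.append(1)
                continue
        corrupted.append(tok)
        labels.append(0)
    return corrupted, labels


if __name__ == "__main__":
    caption = "a woman wearing a red coat".split()
    toy_vocab = ["blue", "man", "hat", "red", "coat"]  # illustrative only
    corrupted, labels = make_replaced_token_example(
        caption, toy_vocab, replace_prob=0.5, rng=random.Random(0)
    )
    print(corrupted, labels)
```

In a full pipeline, the per-token labels would typically be the targets of a binary classification head over the text (or fused text-image) representation; that part is omitted here because the abstract does not describe it.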