Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
Text-based person search aims to retrieve the corresponding person images from
an image database given a sentence describing the person, and it has great
potential for applications such as video surveillance.
Extracting visual contents corresponding to the human description is the key to
this cross-modal matching problem. Moreover, correlated images and descriptions
involve different granularities of semantic relevance, which is usually ignored
in previous methods. To exploit the multilevel corresponding visual contents,
we propose a pose-guided multi-granularity attention network (PMA). First, we
propose a coarse alignment network (CA) that selects the image regions related
to the global description via similarity-based attention. To further capture
phrase-related visual body parts, a fine-grained alignment network (FA) is
proposed, which employs pose information to learn latent semantic alignment
between visual body parts and textual noun phrases. To verify the effectiveness
of our model, we perform extensive experiments on the CUHK Person Description
Dataset (CUHK-PEDES) which is currently the only available dataset for
text-based person search. Experimental results show that our approach
outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
Comment: Published in AAAI 2020 (oral).
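The coarse alignment stage described above boils down to weighting image regions by their similarity to the sentence-level description. Below is a minimal PyTorch sketch of that kind of similarity-based attention; the module name, tensor shapes, and scaling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityAttention(nn.Module):
    """Weights image region features by their similarity to a global text embedding."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, regions, text_global):
        # regions: (B, K, D) local image region features
        # text_global: (B, D) sentence-level description embedding
        scores = torch.einsum('bkd,bd->bk', regions, text_global) * self.scale
        attn = F.softmax(scores, dim=-1)                      # relevance of each region
        attended = torch.einsum('bk,bkd->bd', attn, regions)  # description-conditioned image feature
        return attended, attn
```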
VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search
Text-based Person Search (TBPS) aims to retrieve images of the target pedestrian
indicated by a textual description. It is essential for TBPS to extract
fine-grained local features and align them across modalities. Existing methods
rely on external tools or heavy cross-modal interaction to achieve explicit
alignment of cross-modal fine-grained features, which is inefficient and
time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network
(VGSG) for text-based person search to extract well-aligned fine-grained visual
and textual features. In the proposed VGSG, we develop a Semantic-Group Textual
Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to
extract textual local features under the guidance of visual local clues. In
SGTL, to obtain local textual representations, we group textual features along
the channel dimension based on the semantic cues of the language expression,
which encourages similar semantic patterns to be grouped implicitly without
external tools. In VGKT, vision-guided attention is employed to
extract visual-related textual features, which are inherently aligned with
visual cues and termed vision-guided textual features. Furthermore, we design a
relational knowledge transfer, including a vision-language similarity transfer
and a class probability transfer, to adaptively propagate information of the
vision-guided textual features to semantic-group textual features. With the
help of relational knowledge transfer, VGKT is capable of aligning
semantic-group textual features with corresponding visual features without
external tools and complex pairwise interaction. Experimental results on two
challenging benchmarks demonstrate its superiority over state-of-the-art
methods.
Comment: Accepted to IEEE TI
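The vision-guided attention in VGKT can be read as pooling word features with local visual features as queries, so the pooled textual features are aligned with visual cues by construction. The following PyTorch sketch illustrates that idea; the class name, projections, and masking details are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionGuidedAttention(nn.Module):
    """Pools word features using local visual features as queries, so the
    resulting textual features are aligned with visual cues by construction."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, visual_locals, words, word_mask):
        # visual_locals: (B, P, D) local visual features (one query per part/group)
        # words: (B, L, D) word-level textual features; word_mask: (B, L), 1 = valid token
        attn = torch.einsum('bpd,bld->bpl', self.q(visual_locals), self.k(words)) * self.scale
        attn = attn.masked_fill(word_mask[:, None, :] == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        # vision-guided textual features, one per visual part: (B, P, D)
        return torch.einsum('bpl,bld->bpd', attn, words)
```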
Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search
Text-based person search (TBPS) is a challenging task that aims to search
pedestrian images with the same identity from an image gallery given a query
text. In recent years, TBPS has made remarkable progress and state-of-the-art
methods achieve superior performance by learning local fine-grained
correspondence between images and texts. However, most existing methods rely on
explicitly generated local parts to model fine-grained correspondence between
modalities, which is unreliable due to the lack of contextual information or
the potential introduction of noise. Moreover, existing methods seldom consider
the information inequality problem between modalities caused by image-specific
information. To address these limitations, we propose an efficient joint
Multi-level Alignment Network (MANet) for TBPS, which can learn aligned
image/text feature representations between modalities at multiple levels, and
realize fast and effective person search. Specifically, we first design an
image-specific information suppression module, which suppresses image
background and environmental factors by relation-guided localization and
channel attention filtration respectively. This module effectively alleviates
the information inequality problem and realizes the alignment of information
volume between images and texts. Secondly, we propose an implicit local
alignment module to adaptively aggregate all pixel/word features of image/text
to a set of modality-shared semantic topic centers and implicitly learn the
local fine-grained correspondence between modalities without additional
supervision and cross-modal interactions. In addition, global alignment is
introduced to complement the local perspective. The cooperation of the global
and local alignment modules enables better semantic alignment between modalities.
Extensive experiments on multiple databases demonstrate the effectiveness and
superiority of our MANet.
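The implicit local alignment idea, as described, amounts to softly assigning pixel or word features to a shared set of learnable semantic topic centers and comparing the resulting topic-wise features across modalities. A minimal PyTorch sketch of such an aggregation is given below; the module name, number of topics, and normalization choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTopicAggregation(nn.Module):
    """Aggregates pixel or word features onto a set of modality-shared,
    learnable semantic topic centers via soft assignment (no extra supervision)."""
    def __init__(self, dim, num_topics=8):
        super().__init__()
        self.topics = nn.Parameter(torch.randn(num_topics, dim))  # shared across modalities

    def forward(self, feats):
        # feats: (B, N, D) pixel features of an image or word features of a text
        assign = torch.einsum('bnd,td->bnt', F.normalize(feats, dim=-1),
                              F.normalize(self.topics, dim=-1))
        assign = F.softmax(assign, dim=1)                    # contribution of each token to each topic
        local = torch.einsum('bnt,bnd->btd', assign, feats)  # (B, T, D) topic-wise local features
        return F.normalize(local, dim=-1)
```

Applying the same module to image pixel features and text word features yields two sets of topic-wise local features that can be matched center by center, without extra supervision or pairwise cross-modal interaction.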
Text-based Person Search in Full Images via Semantic-Driven Proposal Generation
Finding target persons in full scene images from a query text description has
important practical applications in intelligent video surveillance. However,
unlike real-world scenarios where bounding boxes are not available, existing
text-based person retrieval methods mainly focus on cross-modal matching
between query text descriptions and a gallery of cropped pedestrian images. To
close this gap, we study the problem of text-based
person search in full images by proposing a new end-to-end learning framework
which jointly optimizes the pedestrian detection, identification, and
visual-semantic feature embedding tasks. To take full advantage of the query
text, the semantic features are leveraged to instruct the Region Proposal
Network to pay more attention to text-described proposals. In addition, a
cross-scale visual-semantic embedding mechanism is utilized to improve
performance. To validate the proposed method, we collect and annotate two
large-scale benchmark datasets based on the widely adopted image-based person
search datasets CUHK-SYSU and PRW. Comprehensive experiments are conducted on
the two datasets; compared with the baseline methods, our method achieves
state-of-the-art performance.
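One simple way to let the query text "instruct" the Region Proposal Network, in the spirit described above, is to gate the detector's feature map with channel-wise attention predicted from the text embedding before proposals are scored. The sketch below shows such a hypothetical gating; it is an assumption about how semantic guidance could be wired in, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class SemanticDrivenRPNGate(nn.Module):
    """Modulates the detector's feature map with the query-text embedding so
    that proposal generation favors regions matching the description."""
    def __init__(self, text_dim, feat_channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(text_dim, feat_channels), nn.Sigmoid())

    def forward(self, feat_map, text_emb):
        # feat_map: (B, C, H, W) backbone features fed to the RPN
        # text_emb: (B, text_dim) query description embedding
        g = self.gate(text_emb)[:, :, None, None]  # channel-wise gate in [0, 1]
        return feat_map * g                        # text-conditioned features for proposal scoring
```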
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
The pre-training task is indispensable for the text-to-image person
re-identification (T2I-ReID) task. However, there are two underlying
inconsistencies between these two tasks that may impact the performance: i)
Data inconsistency. A large domain gap exists between the generic images/texts
used in public pre-trained models and the specific person data in the T2I-ReID
task. This gap is especially severe for texts, as general textual data are
usually unable to describe specific people in fine-grained detail. ii) Training
inconsistency. The processes of pre-training of images and texts are
independent, despite cross-modality learning being critical to T2I-ReID. To
address the above issues, we present a new unified pre-training pipeline
(UniPT) designed specifically for the T2I-ReID task. We first build a
large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual
descriptions of images are automatically generated by the CLIP paradigm using a
divide-conquer-combine strategy. Benefiting from this dataset, we then utilize
a simple vision-and-language pre-training framework to explicitly align the
feature space of the image and text modalities during pre-training. In this
way, the pre-training task and the T2I-ReID task are made consistent with each
other on both the data and training levels. Without the need for any bells and
whistles, our UniPT achieves competitive Rank-1 accuracies of 68.50%, 60.09%,
and 51.85% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Both the
LUPerson-T dataset and the code are available at
https://github.com/ZhiyinShao-H/UniPT.
Comment: Accepted by ICCV 2023.
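Explicitly aligning the image and text feature spaces during pre-training is typically done with a symmetric image-text contrastive (InfoNCE) objective, as popularized by CLIP. The snippet below is a generic sketch of that loss for image/pseudo-text pairs; the function name and temperature value are illustrative and not taken from the UniPT code.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls matched image/pseudo-text pairs
    together and pushes mismatched pairs apart during pre-training."""
    img_emb = F.normalize(img_emb, dim=-1)   # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)   # (B, D)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```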
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of
the input and suppress irrelevant ones, numerous local mechanisms have been
designed to boost the development of computer vision. They can not only focus
on target parts to learn discriminative local representations, but also process
information selectively to improve efficiency. In terms of application
scenarios and paradigms, local mechanisms have different characteristics. In
this survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on.
The categorization of local mechanisms in each field is summarized. Then, the
advantages and disadvantages of each category are analyzed in depth, leaving
room for further exploration. Finally, future research directions for local
mechanisms that may benefit future work are also discussed. To the best of our
knowledge, this is the first survey of local mechanisms in computer vision. We
hope that this survey can shed light on future research in the computer vision
field.