Person Search with Natural Language Description
Searching for persons in large-scale image databases using natural language
descriptions as queries has important applications in video surveillance. Existing
methods mainly focus on searching for persons with image-based or attribute-based
queries, which have major limitations in practical use. In this paper, we
study the problem of person search with natural language description. Given the
textual description of a person, the person search algorithm is required
to rank all the samples in the person database and then retrieve the most relevant
sample corresponding to the queried description. Since there is no person
dataset or benchmark with textual description available, we collect a
large-scale person description dataset with detailed natural language
annotations and person samples from various sources, termed the CUHK Person
Description Dataset (CUHK-PEDES). A wide range of possible models and baselines
have been evaluated and compared on the person search benchmark. A Recurrent
Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to
establish the state-of-the-art performance on person search.
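The ranking step this abstract describes (score every gallery sample against the text query, sort, and return the best match) can be sketched as follows. This is a minimal illustration, not the GNA-RNN model: it assumes the description and the images have already been embedded into a shared vector space, and it uses plain cosine similarity as a stand-in scoring function. The names `rank_gallery`, `top_k_hit`, and the sample ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(query_vec, gallery):
    """Return gallery ids sorted from most to least similar to the query.

    `gallery` maps a (hypothetical) sample id to its embedding vector.
    """
    scored = [(cosine(query_vec, vec), sid) for sid, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sid for _, sid in scored]


def top_k_hit(ranking, relevant_ids, k=1):
    """True if any relevant sample appears among the first k ranked ids."""
    return any(sid in relevant_ids for sid in ranking[:k])


# Toy embeddings, assumed to come from some text/image encoder.
gallery = {"img_a": [1.0, 0.0], "img_b": [0.6, 0.8], "img_c": [0.0, 1.0]}
query = [0.9, 0.1]
ranking = rank_gallery(query, gallery)
```

Averaging `top_k_hit` over all queries with `k=1` gives a top-1 accuracy figure of the kind these benchmarks report.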
Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
Text-based person search aims to retrieve the corresponding person images from
an image database using a sentence describing the person, and it has great
potential for applications such as video surveillance.
Extracting visual contents corresponding to the human description is the key to
this cross-modal matching problem. Moreover, correlated images and descriptions
involve different granularities of semantic relevance, which is usually ignored
in previous methods. To exploit the multilevel corresponding visual contents,
we propose a pose-guided multi-granularity attention network (PMA). First, we
propose a coarse alignment network (CA) that selects the image regions related to
the global description via similarity-based attention. To further capture the
phrase-related visual body parts, a fine-grained alignment network (FA) is
proposed, which employs pose information to learn latent semantic alignment
between visual body parts and textual noun phrases. To verify the effectiveness
of our model, we perform extensive experiments on the CUHK Person Description
Dataset (CUHK-PEDES) which is currently the only available dataset for
text-based person search. Experimental results show that our approach
outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
Comment: published in AAAI 2020 (oral)
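The similarity-based attention mentioned for the coarse alignment network can be illustrated with a generic sketch: score each image region against the sentence embedding, normalize the scores with a softmax, and pool the regions by those weights. This is an assumption about the general technique, not the PMA architecture itself; the dot-product scoring, the function name `attend`, and the toy vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_vec, region_vecs):
    """Weight image regions by their similarity to the text embedding.

    Returns the attention weights and the weighted sum of region vectors.
    """
    weights = softmax([dot(text_vec, r) for r in region_vecs])
    dim = len(region_vecs[0])
    pooled = [sum(w * r[i] for w, r in zip(weights, region_vecs))
              for i in range(dim)]
    return weights, pooled


# Three toy region features and one toy sentence embedding (assumed inputs).
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [2.0, 0.0]
weights, pooled = attend(text, regions)
```

Regions whose features align with the description receive larger weights, so the pooled vector is dominated by the image content the sentence refers to.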