3,529 research outputs found
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of
the input and suppress irrelevant ones, substantial local mechanisms have been
designed to boost the development of computer vision. They can not only focus
on target parts to learn discriminative local representations, but also process
information selectively to improve the efficiency. In terms of application
scenarios and paradigms, local mechanisms have different characteristics. In
this survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on.
Categorization of local mechanisms in each field is summarized. Then,
advantages and disadvantages for every category are analyzed deeply, leaving
room for exploration. Finally, future research directions about local
mechanisms have also been discussed that may benefit future works. To the best
our knowledge, this is the first survey about local mechanisms on computer
vision. We hope that this survey can shed light on future research in the
computer vision field
Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
Text-based person search aims to retrieve the corresponding person images in
an image database by virtue of a describing sentence about the person, which
poses great potential for various applications such as video surveillance.
Extracting visual contents corresponding to the human description is the key to
this cross-modal matching problem. Moreover, correlated images and descriptions
involve different granularities of semantic relevance, which is usually ignored
in previous methods. To exploit the multilevel corresponding visual contents,
we propose a pose-guided multi-granularity attention network (PMA). Firstly, we
propose a coarse alignment network (CA) to select the related image regions to
the global description by a similarity-based attention. To further capture the
phrase-related visual body part, a fine-grained alignment network (FA) is
proposed, which employs pose information to learn latent semantic alignment
between visual body part and textual noun phrase. To verify the effectiveness
of our model, we perform extensive experiments on the CUHK Person Description
Dataset (CUHK-PEDES) which is currently the only available dataset for
text-based person search. Experimental results show that our approach
outperforms the state-of-the-art methods by 15 \% in terms of the top-1 metric.Comment: published in AAAI2020(oral
- …