AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID
Cross-modal person re-identification (Re-ID) is critical for modern video
surveillance systems. The key challenge is to align inter-modality
representations according to the semantic information describing a person while
ignoring background information. In this work, we present AXM-Net, a novel
CNN-based architecture designed for learning semantically aligned visual and
textual representations. The underlying building block consists of multiple
streams of feature maps coming from the visual and textual modalities and a
novel learnable context-sharing semantic alignment network. We also propose
complementary intra-modal attention learning mechanisms to focus on
fine-grained local details in the features, along with a cross-modal affinity
loss for robust feature matching. Our design is unique in its ability to
implicitly learn feature alignments from data. The entire AXM-Net can be
trained in an end-to-end manner. We report results on both person search and
cross-modal Re-ID tasks. Extensive experimentation validates the proposed
framework and demonstrates its superiority by outperforming the current
state-of-the-art methods by a significant margin.
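The abstract above mentions a cross-modal affinity loss for matching visual and textual features. The paper's exact formulation is not given here, so the following is only a minimal sketch of one common family of such losses: a bidirectional hinge (triplet-style) loss over the cosine-affinity matrix of a batch of paired embeddings, with the hardest in-batch negatives. The function name, the margin value, and the hardest-negative mining strategy are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_affinity_loss(visual, textual, margin=0.2):
    """Hypothetical bidirectional matching loss over paired embeddings.

    visual, textual: (batch, dim) arrays; row i of each is a matched pair.
    """
    v = l2_normalize(visual)
    t = l2_normalize(textual)
    sim = v @ t.T                       # (batch, batch) cosine affinity matrix
    pos = np.diag(sim)                  # similarities of matched pairs
    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # hardest negative per row (image -> text) and per column (text -> image)
    neg_i2t = np.where(off_diag, sim, -np.inf).max(axis=1)
    neg_t2i = np.where(off_diag, sim, -np.inf).max(axis=0)
    # hinge: push matched similarity above the hardest negative by `margin`
    loss = (np.maximum(0.0, margin - pos + neg_i2t).mean()
            + np.maximum(0.0, margin - pos + neg_t2i).mean())
    return loss
```

With perfectly aligned pairs (identical orthonormal embeddings) the hinge terms vanish and the loss is zero; misaligned pairs incur a positive penalty in both matching directions.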
Échantillonnage sémantique et classification des couleurs pour la recherche de personnes dans les vidéos (Semantic Sampling and Color Classification for Person Search in Videos)
ABSTRACT: In this work, we look into the application of person search by keyword description, in images taken from security cameras. In this type of application, we aim to describe the persons present in videos according to salient characteristics (e.g., the colors of their clothes), without regard to their identities, which are unknown a priori. The approach is thus similar to generating composite portraits, where persons are described by their semantic parts (such as t-shirt, hat, pants, etc.) and one searches for persons with specific characteristics. These characteristics can then be attached to images as metadata, facilitating the search for specific profiles in videos, for example, a person with green pants and a blue shirt. In a first step, we identify the clothes being worn as the salient characteristics, and in particular we set out to describe their colors precisely. The goal of our work is thus to develop a method for classifying clothing colors in images that comes as close as possible to human perception. To this end, we consider both the vocabulary used, which must be sufficiently general, and the color spaces employed, which must emulate the properties of human color perception.
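The abstract describes classifying clothing colors against a general vocabulary in a perceptually meaningful color space. The thesis's actual vocabulary and chosen color spaces are not given here, so the following is only a minimal sketch of the general idea: convert sRGB values to CIELAB (a standard, approximately perceptually uniform space, using the D65 white point) and assign the nearest name from a small hypothetical six-word vocabulary by Euclidean distance.

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIELAB (D65 white point)."""
    rgb = np.asarray(rgb, dtype=float)
    # undo the sRGB gamma to get linear RGB
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # linear RGB -> CIE XYZ (sRGB primaries, D65)
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = M @ lin
    xyz /= np.array([0.95047, 1.0, 1.08883])   # normalize by D65 white
    d = 6 / 29
    f = np.where(xyz > d ** 3, np.cbrt(xyz), xyz / (3 * d ** 2) + 4 / 29)
    L = 116 * f[1] - 16
    a = 500 * (f[0] - f[1])
    b = 200 * (f[1] - f[2])
    return np.array([L, a, b])

# Hypothetical toy vocabulary of color names with sRGB prototypes.
VOCAB = {"red": (1, 0, 0), "green": (0, 0.5, 0), "blue": (0, 0, 1),
         "yellow": (1, 1, 0), "white": (1, 1, 1), "black": (0, 0, 0)}

def classify_color(rgb):
    """Return the vocabulary name whose prototype is nearest in CIELAB."""
    lab = srgb_to_lab(rgb)
    return min(VOCAB, key=lambda name: np.linalg.norm(lab - srgb_to_lab(VOCAB[name])))
```

Measuring distances in CIELAB rather than raw RGB is what brings the classification closer to human judgment: equal Euclidean steps in CIELAB correspond roughly to equal perceived color differences, which is not true of RGB.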