Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-identification
© 2020, Springer Nature Switzerland AG. Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. Due to the large intra-class variations and the cross-modality discrepancy, together with a large amount of sample noise, it is difficult to learn discriminative part features. Existing VI-ReID methods instead tend to learn global representations, which have limited discriminability and weak robustness to noisy images. In this paper, we propose a novel dynamic dual-attentive aggregation (DDAG) learning method that mines both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID. We propose an intra-modality weighted-part attention module to extract discriminative part-aggregated features by imposing domain knowledge on part relationship mining. To enhance robustness against noisy samples, we introduce cross-modality graph structured attention to reinforce the representation with the contextual relations across the two modalities. We also develop a parameter-free dynamic dual aggregation learning strategy to adaptively integrate the two components in a progressive joint training manner. Extensive experiments demonstrate that DDAG outperforms the state-of-the-art methods under various settings.
Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification
Thanks to cross-modal retrieval techniques, visible-infrared (RGB-IR)
person re-identification (Re-ID) is achieved by projecting the two modalities
into a common space, allowing person Re-ID in 24-hour surveillance systems.
However, with
respect to the probe-to-gallery, almost all existing RGB-IR based cross-modal
person Re-ID methods focus on image-to-image matching, while the video-to-video
matching which contains much richer spatial- and temporal-information remains
under-explored. In this paper, we primarily study the video-based cross-modal
person Re-ID method. To achieve this task, a video-based RGB-IR dataset is
constructed, in which 927 valid identities with 463,259 frames and 21,863
tracklets captured by 12 RGB/IR cameras are collected. Based on our constructed
dataset, we show that performance improves as the number of frames in a
tracklet increases, demonstrating the significance of video-to-video matching
in RGB-IR person Re-ID. Additionally, a novel method is further proposed, which
not only projects the two modalities to a modal-invariant subspace, but also
extracts temporal memory for motion invariance. Thanks to these two
strategies, much better results are achieved on our video-based
cross-modal person Re-ID. The code and dataset are released at:
https://github.com/VCMproject233/MITML
Learning Cross-modality Information Bottleneck Representation for Heterogeneous Person Re-Identification
Visible-Infrared person re-identification (VI-ReID) is an important and
challenging task in intelligent video surveillance. Existing methods mainly
focus on learning a shared feature space to reduce the modality discrepancy
between visible and infrared modalities, which still leaves two problems
underexplored: information redundancy and modality complementarity. To this
end, properly eliminating identity-irrelevant information while compensating
for modality-specific information is critical and remains a challenging
endeavor. To tackle the above problems, we present a novel mutual information
and modality consensus network, namely CMInfoNet, to extract modality-invariant
identity features with the most representative information and reduce the
redundancies. The key insight of our method is to find an optimal
representation to capture more identity-relevant information and compress the
irrelevant parts by optimizing a mutual information bottleneck trade-off.
Besides, we propose an automatic search strategy to find the most prominent
parts that identify the pedestrians. To eliminate the cross- and intra-modality
variations, we also devise a modality consensus module to align the visible and
infrared modalities for task-specific guidance. Moreover, the global-local
feature representations can also be acquired for key parts discrimination.
Experimental results on six benchmarks, i.e., the SYSU-MM01, RegDB,
Occluded-DukeMTMC, Occluded-REID, Partial-REID and Partial_iLIDS datasets, have
demonstrated the effectiveness of CMInfoNet.
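The "mutual information bottleneck trade-off" mentioned in the CMInfoNet abstract above can be illustrated with the generic information bottleneck objective. This is the textbook form, not necessarily the loss used in the paper; the symbols X (input image), Y (identity label), Z (learned representation), θ (encoder parameters) and β (trade-off weight) are assumptions introduced here for illustration.

```latex
\max_{\theta} \; I(Z; Y) \;-\; \beta \, I(Z; X), \qquad \beta > 0
```

The first term rewards representations that retain identity-relevant information, while the second penalizes information about the input that is kept beyond what is needed; a larger β compresses more aggressively.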
Unsupervised Visible-Infrared Person ReID by Collaborative Learning with Neighbor-Guided Label Refinement
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID)
aims at learning modality-invariant features from unlabeled cross-modality
datasets, which is crucial for practical applications in video surveillance
systems. The key to addressing the USL-VI-ReID task is solving the
cross-modality data association problem for further heterogeneous joint
learning. To address this issue, we propose a Dual Optimal Transport Label
Assignment (DOTLA) framework to simultaneously assign the generated labels from
one modality to its counterpart modality. The proposed DOTLA mechanism
formulates a mutually reinforcing and efficient solution to cross-modality data
association, which can effectively reduce the side effects of some
insufficient and noisy label associations. Besides, we further propose a
cross-modality neighbor consistency guided label refinement and regularization
module to eliminate the negative effects of inaccurate supervision
signals, under the assumption that the prediction or label distribution of each
example should be similar to those of its nearest neighbors. Extensive
experimental results on the public SYSU-MM01 and RegDB datasets demonstrate the
effectiveness of the proposed method, which surpasses existing state-of-the-art
approaches by a large margin of 7.76% mAP on average and even surpasses some
supervised VI-ReID methods.
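As a rough illustration of optimal-transport label assignment of the kind the DOTLA abstract above describes, the following is a minimal sketch using entropy-regularised transport solved with Sinkhorn iterations. The abstract does not say which solver the authors use, so Sinkhorn is an assumption here, as are the function name `sinkhorn_assign`, the uniform marginals, and the idea that `cost` holds distances between samples of one modality and cluster centres of the other.

```python
import numpy as np


def sinkhorn_assign(cost, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals.

    cost: (n_samples, n_labels) array; in this hypothetical setup, the
    distance from each sample of one modality to each pseudo-label
    centre of the other modality.
    Returns a transport plan of the same shape. Rows sum to 1/n_samples;
    columns sum to approximately 1/n_labels once the iterations converge.
    """
    n, k = cost.shape
    row_marginal = np.full(n, 1.0 / n)
    col_marginal = np.full(k, 1.0 / k)
    # Gibbs kernel; subtracting the minimum keeps the exponent bounded.
    kernel = np.exp(-(cost - cost.min()) / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternately rescale columns and rows to match the marginals.
        v = col_marginal / (kernel.T @ u)
        u = row_marginal / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    plan = sinkhorn_assign(rng.random((6, 3)), eps=0.5, n_iters=500)
    # Hard pseudo-labels: the label receiving the most mass per sample.
    print(plan.argmax(axis=1))
```

The uniform column marginal encourages every label to receive a similar share of samples, which is one common way to keep noisy associations from collapsing onto a few labels. Running the assignment in both directions (visible to infrared and infrared to visible) would be needed to mirror the "dual" aspect named in the abstract.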
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of
the input and suppress irrelevant ones, substantial local mechanisms have been
designed to boost the development of computer vision. They can not only focus
on target parts to learn discriminative local representations, but also process
information selectively to improve the efficiency. In terms of application
scenarios and paradigms, local mechanisms have different characteristics. In
this survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on.
The categorization of local mechanisms in each field is summarized. Then, the
advantages and disadvantages of every category are analyzed in depth, leaving
room for exploration. Finally, future research directions for local
mechanisms that may benefit future work are discussed. To the best of
our knowledge, this is the first survey of local mechanisms in computer
vision. We hope that this survey can shed light on future research in the
computer vision field.
Adversarial Self-Attack Defense and Spatial-Temporal Relation Mining for Visible-Infrared Video Person Re-Identification
In visible-infrared video person re-identification (re-ID), extracting
features that are unaffected by changes in complex scenes (such as modality,
camera view, pedestrian pose, and background) and mining and utilizing motion
information are the keys to solving cross-modal pedestrian identity matching.
To this end, the paper proposes a new visible-infrared video person re-ID
method from a novel perspective, i.e., adversarial self-attack defense and
spatial-temporal relation mining. In this work, the changes of views, posture,
background and modal discrepancy are considered as the main factors that cause
the perturbations of person identity features. Such interference information
contained in the training samples is used as an adversarial perturbation. It
performs adversarial attacks on the re-ID model during the training to make the
model more robust to these unfavorable factors. The attack from the adversarial
perturbation is introduced by activating the interference information contained
in the input samples without generating adversarial samples, and it can be thus
called adversarial self-attack. This design allows adversarial attack and
defense to be integrated into one framework. This paper further proposes a
spatial-temporal information-guided feature representation network to use the
information in video sequences. The network can not only extract the information
contained in the video-frame sequences but also use the relation of the local
information in space to guide the network to extract more robust features. The
proposed method exhibits compelling performance on large-scale cross-modality
video datasets. The source code of the proposed method will be released at
https://github.com/lhf12278/xxx.
Comment: 11 pages, 8 figures
Visible-Infrared Person Re-Identification via Patch-Mixed Cross-Modality Learning
Visible-infrared person re-identification (VI-ReID) aims to retrieve images
of the same pedestrian from different modalities, where the challenges lie in
the significant modality discrepancy. To alleviate the modality gap, recent
methods generate intermediate images by GANs, grayscaling, or mixup strategies.
However, these methods could introduce extra noise, and the semantic
correspondence between the two modalities is not well learned. In this paper,
we propose a Patch-Mixed Cross-Modality framework (PMCM), where two images of
the same person from two modalities are split into patches and stitched into a
new one for model learning. In this way, the model learns to recognize a person
through patches of different styles, and the modality semantic correspondence
is directly embodied. With the flexible image generation strategy, the
patch-mixed images freely adjust the ratio of different modality patches, which
could further alleviate the modality imbalance problem. In addition, the
relationship between identity centers among modalities is explored to further
reduce the modality variance, and the global-to-part constraint is introduced
to regularize representation learning of part features. On two VI-ReID
datasets, we report new state-of-the-art performance with the proposed method.
Comment: IJCAI2
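The patch-mixing step described in the PMCM abstract above (splitting two images of the same person into patches and stitching them into a new image with an adjustable modality ratio) can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name `patch_mix`, the square grid of `patch`-pixel cells, the random cell selection, and the `ir_ratio` parameter are all assumptions.

```python
import numpy as np


def patch_mix(visible, infrared, patch=4, ir_ratio=0.5, rng=None):
    """Stitch a new image from patches of two same-sized images.

    visible, infrared: arrays of shape (H, W, C), with H and W divisible
    by `patch` (the patch side length in pixels).
    ir_ratio: fraction of patches taken from the infrared image.
    Returns the mixed image and the boolean grid marking infrared cells.
    """
    if visible.shape != infrared.shape:
        raise ValueError("images must have the same shape")
    height, width = visible.shape[:2]
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    rng = np.random.default_rng() if rng is None else rng

    rows, cols = height // patch, width // patch
    n_cells = rows * cols
    n_ir = int(round(ir_ratio * n_cells))

    # Pick which grid cells are filled from the infrared image.
    cell_mask = np.zeros(n_cells, dtype=bool)
    cell_mask[rng.choice(n_cells, size=n_ir, replace=False)] = True
    cell_mask = cell_mask.reshape(rows, cols)

    # Expand the cell grid to pixel resolution and select per pixel.
    pixel_mask = np.repeat(np.repeat(cell_mask, patch, axis=0), patch, axis=1)
    mixed = np.where(pixel_mask[..., None], infrared, visible)
    return mixed, cell_mask


if __name__ == "__main__":
    vis = np.zeros((8, 8, 3))
    ir = np.ones((8, 8, 3))
    mixed, mask = patch_mix(vis, ir, patch=4, ir_ratio=0.5,
                            rng=np.random.default_rng(0))
    print(mask)
```

Varying `ir_ratio` between 0 and 1 changes how many patches come from each modality, which corresponds to the adjustable ratio the abstract mentions for alleviating modality imbalance.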