End-to-End Deep Kronecker-Product Matching for Person Re-identification
Person re-identification aims to robustly measure similarities between person
images. The significant variation of person poses and viewing angles makes
accurate person re-identification challenging. The spatial layout and
correspondences
between query person images are vital information for tackling this problem but
are ignored by most state-of-the-art methods. In this paper, we propose a novel
Kronecker Product Matching module to match feature maps of different persons in
an end-to-end trainable deep neural network. A novel feature soft warping
scheme is designed for aligning the feature maps based on matching results,
which is shown to be crucial for achieving superior accuracy. The multi-scale
features based on hourglass-like networks and self-residual attention are also
exploited to further boost the re-identification performance. The proposed
approach outperforms state-of-the-art methods on the Market-1501, CUHK03, and
DukeMTMC datasets, which demonstrates the effectiveness and generalization
ability of our proposed approach.
Comment: CVPR 2018 poster
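The matching idea above — comparing every spatial position of one feature map with every position of another, then soft-warping the second map into the first map's layout — can be sketched in NumPy. This is an illustrative sketch, not the paper's exact module; the function name, the temperature parameter, and the softmax-based warping details are assumptions:

```python
import numpy as np

def kronecker_product_matching(x, y, temperature=1.0):
    """Match every spatial position of feature map x against every
    position of feature map y, then soft-warp y toward x's layout.
    x, y: arrays of shape (c, h, w). Illustrative sketch only."""
    c, h, w = x.shape
    xf = x.reshape(c, h * w)          # (c, N) flattened positions
    yf = y.reshape(c, h * w)          # (c, N)
    # N x N similarity matrix: dot product between all position pairs
    sim = xf.T @ yf                   # (N, N)
    # softmax over y's positions -> soft correspondence weights
    e = np.exp(sim / temperature - sim.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)
    # soft-warped y features, aligned to x's spatial layout
    warped = (attn @ yf.T).T.reshape(c, h, w)
    return sim.reshape(h, w, h, w), warped
```

Because each warped position is a convex combination of y's positions, the warped map stays within y's value range per channel.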
Self Attention Grid for Person Re-Identification
In this paper, we present an attention mechanism scheme to improve the person
re-identification task. Inspired by biology, we propose Self Attention Grid
(SAG) to discover the most informative parts from a high-resolution image using
its internal representation. In particular, given an input image, the proposed
model is fed with two copies of the same image and consists of two branches.
The upper branch processes the high-resolution image and learns a
high-dimensional feature representation, while the lower branch processes the
low-resolution image and learns a filtering attention grid. We apply a
max-filter operation to non-overlapping sub-regions of the high-resolution
feature representation before it is element-wise multiplied with the output of the second
branch. The feature maps of the second branch are subsequently weighted to
reflect the importance of each patch of the grid using a softmax operation. Our
attention module helps the network learn the most discriminative visual
features of multiple image regions and is specifically optimized to attend
feature representation at different levels. Extensive experiments on three
large-scale datasets show that our self-attention mechanism significantly
improves the baseline model and outperforms various state-of-the-art models by a
large margin.
Comment: 10 pages, 4 figures, under review
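The two-branch mechanism above — max-filtering non-overlapping sub-regions of the high-resolution features, gating them with the low-resolution branch's grid, and softmax-normalizing the result — can be sketched as follows. The function names and the per-channel softmax are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def max_filter(feat, k):
    """Max-pool non-overlapping k x k sub-regions of a (c, h, w) map."""
    c, h, w = feat.shape
    return feat.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def self_attention_grid(high_feat, low_grid):
    """Sketch of the attention grid. high_feat: (c, H, W) features from
    the high-resolution branch; low_grid: (c, h, w) grid from the
    low-resolution branch, with H = k*h, W = k*w."""
    c, H, W = high_feat.shape
    _, h, w = low_grid.shape
    k = H // h
    pooled = max_filter(high_feat, k)        # (c, h, w) max-filtered regions
    gated = pooled * low_grid                # element-wise gating by the grid
    # softmax over grid patches so weights reflect patch importance
    flat = gated.reshape(c, -1)
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)
    return weights.reshape(c, h, w)
```

The softmax guarantees the per-channel patch weights sum to one, so the grid acts as a distribution over image regions.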
FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification
Person re-identification (reID) is an important task that requires retrieving
a person's images from an image dataset, given one image of the person
of interest. For learning robust person features, the pose variation of person
images is one of the key challenges. Existing works targeting the problem
either perform human alignment, or learn human-region-based representations.
Extra pose information and computational cost are generally required for
inference. To solve this issue, a Feature Distilling Generative Adversarial
Network (FD-GAN) is proposed for learning identity-related and pose-unrelated
representations. It is a novel framework based on a Siamese structure with
multiple novel discriminators on human poses and identities. In addition to the
discriminators, a novel same-pose loss is also integrated, which requires
appearance of a same person's generated images to be similar. After learning
pose-unrelated person features with pose guidance, no auxiliary pose
information or additional computational cost is required during testing. Our
proposed FD-GAN achieves state-of-the-art performance on three person reID
datasets, which demonstrates the effectiveness and robust feature
distilling capability of the proposed FD-GAN.
Comment: Accepted in Proceedings of 32nd Conference on Neural Information
Processing Systems (NeurIPS 2018). Code available:
https://github.com/yxgeee/FD-GAN
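The same-pose loss mentioned above requires the generated images of a same person, under the same target pose, to look alike. A minimal sketch of such a loss, assuming a simple mean-L1 form between the two generated images (the exact formulation in the paper may differ):

```python
import numpy as np

def same_pose_loss(gen_a, gen_b):
    """Mean L1 distance between two images generated for the same
    identity under the same target pose; minimizing it encourages the
    two generations to look alike. The L1 form is an assumption."""
    return float(np.abs(gen_a - gen_b).mean())
```

Identical generations yield zero loss; any appearance discrepancy increases it proportionally.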
Person Re-identification in Videos by Analyzing Spatio-Temporal Tubes
Typical person re-identification frameworks search for the k best matches in a
gallery of images that are often collected in varying conditions. The gallery
may contain image sequences when re-identification is done on videos. However,
such a process is time consuming as re-identification has to be carried out
multiple times. In this paper, we extract spatio-temporal sequences of frames
(referred to as tubes) of moving persons and apply multi-stage processing to
match a given query tube with a gallery of stored tubes recorded through other
cameras. Initially, we apply a binary classifier to remove noisy images from
the input query tube. In the next step, we use a key-pose detection-based query
minimization. This reduces the length of the query tube by removing redundant
frames. Finally, a 3-stage hierarchical re-identification framework is used to
rank the output tubes as per the matching scores. Experiments with publicly
available video re-identification datasets reveal that our framework is better
than state-of-the-art methods. It ranks the tubes with an increased CMC
accuracy of 6-8% across multiple datasets. Also, our method significantly
reduces the number of false positives. A new video re-identification dataset,
named Tube-based Reidentification Video Dataset (TRiViD), has been prepared
with an aim to help the re-identification research community.
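The multi-stage pipeline described above — noise filtering, key-pose query minimization, then hierarchical re-ranking — can be sketched as a simple composition. All callables here (`is_clean`, `key_poses`, the ranking stages) are hypothetical stand-ins for the paper's learned components:

```python
def reidentify_tube(query_tube, gallery_tubes, is_clean, key_poses, stages):
    """Sketch of the tube-matching pipeline; the callables stand in
    for the learned classifier, key-pose selector, and stage scorers."""
    # Stage 0: drop noisy frames flagged by a binary classifier
    tube = [f for f in query_tube if is_clean(f)]
    # Stage 1: key-pose based query minimization removes redundant frames
    tube = key_poses(tube)
    # Stage 2: hierarchical re-ranking; each stage rescores the survivors
    scored = [(g, 0.0) for g in gallery_tubes]
    for stage in stages:
        scored = sorted(((g, stage(tube, g)) for g, _ in scored),
                        key=lambda t: t[1], reverse=True)
    return scored
```

With toy stand-ins (integers as "frames" and "tubes"), the gallery tube closest to the minimized query ranks first.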
Deep Group-shuffling Random Walk for Person Re-identification
Person re-identification aims at finding a person of interest in an image
gallery by comparing the probe image of this person with all the gallery
images. It is generally treated as a retrieval problem, where the affinities
between the probe image and gallery images (P2G affinities) are used to rank
the retrieved gallery images. However, most existing methods only consider P2G
affinities but ignore the affinities between all the gallery images (G2G
affinity). Some frameworks incorporated G2G affinities into the testing
process, which is not end-to-end trainable for deep neural networks. In this
paper, we propose a novel group-shuffling random walk network for fully
utilizing the affinity information between gallery images in both the training
and testing processes. The proposed approach refines the P2G affinities
end-to-end based on G2G affinity information with a simple yet effective
matrix operation, which can be integrated into deep neural networks. Feature
grouping and group shuffle are also proposed to apply rich supervisions for
learning better person features. The proposed approach outperforms
state-of-the-art methods on the Market-1501, CUHK03, and DukeMTMC datasets by
large margins, which demonstrates the effectiveness of our approach.
Comment: CVPR 2018 poster
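The core matrix operation above — refining P2G affinities with G2G affinities — can be illustrated with a plain random-walk update, where each gallery score is repeatedly mixed with the scores of its gallery neighbours. The mixing weight and iteration count are illustrative choices, not the paper's values:

```python
import numpy as np

def refine_p2g(p2g, g2g, alpha=0.5, iters=10):
    """Refine probe-to-gallery affinities (p2g, shape (m, n)) using
    gallery-to-gallery affinities (g2g, shape (n, n)) via a random
    walk. Sketch only; alpha/iters are illustrative."""
    g = g2g.copy()
    np.fill_diagonal(g, 0.0)                 # no self-transitions
    g = g / g.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    p = p2g.copy()
    for _ in range(iters):
        # one walk step mixes each gallery score with its neighbours'
        p = alpha * (p @ g) + (1 - alpha) * p2g
    return p
```

Intuitively, a gallery image strongly connected to a high-scoring match inherits part of that score, while isolated images do not.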
Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association
Person re-identification is an important task that requires learning
discriminative visual features for distinguishing different person identities.
Diverse auxiliary information has been utilized to improve the visual feature
learning. In this paper, we propose to exploit natural language description as
additional training supervisions for effective visual features. Compared with
other auxiliary information, language can describe a specific person from more
compact and semantic visual aspects, and is thus complementary to the pixel-level
image data. Our method not only learns a better global visual feature with the
supervision of the overall description but also enforces semantic consistencies
between local visual and linguistic features, which is achieved by building
global and local image-language associations. The global image-language
association is established according to the identity labels, while the local
association is based upon the implicit correspondences between image regions
and noun phrases. Extensive experiments demonstrate the effectiveness of
employing language as training supervisions with the two association schemes.
Our method achieves state-of-the-art performance without utilizing any
auxiliary information during testing and shows better performance than other
joint embedding methods for the image-language association.
Comment: ECCV
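The global image-language association above can be illustrated with a toy embedding loss: pull an image feature toward the feature of a description sharing its identity, and push it away otherwise. The hinge form and margin are assumptions for illustration; the paper builds its associations from identity labels and noun-phrase correspondences:

```python
import numpy as np

def global_association_loss(img_feat, txt_feat, same_identity, margin=0.2):
    """Toy global association loss on cosine similarity. Matching
    identity pairs are pulled together; non-matching pairs are pushed
    below a margin. Sketch only; form and margin are assumptions."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    s = cos(img_feat, txt_feat)
    return (1.0 - s) if same_identity else max(0.0, s - margin)
```

A matched pair with identical directions costs nothing; a mismatched pair is penalized only when its similarity exceeds the margin.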
TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification
Transformers have recently gained increasing attention in computer vision.
However, existing studies mostly use Transformers for feature representation
learning, e.g. for image classification and dense predictions, and the
generalizability of Transformers is unknown. In this work, we further
investigate the possibility of applying Transformers for image matching and
metric learning given pairs of images. We find that the Vision Transformer
(ViT) and the vanilla Transformer with decoders are not adequate for image
matching due to their lack of image-to-image attention. Thus, we further design
two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery
cross-attention in the vanilla Transformer. The latter improves the
performance, but it is still limited. This implies that the attention mechanism
in Transformers is primarily designed for global feature aggregation, which is
not naturally suitable for image matching. Accordingly, we propose a new
simplified decoder, which drops the full attention implementation with the
softmax weighting, keeping only the query-key similarity computation.
Additionally, global max pooling and a multilayer perceptron (MLP) head are
applied to decode the matching result. This way, the simplified decoder is
computationally more efficient, while at the same time more effective for image
matching. The proposed method, called TransMatcher, achieves state-of-the-art
performance in generalizable person re-identification, with up to 6.1% and 5.7%
performance gains in Rank-1 and mAP, respectively, on several popular datasets.
Code is available at https://github.com/ShengcaiLiao/QAConv.
Comment: Accepted by NeurIPS 2021
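The simplified decoder described above — query-key similarity without softmax attention, global max pooling, then an MLP head — can be sketched in NumPy. The MLP shapes and weights here are illustrative placeholders, not the published architecture:

```python
import numpy as np

def transmatcher_decoder(q_feat, g_feat, w1, b1, w2, b2):
    """Sketch of a simplified matching decoder in the spirit of
    TransMatcher: query-key similarity only (no softmax attention),
    global max pooling over key positions, then a small MLP head.
    q_feat, g_feat: (N, d) token features of the two images."""
    sim = q_feat @ g_feat.T                     # (N, N) query-key similarities
    pooled = sim.max(axis=1)                    # global max pool over gallery tokens
    hidden = np.maximum(pooled @ w1 + b1, 0.0)  # ReLU layer of the MLP head
    return float(hidden @ w2 + b2)              # scalar matching score
```

Dropping the softmax keeps only the raw similarity computation, which is the part that actually carries matching signal here.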
Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification
Recent studies show that both explicit deep feature matching and
large-scale and diverse training data can significantly improve the
generalization of person re-identification. However, the efficiency of learning
deep matchers on large-scale data has not yet been adequately studied. Though
learning with classification parameters or class memory is a popular way, it
incurs large memory and computational costs. In contrast, pairwise deep metric
learning within mini batches would be a better choice. However, the most
popular random sampling method, the well-known PK sampler, is not informative
and efficient for deep metric learning. Though online hard example mining has
improved the learning efficiency to some extent, the mining in mini batches
after random sampling is still limited. This inspires us to explore the use of
hard example mining earlier, in the data sampling stage. To do so, in this
paper, we propose an efficient mini-batch sampling method, called graph
sampling (GS), for large-scale deep metric learning. The basic idea is to build
a nearest neighbor relationship graph for all classes at the beginning of each
epoch. Then, each mini batch is composed of a randomly selected class and its
nearest neighboring classes so as to provide informative and challenging
examples for learning. Together with an adapted competitive baseline, we
improve the previous state of the art in generalizable person re-identification
significantly, by up to 24% in Rank-1 and 13.8% in mAP. Besides, the proposed
method also outperforms the competitive baseline by up to 6.2% in Rank-1 and
5.3% in mAP. Meanwhile, the training time is significantly reduced by up to
five times, e.g. from 12.2 hours to 2.3 hours when training on a large-scale
dataset with 8,000 identities. Code is available at
https://github.com/ShengcaiLiao/QAConv
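The graph sampling idea above — a nearest-neighbour graph over classes built once per epoch, with each mini batch formed from one random class plus its nearest neighbouring classes — can be sketched as follows. Using class-centre Euclidean distances for the graph is an assumption for illustration:

```python
import numpy as np

def graph_sample_batches(class_centers, classes_per_batch, rng):
    """Graph sampling (GS) sketch: build a nearest-neighbour graph over
    class centres, then form each mini batch from one randomly chosen
    anchor class plus its nearest neighbouring classes."""
    n = len(class_centers)
    # pairwise Euclidean distances between class centres
    d = np.linalg.norm(class_centers[:, None] - class_centers[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a class is not its own neighbour
    neighbours = np.argsort(d, axis=1)   # nearest-first neighbour lists
    order = rng.permutation(n)           # visit every class once per epoch
    batches = []
    for anchor in order:
        batch = [anchor] + list(neighbours[anchor, :classes_per_batch - 1])
        batches.append(batch)
    return batches
```

Each batch therefore pairs an anchor class with its hardest (closest) classes, moving hard example mining into the sampling stage.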
Learning Incremental Triplet Margin for Person Re-identification
Person re-identification (ReID) aims to match people across multiple
non-overlapping video cameras deployed at different locations. To address this
challenging problem, many metric learning approaches have been proposed, among
which triplet loss is one of the most effective. In this work, we explore
the margin between positive and negative pairs of triplets and show that a large
margin is beneficial. In particular, we propose a novel multi-stage training
strategy which learns incremental triplet margin and improves triplet loss
effectively. Multiple levels of feature maps are exploited to make the learned
features more discriminative. Besides, we introduce a global hard identity
searching method to sample hard identities when generating a training batch.
Extensive experiments on Market-1501, CUHK03, and DukeMTMCreID show that our
approach yields a performance boost and outperforms most existing
state-of-the-art methods.
Comment: accepted by AAAI19 as spotlight
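The incremental-margin idea above can be sketched with a standard triplet loss plus a per-stage margin schedule. The linear schedule is an assumption; the paper's exact increments may differ:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin):
    """Standard triplet loss: positive pair must be closer than the
    negative pair by at least the margin."""
    d_ap = np.linalg.norm(anchor - pos)
    d_an = np.linalg.norm(anchor - neg)
    return float(max(0.0, d_ap - d_an + margin))

def incremental_margins(base, step, stages):
    """One margin per training stage, grown incrementally
    (the linear schedule here is an assumption)."""
    return [base + step * s for s in range(stages)]
```

Training then proceeds stage by stage, each stage reusing the previous weights but demanding a larger separation between positives and negatives.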
An End-to-End Foreground-Aware Network for Person Re-Identification
Person re-identification is a crucial task of identifying pedestrians of
interest across multiple surveillance camera views. In person
re-identification, a pedestrian is usually represented with features extracted
from a rectangular image region that inevitably contains the scene background,
which incurs ambiguity to distinguish different pedestrians and degrades the
accuracy. To this end, we propose an end-to-end foreground-aware network to
discriminate foreground from background by learning a soft mask for person
re-identification. In our method, in addition to the pedestrian ID as
supervision for foreground, we introduce the camera ID of each pedestrian image
for background modeling. The foreground branch and the background branch are
optimized collaboratively. By introducing a target attention loss, the
pedestrian features extracted from the foreground branch become more
insensitive to the backgrounds, which greatly reduces the negative impact of
changing backgrounds on matching an identical pedestrian across different camera views.
Notably, in contrast to existing methods, our approach does not require any
additional dataset to train a human landmark detector or a segmentation model
for locating the background regions. The experimental results conducted on
three challenging datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17,
demonstrate the effectiveness of our approach.
Comment: Accepted to IEEE Transactions on Image Processing (TIP), 2021
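The soft-mask mechanism above — suppressing background responses before pooling a person descriptor — can be sketched with a masked average pooling. This shows only how a learned soft mask would weight features, not the paper's full two-branch head:

```python
import numpy as np

def foreground_weighted_features(feat, mask):
    """Weight a (c, h, w) feature map by a soft foreground mask in
    [0, 1] of shape (h, w), then average-pool to a descriptor.
    Sketch of soft-mask usage, not the paper's exact head."""
    weighted = feat * mask[None]                 # suppress background responses
    denom = mask.sum() + 1e-6                    # normalize by foreground area
    return weighted.sum(axis=(1, 2)) / denom     # masked average pooling
```

With a hard 0/1 mask this reduces to averaging only the foreground pixels, so background clutter no longer contributes to the descriptor.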