End-to-End Comparative Attention Networks for Person Re-identification
Person re-identification across disjoint camera views has been widely applied
in video surveillance yet it is still a challenging problem. One of the major
challenges lies in the lack of spatial and temporal cues, which makes it
difficult to deal with large variations of lighting conditions, viewing angles,
body poses and occlusions. Recently, several deep learning based person
re-identification approaches have been proposed and achieved remarkable
performance. However, most of those approaches extract discriminative features
from the whole frame at one glimpse without differentiating various parts of
the persons to identify. It is essential to examine multiple highly
discriminative local regions of the person images in detail through multiple
glimpses for dealing with the large appearance variance. In this paper, we
propose a new soft attention based model, i.e., the end-to-end Comparative
Attention Network (CAN), specifically tailored for the task of person
re-identification. The end-to-end CAN learns to selectively focus on parts of
pairs of person images after taking a few glimpses of them and adaptively
comparing their appearance. The CAN model is able to learn which parts of
images are relevant for discerning persons and automatically integrates
information from different parts to determine whether a pair of images belongs
to the same person. In other words, our proposed CAN model simulates the human
perception process to verify whether two images are from the same person.
Extensive experiments on three benchmark person re-identification datasets,
including CUHK01, CUHK03 and Market-1501, clearly demonstrate that our
proposed end-to-end CAN for person re-identification significantly outperforms
well-established baselines and offers new state-of-the-art performance.
Sparse Label Smoothing Regularization for Person Re-Identification
Person re-identification (re-id) is a cross-camera retrieval task which
establishes a correspondence between images of a person from multiple cameras.
Deep Learning methods have been successfully applied to this problem and have
achieved impressive results. However, these methods require a large amount of
labeled training data. Currently labeled datasets in person re-id are limited
in their scale and manual acquisition of such large-scale datasets from
surveillance cameras is a tedious and labor-intensive task. In this paper, we
propose a framework that performs intelligent data augmentation and assigns
partial smoothing labels to the generated data. Our approach first exploits the
clustering property of existing person re-id datasets to create groups of
similar objects that model cross-view variations. Each group is then used to
generate realistic images through adversarial training. Our aim is to emphasize
feature similarity between generated samples and the original samples. Finally,
we assign a non-uniform label distribution to the generated samples and define
a regularized loss function for training. The proposed approach tackles two
problems: (1) how to efficiently use the generated data and (2) how to address
the over-smoothness problem found in current regularization methods. Extensive
experiments on four large-scale datasets show that our regularization method
significantly improves the Re-ID accuracy compared to existing methods.
Comment: 13 pages, 6 figures
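The non-uniform label distribution described above can be illustrated as a cross-entropy against a soft target. This is a minimal sketch in plain Python under stated assumptions: the abstract gives no formula, so the function names, the `cluster_classes` argument, and the choice to spread label mass evenly over the identities of the generating cluster are hypothetical and may differ from the authors' loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_smooth_target(num_classes, cluster_classes):
    """Hypothetical soft label: uniform mass over the identities in the
    cluster that produced the generated image, zero for all others."""
    members = set(cluster_classes)
    return [1.0 / len(members) if k in members else 0.0
            for k in range(num_classes)]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

# A generated sample drawn from a cluster containing identities 1 and 3.
target = sparse_smooth_target(4, [1, 3])
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], target)
```

Compared with a uniform smoothing label over all identities, a target of this kind keeps the generated sample tied to the identities it was generated from, which is one way to read the "over-smoothness" concern raised in the abstract.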
Self Attention Grid for Person Re-Identification
In this paper, we present an attention mechanism scheme to improve the person
re-identification task. Inspired by biology, we propose Self Attention Grid
(SAG) to discover the most informative parts from a high-resolution image using
its internal representation. In particular, given an input image, the proposed
model is fed with two copies of the same image and consists of two branches.
The upper branch processes the high-resolution image and learns a
high-dimensional feature representation, while the lower branch processes the
low-resolution image and learns a filtering attention grid. We apply a max
filter operation to non-overlapping sub-regions of the high-dimensional feature
representation before multiplying it element-wise with the output of the second
branch. The feature maps of the second branch are subsequently weighted to
reflect the importance of each patch of the grid using a softmax operation. Our
attention module helps the network learn the most discriminative visual
features of multiple image regions and is specifically optimized to attend to
feature representations at different levels. Extensive experiments on three
large-scale datasets show that our self-attention mechanism significantly
improves the baseline model and outperforms various state-of-the-art models by a
large margin.
Comment: 10 pages, 4 figures, under review
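The grid operation in this abstract (a max filter over non-overlapping sub-regions, followed by element-wise multiplication with a softmax-weighted grid) can be sketched on a single-channel map. This is an illustrative sketch only: the function names are hypothetical, the map dimensions are assumed to be divisible by `block`, and the real model works on multi-channel learned feature maps.

```python
import math

def max_pool_grid(feature, block):
    """Max over non-overlapping block x block sub-regions of a 2-D map.
    Assumes both dimensions are divisible by `block`."""
    rows, cols = len(feature), len(feature[0])
    return [
        [
            max(feature[r + dr][c + dc]
                for dr in range(block) for dc in range(block))
            for c in range(0, cols, block)
        ]
        for r in range(0, rows, block)
    ]

def softmax_grid(scores):
    """Softmax taken jointly over every cell of a 2-D score grid."""
    flat = [s for row in scores for s in row]
    m = max(flat)
    total = sum(math.exp(s - m) for s in flat)
    return [[math.exp(s - m) / total for s in row] for row in scores]

def apply_attention_grid(feature, scores, block):
    """Pool the high-resolution map, then weight each cell element-wise
    by the softmax-normalised attention grid."""
    pooled = max_pool_grid(feature, block)
    weights = softmax_grid(scores)
    return [[p * w for p, w in zip(prow, wrow)]
            for prow, wrow in zip(pooled, weights)]

# 4x4 "high-resolution" map and a 2x2 score grid from the second branch.
feature = [[1, 2, 0, 0], [3, 4, 0, 1], [0, 0, 5, 6], [0, 2, 7, 8]]
scores = [[0.0, 0.0], [0.0, 0.0]]
attended = apply_attention_grid(feature, scores, block=2)
```

With equal scores every grid cell receives the same weight; once the second branch produces unequal scores, cells covering more informative patches contribute more to the attended map.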
Learning Context Graph for Person Search
Person re-identification has achieved great progress with deep convolutional
neural networks. However, most previous methods focus on learning individual
appearance feature embedding, and it is hard for the models to handle difficult
situations with different illumination, large pose variance and occlusion. In
this work, we take a step further and consider employing context information
for person search. For a probe-gallery pair, we first propose a contextual
instance expansion module, which employs a relative attention module to search
and filter useful context information in the scene. We also build a graph
learning framework to effectively employ context pairs to update target
similarity. These two modules are built on top of a joint detection and
instance feature learning framework, which improves the discriminativeness of
the learned features. The proposed framework achieves state-of-the-art
performance on two widely used person search datasets.
Comment: To appear in CVPR 201
Harmonious Attention Network for Person Re-Identification
Existing person re-identification (re-id) methods either assume the
availability of well-aligned person bounding box images as model input or rely
on constrained attention selection mechanisms to calibrate misaligned images.
They are therefore sub-optimal for re-id matching in arbitrarily aligned person
images potentially with large human pose variations and unconstrained
auto-detection errors. In this work, we show the advantages of jointly learning
attention selection and feature representation in a Convolutional Neural
Network (CNN) by maximising the complementary information of different levels
of visual attention subject to re-id discriminative learning constraints.
Specifically, we formulate a novel Harmonious Attention CNN (HA-CNN) model for
joint learning of soft pixel attention and hard regional attention along with
simultaneous optimisation of feature representations, dedicated to optimising
person re-id in uncontrolled (misaligned) images. Extensive comparative
evaluations validate the superiority of this new HA-CNN model for person re-id
over a wide variety of state-of-the-art methods on three large-scale benchmarks
including CUHK03, Market-1501, and DukeMTMC-ReID.
Comment: Accepted in CVPR 201
Deep Co-attention based Comparators For Relative Representation Learning in Person Re-identification
Person re-identification (re-ID) requires rapid, flexible yet discriminant
representations to quickly generalize to unseen observations on-the-fly and
recognize the same identity across disjoint camera views. Recent effective
methods are developed in a pair-wise similarity learning system to detect a
fixed set of features from distinct regions which are mapped to their vector
embeddings for the distance measuring. However, the most relevant and crucial
parts of each image are detected independently, without reference to their
dependency on one another. Also, these region-based methods
rely on spatial manipulation to position the local features for comparable
similarity measuring. To combat these limitations, in this paper we introduce
the Deep Co-attention based Comparators (DCCs) that fuse the co-dependent
representations of the paired images so as to focus on the relevant parts of
both images and produce their relative representations. Given a pair
of pedestrian images to be compared, the proposed model mimics the foveation of
human eyes to detect distinct regions concurrently in both images, namely
co-dependent features, and alternately attends to relevant regions to fuse
them into the similarity learning. Our comparator is capable of producing
dynamic representations relative to a particular sample every time, and is thus
well suited to the case of re-identifying pedestrians on-the-fly. We perform
extensive experiments to provide the insights and demonstrate the effectiveness
of the proposed DCCs in person re-ID. Moreover, our approach achieves
state-of-the-art performance on three benchmark data sets: DukeMTMC-reID,
CUHK03, and Market-1501.
Deep-Person: Learning Discriminative Deep Features for Person Re-Identification
Recently, many person re-identification (Re-ID) methods rely on part-based
feature representations to learn a discriminative pedestrian descriptor.
However, the spatial context between these parts is ignored because each part
is processed by an independent extractor. In this paper, we propose to apply Long
Short-Term Memory (LSTM) in an end-to-end way to model the pedestrian, seen as
a sequence of body parts from head to foot. Integrating the contextual
information strengthens the discriminative ability of local representation. We
also leverage the complementary information between local and global features.
Furthermore, we integrate both identification task and ranking task in one
network, where a discriminative embedding and a similarity measurement are
learned concurrently. This results in a novel three-branch framework named
Deep-Person, which learns highly discriminative features for person Re-ID.
Experimental results demonstrate that Deep-Person outperforms the
state-of-the-art methods by a large margin on three challenging datasets
including Market-1501, CUHK03, and DukeMTMC-reID. Specifically, combined with
a re-ranking approach, we achieve a 90.84% mAP on Market-1501 under the
single-query setting.
Comment: Accepted to Pattern Recognition. The code is released:
https://github.com/zydou/Deep-Perso
Re-Identification with Consistent Attentive Siamese Networks
We propose a new deep architecture for person re-identification (re-id).
While re-id has seen much recent progress, spatial localization and
view-invariant representation learning for robust cross-view matching remain
key, unsolved problems. We address these questions by means of a new
attention-driven Siamese learning architecture, called the Consistent Attentive
Siamese Network. Our key innovations compared to existing, competing methods
include (a) a flexible framework design that produces attention with only
identity labels as supervision, (b) explicit mechanisms to enforce attention
consistency among images of the same person, and (c) a new Siamese framework
that integrates attention and attention consistency, producing principled
supervisory signals as well as the first mechanism that can explain the
reasoning behind the Siamese framework's predictions. We conduct extensive
evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets and
report competitive performance.
Comment: 10 pages, 8 figures, 3 tables, to appear in CVPR 201
Three-Stream Convolutional Networks for Video-based Person Re-Identification
This paper aims to develop a new architecture that can make full use of the
feature maps of convolutional networks. To this end, we study a number of
methods for video-based person re-identification and make the following
findings: 1) Max-pooling only focuses on the maximum value of a receptive
field, wasting a lot of information. 2) Networks with different streams, even
including the one with the worst performance, work better than networks with
identical streams, each of which has the best performance alone. 3) A fully
connected layer at the end of convolutional networks is not necessary. Based
on these studies, we propose a new convolutional architecture termed
Three-Stream Convolutional Networks (TSCN). It first uses different streams to
learn different aspects of feature maps for attentive spatio-temporal fusion of
video, and then merges them together to study some union features. To further
utilize the feature maps, two architectures are designed by using the
strategies of multi-scale and upsampling. Comparative experiments on iLIDS-VID,
PRID-2011 and MARS datasets illustrate that the proposed architectures are
significantly better for feature extraction than the state-of-the-art models.
Deeply-Learned Part-Aligned Representations for Person Re-Identification
In this paper, we address the problem of person re-identification, which
refers to associating the persons captured from different cameras. We propose a
simple yet effective human part-aligned representation for handling the body
part misalignment problem. Our approach decomposes the human body into regions
(parts) which are discriminative for person matching, accordingly computes the
representations over the regions, and aggregates the similarities computed
between the corresponding regions of a pair of probe and gallery images as the
overall matching score. Our formulation, inspired by attention models, is a
deep neural network modeling the three steps together, which is learnt through
minimizing the triplet loss function without requiring body part labeling
information. Unlike most existing deep learning algorithms that learn a global
or spatial partition-based local representation, our approach performs human
body partition, and thus is more robust to pose changes and various human
spatial distributions in the person bounding box. Our approach shows
state-of-the-art results over standard datasets, Market-, CUHK,
CUHK and VIPeR.
Comment: Accepted by ICCV 201
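The matching score and training objective described in this abstract (per-part similarities aggregated into one score, trained with a triplet loss) can be sketched as follows. This is a sketch under assumptions, not the authors' formulation: the dot-product similarity, the hinge form with a `margin` of 1.0, and the function names are all hypothetical, and the part vectors here are hand-written rather than produced by a network.

```python
def part_similarity(parts_a, parts_b):
    """Aggregate score: sum of dot products between corresponding
    part vectors of two images (one vector per body region)."""
    return sum(
        sum(x * y for x, y in zip(pa, pb))
        for pa, pb in zip(parts_a, parts_b)
    )

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on similarities: the anchor should be
    more similar to the positive than to the negative by `margin`."""
    pos = part_similarity(anchor, positive)
    neg = part_similarity(anchor, negative)
    return max(0.0, margin - pos + neg)

# Two parts per image, two dimensions per part.
anchor = [[1, 0], [0, 1]]
positive = [[1, 0], [0, 1]]   # same identity, matching parts
negative = [[0, 1], [1, 0]]   # different identity
loss_ok = triplet_loss(anchor, positive, negative)
loss_bad = triplet_loss(anchor, negative, positive)
```

The loss is zero when the positive pair already scores higher than the negative pair by at least the margin, and grows when the ordering is violated; only identity labels are needed to form the triplets, which matches the abstract's point that no body-part annotation is required.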