2,241 research outputs found
3D PersonVLAD: Learning Deep Global Representations for Video-based Person Re-identification
In this paper, we introduce a global video representation to video-based
person re-identification (re-ID) that aggregates local 3D features across the
entire video extent. Most of the existing methods rely on 2D convolutional
networks (ConvNets) to extract frame-wise deep features which are pooled
temporally to generate the video-level representations. However, 2D ConvNets
lose temporal input information immediately after the convolution, and a
separate temporal pooling is limited in capturing human motion in shorter
sequences. To this end, we present a \textit{global} video representation (3D
PersonVLAD), complementary to 3D ConvNets as a novel layer to capture the
appearance and motion dynamics in full-length videos. However, encoding each
video frame in its entirety and computing an aggregate global representation
across all frames is tremendously challenging due to occlusions and
misalignments. To resolve this, our proposed network is further augmented with
3D part alignment module to learn local features through soft-attention module.
These attended features are statistically aggregated to yield
identity-discriminative representations. Our global 3D features are
demonstrated to achieve state-of-the-art results on three benchmark datasets:
MARS \cite{MARS}, iLIDS-VID \cite{VideoRanking}, and PRID 2011Comment: Accepted to appear at IEEE Transactions on Neural Networks and
Learning System
Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting
For person re-identification, existing deep networks often focus on
representation learning. However, without transfer learning, the learned model
is fixed as is, which is not adaptable for handling various unseen scenarios.
In this paper, beyond representation learning, we consider how to formulate
person image matching directly in deep feature maps. We treat image matching as
finding local correspondences in feature maps, and construct query-adaptive
convolution kernels on the fly to achieve local matching. In this way, the
matching process and results are interpretable, and this explicit matching is
more generalizable than representation features to unseen scenarios, such as
unknown misalignments, pose or viewpoint changes. To facilitate end-to-end
training of this architecture, we further build a class memory module to cache
feature maps of the most recent samples of each class, so as to compute image
matching losses for metric learning. Through direct cross-dataset evaluation,
the proposed Query-Adaptive Convolution (QAConv) method gains large
improvements over popular learning methods (about 10%+ mAP), and achieves
comparable results to many transfer learning methods. Besides, a model-free
temporal cooccurrence based score weighting method called TLift is proposed,
which improves the performance to a further extent, achieving state-of-the-art
results in cross-dataset person re-identification. Code is available at
https://github.com/ShengcaiLiao/QAConv.Comment: This is the ECCV 2020 version, including the appendi
Person Re-Identification by Discriminative Selection in Video Ranking
Current person re-identification (ReID) methods typically rely on single-frame imagery features, whilst ignoring space-time information from image sequences often available in the practical surveillance scenarios. Single-frame (single-shot) based visual appearance matching is inherently limited for person ReID in public spaces due to the challenging visual ambiguity and uncertainty arising from non-overlapping camera views where viewing condition changes can cause significant people appearance variations. In this work, we present a novel model to automatically select the most discriminative video fragments from noisy/incomplete image sequences of people from which reliable space-time and appearance features can be computed, whilst simultaneously learning a video ranking function for person ReID. Using the PRID2011, iLIDS-VID, and HDA+ image sequence datasets, we extensively conducted comparative evaluations to demonstrate the advantages of the proposed model over contemporary gait recognition, holistic image sequence matching and state-of-the-art single-/multi-shot ReID methods
Spontaneous Subtle Expression Detection and Recognition based on Facial Strain
Optical strain is an extension of optical flow that is capable of quantifying
subtle changes on faces and representing the minute facial motion intensities
at the pixel level. This is computationally essential for the relatively new
field of spontaneous micro-expression, where subtle expressions can be
technically challenging to pinpoint. In this paper, we present a novel method
for detecting and recognizing micro-expressions by utilizing facial optical
strain magnitudes to construct optical strain features and optical strain
weighted features. The two sets of features are then concatenated to form the
resultant feature histogram. Experiments were performed on the CASME II and
SMIC databases. We demonstrate on both databases, the usefulness of optical
strain information and more importantly, that our best approaches are able to
outperform the original baseline results for both detection and recognition
tasks. A comparison of the proposed method with other existing spatio-temporal
feature extraction approaches is also presented.Comment: 21 pages (including references), single column format, accepted to
Signal Processing: Image Communication journa
- …