3D PersonVLAD: Learning Deep Global Representations for Video-based Person Re-identification
In this paper, we introduce a global video representation for video-based
person re-identification (re-ID) that aggregates local 3D features across the
entire video extent. Most of the existing methods rely on 2D convolutional
networks (ConvNets) to extract frame-wise deep features which are pooled
temporally to generate the video-level representations. However, 2D ConvNets
lose the temporal structure of the input immediately after convolution, and a
separate temporal pooling step is limited in its ability to capture human
motion in short sequences. To this end, we present a \textit{global} video representation (3D
PersonVLAD), complementary to 3D ConvNets as a novel layer to capture the
appearance and motion dynamics in full-length videos. However, encoding each
video frame in its entirety and computing an aggregate global representation
across all frames is tremendously challenging due to occlusions and
misalignments. To resolve this, our proposed network is further augmented with
a 3D part alignment module that learns local features through soft attention.
These attended features are statistically aggregated to yield
identity-discriminative representations. Our global 3D features are
demonstrated to achieve state-of-the-art results on three benchmark datasets:
MARS \cite{MARS}, iLIDS-VID \cite{VideoRanking}, and PRID 2011.
Comment: Accepted to appear at IEEE Transactions on Neural Networks and
Learning Systems
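The VLAD-style aggregation at the heart of 3D PersonVLAD can be illustrated with a minimal NumPy sketch. This is an illustrative simplification, not the paper's implementation: `vlad_aggregate`, the plain dot-product soft assignment, and the fixed `centers` array stand in for components that the actual network learns end-to-end from 3D ConvNet features.

```python
import numpy as np

def vlad_aggregate(features, centers):
    """NetVLAD-style soft-assignment aggregation of local descriptors.

    features: (N, D) local descriptors (e.g. 3D-conv features over a video)
    centers:  (K, D) cluster centers
    returns:  (K*D,) L2-normalised global descriptor
    """
    # Soft assignment: softmax over similarity of each feature to each center.
    sim = features @ centers.T                       # (N, K)
    sim -= sim.max(axis=1, keepdims=True)            # numerical stability
    assign = np.exp(sim)
    assign /= assign.sum(axis=1, keepdims=True)      # (N, K)

    # Accumulate soft-weighted residuals of features w.r.t. each center.
    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        residuals = features - centers[k]            # (N, D)
        vlad[k] = (assign[:, k:k + 1] * residuals).sum(axis=0)

    # Intra-normalisation per center, then global L2 normalisation.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

Because all frames contribute residuals to shared centers, the output is a fixed-length descriptor regardless of video length, which is what makes it a *global* representation.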
Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video-based person Re-Identification (ReID). A temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN network. The resulting M3D convolution network
introduces only a small fraction of additional parameters into the 2D CNN, yet
gains the ability to learn multi-scale temporal features. With this compact
architecture, the M3D convolution network is also more efficient and easier to optimize than existing
3D convolution networks. The temporal stream further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from the two streams are finally fused for
video-based person ReID. Evaluations on three widely used benchmark datasets,
i.e., MARS, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of
our method over existing 3D convolution networks and state-of-the-art methods.
Comment: AAAI, 201
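The multi-scale temporal idea behind the M3D layer can be sketched as parallel dilated convolutions along the temporal axis, added residually to the per-frame (2D) features. This is a hedged NumPy sketch under simplifying assumptions: `dilated_temporal_conv`, `m3d_block`, and the hand-set kernels are illustrative stand-ins for the learned convolution layers described in the abstract.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """1D dilated convolution along the temporal axis with 'same' padding.

    x:      (T, D) per-frame feature vectors
    kernel: (k,) temporal filter weights, shared across feature channels
    """
    T, _ = x.shape
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    # Each tap reads frames spaced `dilation` steps apart.
    for i, w in enumerate(kernel):
        out += w * xp[i * dilation : i * dilation + T]
    return out

def m3d_block(x, kernels, dilations):
    """Multi-scale temporal block: parallel dilated temporal convolutions,
    summed and added residually to the per-frame features, so larger
    dilations capture longer-range motion at little extra cost."""
    out = x.copy()
    for kern, d in zip(kernels, dilations):
        out += dilated_temporal_conv(x, kern, d)
    return out
```

Sharing one small temporal kernel per dilation rate is what keeps the parameter overhead over the 2D backbone small while still covering multiple temporal ranges.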