Deep Fishing: Gradient Features from Deep Nets
Convolutional Networks (ConvNets) have recently improved image recognition
performance thanks to end-to-end learning of deep feed-forward models from raw
pixels. Deep learning is a marked departure from the previous state of the art,
the Fisher Vector (FV), which relied on gradient-based encoding of local
hand-crafted features. In this paper, we discuss a novel connection between
these two approaches. First, we show that one can derive gradient
representations from ConvNets in a similar fashion to the FV. Second, we show
that this gradient representation actually corresponds to a structured matrix
that allows for efficient similarity computation. We experimentally study the
benefits of transferring this representation over the outputs of ConvNet
layers, and find consistent improvements on the Pascal VOC 2007 and 2012
datasets.
Comment: To appear at BMVC 201
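The "structured matrix" remark can be illustrated with a small sketch. It assumes a single fully connected layer y = W x, for which the gradient of a scalar loss with respect to W is the rank-one outer product g x^T (g being the backpropagated signal at y). Under that assumption, the Frobenius inner product of two such gradients factorises into two ordinary dot products, so the full matrices never need to be built. The function names and toy vectors below are hypothetical and are not taken from the paper.

```python
# Sketch under stated assumptions: gradients of a linear layer are rank-one,
# so their similarity reduces to a product of two dot products.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def frobenius_inner(A, B):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(dot(row_a, row_b) for row_a, row_b in zip(A, B))

def gradient_similarity(g1, x1, g2, x2):
    """<g1 x1^T, g2 x2^T>_F computed as (g1 . g2) * (x1 . x2)."""
    return dot(g1, g2) * dot(x1, x2)

# Toy backpropagated signals (g) and layer inputs (x) for two images.
g1, x1 = [1.0, -2.0], [0.5, 1.0, 2.0]
g2, x2 = [3.0, 0.5], [1.0, 0.0, -1.0]

explicit = frobenius_inner(outer(g1, x1), outer(g2, x2))  # builds both matrices
fast = gradient_similarity(g1, x1, g2, x2)                # never builds them

print(explicit, fast)
```

For a layer with m outputs and n inputs, the explicit route costs on the order of m·n operations per pair, while the factorised route costs on the order of m + n.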
3D PersonVLAD: Learning Deep Global Representations for Video-based Person Re-identification
In this paper, we introduce a global video representation for video-based
person re-identification (re-ID) that aggregates local 3D features across the
entire video extent. Most of the existing methods rely on 2D convolutional
networks (ConvNets) to extract frame-wise deep features which are pooled
temporally to generate the video-level representations. However, 2D ConvNets
lose temporal input information immediately after the convolution, and a
separate temporal pooling is limited in capturing human motion in shorter
sequences. To this end, we present a global video representation (3D
PersonVLAD), complementary to 3D ConvNets as a novel layer to capture the
appearance and motion dynamics in full-length videos. However, encoding each
video frame in its entirety and computing an aggregate global representation
across all frames is tremendously challenging due to occlusions and
misalignments. To resolve this, our proposed network is further augmented with
a 3D part alignment module that learns local features through a soft-attention module.
These attended features are statistically aggregated to yield
identity-discriminative representations. Our global 3D features are
demonstrated to achieve state-of-the-art results on three benchmark datasets:
MARS \cite{MARS}, iLIDS-VID \cite{VideoRanking}, and PRID 2011.
Comment: Accepted to appear at IEEE Transactions on Neural Networks and
Learning System
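The abstract does not spell out how the attended features are "statistically aggregated". As a rough illustration of a VLAD-style global descriptor, the sketch below assumes soft assignment of each local feature to a set of cluster centres (in the spirit of NetVLAD) followed by residual accumulation and L2 normalisation. The names `soft_assign` and `vlad_aggregate`, the `alpha` parameter, and the toy data are hypothetical; the paper's actual layer may differ.

```python
import math

def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative squared distances from x to each centre."""
    scores = [-alpha * sum((a - c) ** 2 for a, c in zip(x, ck)) for ck in centers]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vlad_aggregate(features, centers, alpha=1.0):
    """Accumulate soft-weighted residuals per centre, then L2-normalise."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for x in features:
        weights = soft_assign(x, centers, alpha)
        for k, ck in enumerate(centers):
            for d in range(dim):
                residuals[k][d] += weights[k] * (x[d] - ck[d])
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy 2-D local features standing in for features pooled over a whole video.
centers = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.2], [0.5, 0.4]]

descriptor = vlad_aggregate(features, centers)
print(len(descriptor))
```

The descriptor length depends only on the number of centres and the feature dimension, not on the number of input features, which is what lets a single fixed-size vector summarise a video of any length.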