49,800 research outputs found
Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video based person Re-Identification (ReID). A temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN network. The resulting M3D convolution network
introduces a fraction of parameters into the 2D CNN, but gains the ability of
multi-scale temporal feature learning. With this compact architecture, M3D
convolution network is also more efficient and easier to optimize than existing
3D convolution networks. The temporal stream further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from two streams are finally fused for the video
based person ReID. Evaluations on three widely used benchmarks datasets, i.e.,
MARS, PRID2011, and iLIDS-VID demonstrate the substantial advantages of our
method over existing 3D convolution networks and state-of-art methods.Comment: AAAI, 201
Spatial and Temporal Mutual Promotion for Video-based Person Re-identification
Video-based person re-identification is a crucial task of matching video
sequences of a person across multiple camera views. Generally, features
directly extracted from a single frame suffer from occlusion, blur,
illumination and posture changes. This leads to false activation or missing
activation in some regions, which corrupts the appearance and motion
representation. How to explore the abundant spatial-temporal information in
video sequences is the key to solve this problem. To this end, we propose a
Refining Recurrent Unit (RRU) that recovers the missing parts and suppresses
noisy parts of the current frame's features by referring historical frames.
With RRU, the quality of each frame's appearance representation is improved.
Then we use the Spatial-Temporal clues Integration Module (STIM) to mine the
spatial-temporal information from those upgraded features. Meanwhile, the
multi-level training objective is used to enhance the capability of RRU and
STIM. Through the cooperation of those modules, the spatial and temporal
features mutually promote each other and the final spatial-temporal feature
representation is more discriminative and robust. Extensive experiments are
conducted on three challenging datasets, i.e., iLIDS-VID, PRID-2011 and MARS.
The experimental results demonstrate that our approach outperforms existing
state-of-the-art methods of video-based person re-identification on iLIDS-VID
and MARS and achieves favorable results on PRID-2011.Comment: Accepted by AAAI19 as spotligh
STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification
In this work, we propose a novel Spatial-Temporal Attention (STA) approach to
tackle the large-scale person re-identification task in videos. Different from
the most existing methods, which simply compute representations of video clips
using frame-level aggregation (e.g. average pooling), the proposed STA adopts a
more effective way for producing robust clip-level feature representation.
Concretely, our STA fully exploits those discriminative parts of one target
person in both spatial and temporal dimensions, which results in a 2-D
attention score matrix via inter-frame regularization to measure the
importances of spatial parts across different frames. Thus, a more robust
clip-level feature representation can be generated according to a weighted sum
operation guided by the mined 2-D attention score matrix. In this way, the
challenging cases for video-based person re-identification such as pose
variation and partial occlusion can be well tackled by the STA. We conduct
extensive experiments on two large-scale benchmarks, i.e. MARS and
DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, which
significantly outperforms the state-of-the-arts with a large margin of more
than 11.6%.Comment: Accepted as a conference paper at AAAI 201
OVSNet : Towards One-Pass Real-Time Video Object Segmentation
Video object segmentation aims at accurately segmenting the target object
regions across consecutive frames. It is technically challenging for coping
with complicated factors (e.g., shape deformations, occlusion and out of the
lens). Recent approaches have largely solved them by using backforth
re-identification and bi-directional mask propagation. However, their methods
are extremely slow and only support offline inference, which in principle
cannot be applied in real time. Motivated by this observation, we propose a
efficient detection-based paradigm for video object segmentation. We propose an
unified One-Pass Video Segmentation framework (OVS-Net) for modeling
spatial-temporal representation in a unified pipeline, which seamlessly
integrates object detection, object segmentation, and object re-identification.
The proposed framework lends itself to one-pass inference that effectively and
efficiently performs video object segmentation. Moreover, we propose a
maskguided attention module for modeling the multi-scale object boundary and
multi-level feature fusion. Experiments on the challenging DAVIS 2017
demonstrate the effectiveness of the proposed framework with comparable
performance to the state-of-the-art, and the great efficiency about 11.5 FPS
towards pioneering real-time work to our knowledge, more than 5 times faster
than other state-of-the-art methods.Comment: 10 pages, 6 figure
- …