STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification
In this work, we propose a novel Spatial-Temporal Attention (STA) approach to
tackle the large-scale person re-identification task in videos. Unlike most
existing methods, which simply compute representations of video clips
using frame-level aggregation (e.g. average pooling), the proposed STA adopts a
more effective way for producing robust clip-level feature representation.
Concretely, STA exploits the discriminative parts of a target person in both
the spatial and temporal dimensions, producing, via inter-frame
regularization, a 2-D attention score matrix that measures the importance of
spatial parts across different frames. Thus, a more robust
clip-level feature representation can be generated according to a weighted sum
operation guided by the mined 2-D attention score matrix. In this way, the
challenging cases for video-based person re-identification such as pose
variation and partial occlusion can be well tackled by the STA. We conduct
extensive experiments on two large-scale benchmarks, i.e. MARS and
DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, which
significantly outperforms the state of the art by a large margin of more
than 11.6%.
Comment: Accepted as a conference paper at AAAI 201
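The weighted-sum step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `clip_feature`, the toy inputs, and the choice to normalize each part's scores across frames by their sum are all assumptions made for illustration, and the abstract does not say how the raw scores are obtained. The sketch also assumes every part has at least one positive score.

```python
def clip_feature(part_feats, raw_scores):
    """Fuse per-frame part features into one clip-level vector.

    part_feats: T frames x K parts x D dims (nested lists).
    raw_scores: T x K non-negative attention scores (hypothetical input;
                each part needs at least one positive score).
    """
    T, K, D = len(part_feats), len(part_feats[0]), len(part_feats[0][0])
    # Normalize each part's scores across frames so every column sums to 1.
    col_sums = [sum(raw_scores[t][k] for t in range(T)) for k in range(K)]
    attn = [[raw_scores[t][k] / col_sums[k] for k in range(K)] for t in range(T)]
    # Weighted sum over frames, then concatenate the K fused part vectors.
    clip = []
    for k in range(K):
        for d in range(D):
            clip.append(sum(attn[t][k] * part_feats[t][k][d] for t in range(T)))
    return attn, clip


# Toy clip: 2 frames, 2 parts, 2-dim features.
feats = [
    [[1.0, 0.0], [2.0, 6.0]],
    [[3.0, 4.0], [9.0, 9.0]],
]
# Part 0 is scored equally in both frames; part 1 is treated as occluded in frame 2.
scores = [
    [1.0, 1.0],
    [1.0, 0.0],
]
attn, clip = clip_feature(feats, scores)
print(attn)  # [[0.5, 1.0], [0.5, 0.0]]
print(clip)  # [2.0, 2.0, 2.0, 6.0]
```

With uniform scores the result reduces to average pooling over frames, which is the baseline the abstract contrasts against; a zero score removes a frame's contribution for that part, as in the occluded part above.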
Joint Multi-Person Pose Estimation and Semantic Part Segmentation
Human pose estimation and semantic part segmentation are two complementary
tasks in computer vision. In this paper, we propose to solve the two tasks
jointly for natural multi-person images, in which the estimated pose provides
an object-level shape prior to regularize part segments while the part-level
segments constrain the variation of pose locations. Specifically, we first
train two fully convolutional neural networks (FCNs), namely Pose FCN and Part
FCN, to provide initial estimation of pose joint potential and semantic part
potential. Then, to refine pose joint location, the two types of potentials are
fused with a fully-connected conditional random field (FCRF), where a novel
segment-joint smoothness term is used to encourage semantic and spatial
consistency between parts and joints. To refine part segments, the refined pose
and the original part potential are integrated through a Part FCN, where the
skeleton feature from pose serves as additional regularization cues for part
segments. Finally, to reduce the complexity of the FCRF, we induce human
detection boxes and infer the graph inside each box, making the inference forty
times faster.
Since no existing dataset contains both part segments and pose labels, we
extend the PASCAL VOC part dataset with human pose joints and perform extensive
experiments to compare our method against several recent strategies. We
show that on this dataset our algorithm surpasses competing methods by a large
margin in both tasks.
Comment: This paper has been accepted by CVPR 201