10,887 research outputs found
Context-aware CNNs for person head detection
Person detection is a key problem for many computer vision tasks. While face
detection has reached maturity, detecting people under a full variation of
camera view-points, human poses, lighting conditions and occlusions is still a
difficult challenge. In this work we focus on detecting human heads in natural
scenes. Starting from the recent local R-CNN object detector, we extend it with
two types of contextual cues. First, we leverage person-scene relations and
propose a Global CNN model trained to predict positions and scales of heads
directly from the full image. Second, we explicitly model pairwise relations
among objects and train a Pairwise CNN model using a structured-output
surrogate loss. The Local, Global and Pairwise models are combined into a joint
CNN framework. To train and test our full model, we introduce a large dataset
composed of 369,846 human heads annotated in 224,740 movie frames. We evaluate
our method and demonstrate improvements of person head detection against
several recent baselines in three datasets. We also show improvements of the
detection speed provided by our model.Comment: To appear in International Conference on Computer Vision (ICCV), 201
Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
We aim for zero-shot localization and classification of human actions in
video. Where traditional approaches rely on global attribute or object
classification scores for their zero-shot knowledge transfer, our main
contribution is a spatial-aware object embedding. To arrive at spatial
awareness, we build our embedding on top of freely available actor and object
detectors. Relevance of objects is determined in a word embedding space and
further enforced with estimated spatial preferences. Besides local object
awareness, we also embed global object awareness into our embedding to maximize
actor and object interaction. Finally, we exploit the object positions and
sizes in the spatial-aware embedding to demonstrate a new spatio-temporal
action retrieval scenario with composite queries. Action localization and
classification experiments on four contemporary action video datasets support
our proposal. Apart from state-of-the-art results in the zero-shot localization
and classification settings, our spatial-aware embedding is even competitive
with recent supervised action localization alternatives.Comment: ICC
Multi-Context Attention for Human Pose Estimation
In this paper, we propose to incorporate convolutional neural networks with a
multi-context attention mechanism into an end-to-end framework for human pose
estimation. We adopt stacked hourglass networks to generate attention maps from
features at multiple resolutions with various semantics. The Conditional Random
Field (CRF) is utilized to model the correlations among neighboring regions in
the attention map. We further combine the holistic attention model, which
focuses on the global consistency of the full human body, and the body part
attention model, which focuses on the detailed description for different body
parts. Hence our model has the ability to focus on different granularity from
local salient regions to global semantic-consistent spaces. Additionally, we
design novel Hourglass Residual Units (HRUs) to increase the receptive field of
the network. These units are extensions of residual units with a side branch
incorporating filters with larger receptive fields, hence features with various
scales are learned and combined within the HRUs. The effectiveness of the
proposed multi-context attention mechanism and the hourglass residual units is
evaluated on two widely used human pose estimation benchmarks. Our approach
outperforms all existing methods on both benchmarks over all the body parts.Comment: The first two authors contribute equally to this wor
- …