Learning Social Relation Traits from Face Images
Social relation defines the association, e.g., warmth, friendliness, and
dominance, between two or more people. Motivated by psychological studies, we
investigate if such fine-grained and high-level relation traits can be
characterised and quantified from face images in the wild. To address this
challenging problem we propose a deep model that learns a rich face
representation to capture gender, expression, head pose, and age-related
attributes, and then performs pairwise-face reasoning for relation prediction.
To learn from heterogeneous attribute sources, we formulate a new network
architecture with a bridging layer to leverage the inherent correspondences
among these datasets. It can also cope with missing target attribute labels.
Extensive experiments show that our approach is effective for fine-grained
social relation learning in images and videos.
Comment: To appear in International Conference on Computer Vision (ICCV) 201
Robust 3D Action Recognition through Sampling Local Appearances and Global Distributions
3D action recognition has broad applications in human-computer interaction
and intelligent surveillance. However, recognizing similar actions remains
challenging since previous literature fails to capture motion and shape cues
effectively from noisy depth data. In this paper, we propose a novel two-layer
Bag-of-Visual-Words (BoVW) model, which suppresses the noise disturbances and
jointly encodes both motion and shape cues. First, background clutter is
removed by a background modeling method that is designed for depth data. Then,
motion and shape cues are jointly used to generate robust and distinctive
spatial-temporal interest points (STIPs): motion-based STIPs and shape-based
STIPs. In the first layer of our model, a multi-scale 3D local steering kernel
(M3DLSK) descriptor is proposed to describe local appearances of cuboids around
motion-based STIPs. In the second layer, a spatial-temporal vector (STV)
descriptor is proposed to describe the spatial-temporal distributions of
shape-based STIPs. Using the Bag-of-Visual-Words (BoVW) model, motion and shape
cues are combined to form a fused action representation. Our model performs
favorably compared with common STIP detection and description methods. Thorough
experiments verify that our model is effective in distinguishing similar
actions and robust to background clutter, partial occlusions and pepper noise
- …
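
The abstract above says that motion and shape cues are combined through a Bag-of-Visual-Words model into a fused action representation. As a rough illustration of that general idea only, the following is a minimal sketch: each cue gets its own codebook, each video is quantized into a word histogram per cue, and the two histograms are joined. The function names (`build_codebook`, `bovw_histogram`, `fused_representation`) are hypothetical, plain k-means stands in for whatever codebook learning the authors use, and fusing by concatenation is an assumption, since the abstract only says the cues are "combined". The M3DLSK and STV descriptors themselves are not implemented here; any per-point feature arrays can be passed in.

```python
import numpy as np


def build_codebook(descriptors, k, iters=20, seed=0):
    """Learn k visual words with plain k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct descriptors.
    idx = rng.choice(len(descriptors), size=k, replace=False)
    centroids = descriptors[idx].astype(float).copy()
    for _ in range(iters):
        # Squared Euclidean distance from every descriptor to every centroid.
        dist = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest word; return an L1-normalised histogram."""
    dist = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)


def fused_representation(motion_desc, shape_desc, motion_codebook, shape_codebook):
    """Concatenate the per-cue histograms (fusion by concatenation is an assumption)."""
    return np.concatenate([
        bovw_histogram(motion_desc, motion_codebook),
        bovw_histogram(shape_desc, shape_codebook),
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Random stand-ins for descriptors computed around motion- and shape-based STIPs.
    motion = rng.normal(size=(50, 8))
    shape = rng.normal(size=(40, 5))
    video_vector = fused_representation(
        motion, shape, build_codebook(motion, 4), build_codebook(shape, 3)
    )
    print(video_vector.shape)
```

In practice the codebooks would be learned once from training videos and reused for every test video; the resulting fixed-length vector can then be passed to any standard classifier.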