We propose a method for inferring human attributes (such as gender, hair
style, clothes style, expression, action) from images of people under large
variation of viewpoint, pose, appearance, articulation and occlusion.
Convolutional Neural Nets (CNN) have been shown to perform very well on large
scale object recognition problems. In the context of attribute classification,
however, the signal is often subtle and it may cover only a small part of the
image, while the image is dominated by the effects of pose and viewpoint.
Discounting for pose variation would require training on very large labeled
datasets which are not presently available. Part-based models, such as poselets
and DPM have been shown to perform well for this problem but they are limited
by shallow low-level features. We propose a new method which combines
part-based models and deep learning by training pose-normalized CNNs. We show
substantial improvement vs. state-of-the-art methods on challenging attribute
classification tasks in unconstrained settings. Experiments confirm that our
method outperforms both the best part-based methods on this problem and
conventional CNNs trained on the full bounding box of the person.Comment: 8 page