Occlusion Coherence: Detecting and Localizing Occluded Faces
The presence of occluders significantly impacts object recognition accuracy.
However, occlusion is typically treated as an unstructured source of noise and
explicit models for occluders have lagged behind those for object appearance
and shape. In this paper we describe a hierarchical deformable part model for
face detection and landmark localization that explicitly models part occlusion.
The proposed model structure makes it possible to augment positive training
data with large numbers of synthetically occluded instances. This allows us to
easily incorporate the statistics of occlusion patterns in a discriminatively
trained model. We test the model on several benchmarks for landmark
localization and detection including challenging new data sets featuring
significant occlusion. We find that the addition of an explicit occlusion model
yields a detection system that outperforms existing approaches for occluded
instances while maintaining competitive accuracy in detection and landmark
localization for unoccluded instances.
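The augmentation idea in this abstract (adding synthetically occluded copies of positive examples, with per-part occlusion labels) can be illustrated with a minimal sketch. This is not the authors' procedure: the rectangular noise occluder, the function name `synthesize_occlusion`, and the `max_frac` parameter are all assumptions made here for illustration.

```python
import numpy as np

def synthesize_occlusion(image, landmarks, rng, max_frac=0.5):
    """Paste a random rectangular occluder and flag the landmarks it covers.

    image:     (H, W) or (H, W, C) array.
    landmarks: (N, 2) array of (x, y) pixel coordinates.
    Returns (occluded_image, visible_mask, box), box = (top, left, bh, bw).
    Hypothetical helper: the occluder shape and fill are assumptions.
    """
    h, w = image.shape[:2]
    # Occluder size: between 1 pixel and max_frac of each image dimension.
    bh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    bw = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - bh + 1))
    left = int(rng.integers(0, w - bw + 1))

    out = image.copy()
    noise = rng.integers(0, 256, size=(bh, bw) + image.shape[2:])
    out[top:top + bh, left:left + bw] = noise.astype(image.dtype)

    # A landmark is labeled occluded when it falls inside the pasted box.
    x, y = landmarks[:, 0], landmarks[:, 1]
    covered = (x >= left) & (x < left + bw) & (y >= top) & (y < top + bh)
    return out, ~covered, (top, left, bh, bw)

rng = np.random.default_rng(0)
face = np.zeros((64, 64), dtype=np.uint8)           # stand-in for a face crop
points = np.array([[10, 12], [32, 32], [50, 40]])   # stand-in landmarks (x, y)
occluded, visible, box = synthesize_occlusion(face, points, rng)
```

Each call yields a new training instance together with a visibility label per landmark, which is the kind of supervision an explicit part-occlusion model needs.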
Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation
Deep neural networks with alternating convolutional, max-pooling and
decimation layers are widely used in state-of-the-art architectures for
computer vision. Max-pooling purposefully discards precise spatial information
in order to create features that are more robust, and typically organized as
lower resolution spatial feature maps. On some tasks, such as whole-image
classification, max-pooling derived features are well suited; however, for
tasks requiring precise localization, such as pixel level prediction and
segmentation, max-pooling destroys exactly the information required to perform
well. Precise localization may be preserved by shallow convnets without pooling
but at the expense of robustness. Can we have our max-pooled multi-layered cake
and eat it too? Several papers have proposed summation and concatenation based
methods for combining upsampled coarse, abstract features with finer features
to produce robust pixel level predictions. Here we introduce another model ---
dubbed Recombinator Networks --- where coarse features inform finer features
early in their formation such that finer features can make use of several
layers of computation in deciding how to use coarse features. The model is
trained once, end-to-end and performs better than summation-based
architectures, reducing the error from the previous state of the art on two
facial keypoint datasets, AFW and AFLW, by 30% and beating the current
state-of-the-art on 300W without using extra data. We improve performance even
further by adding a denoising prediction model based on a novel convnet
formulation.
Comment: accepted in CVPR 201
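The difference between summation-based and concatenation-based fusion of coarse and fine feature maps, as contrasted in this abstract, can be sketched in NumPy. This is only an illustration under stated assumptions, not the Recombinator Networks architecture: the nearest-neighbour upsampling, the function names, and the 1x1 `pointwise_conv` standing in for "further layers of computation" are all hypothetical choices made here.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_by_sum(fine, coarse, factor):
    """Summation fusion: channel counts must match; inputs are merged at once."""
    return fine + upsample_nearest(coarse, factor)

def fuse_by_concat(fine, coarse, factor):
    """Concatenation fusion: both inputs are kept as separate channels,
    so later layers can learn how to use the coarse features."""
    return np.concatenate([fine, upsample_nearest(coarse, factor)], axis=0)

def pointwise_conv(x, weights):
    """1x1 convolution, a stand-in for the layers applied after fusion.
    weights has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, x)

rng = np.random.default_rng(0)
fine = rng.standard_normal((4, 8, 8))     # high-resolution, less abstract
coarse = rng.standard_normal((4, 4, 4))   # low-resolution, more abstract

summed = fuse_by_sum(fine, coarse, factor=2)        # shape (4, 8, 8)
stacked = fuse_by_concat(fine, coarse, factor=2)    # shape (8, 8, 8)
mixed = pointwise_conv(stacked, rng.standard_normal((3, 8)))  # shape (3, 8, 8)
```

With concatenation, the fine branch still sees the coarse features unmodified and can pass them through several learned layers; with summation, the two sources are combined before any such layer can weigh them.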