9,250 research outputs found
Beyond Frontal Faces: Improving Person Recognition Using Multiple Cues
We explore the task of recognizing peoples' identities in photo albums in an
unconstrained setting. To facilitate this, we introduce the new People In Photo
Albums (PIPA) dataset, consisting of over 60000 instances of 2000 individuals
collected from public Flickr photo albums. With only about half of the person
images containing a frontal face, the recognition task is very challenging due
to the large variations in pose, clothing, camera viewpoint, image resolution
and illumination. We propose the Pose Invariant PErson Recognition (PIPER)
method, which accumulates the cues of poselet-level person recognizers trained
by deep convolutional networks to discount for the pose variations, combined
with a face recognizer and a global recognizer. Experiments on three different
settings confirm that in our unconstrained setup PIPER significantly improves
on the performance of DeepFace, which is one of the best face recognizers as
measured on the LFW dataset
FML: Face Model Learning from Videos
Monocular image-based 3D reconstruction of faces is a long-standing problem
in computer vision. Since image data is a 2D projection of a 3D face, the
resulting depth ambiguity makes the problem ill-posed. Most existing methods
rely on data-driven priors that are built from limited 3D face scans. In
contrast, we propose multi-frame video-based self-supervised training of a deep
network that (i) learns a face identity model both in shape and appearance
while (ii) jointly learning to reconstruct 3D faces. Our face model is learned
using only corpora of in-the-wild video clips collected from the Internet. This
virtually endless source of training data enables learning of a highly general
3D face model. In order to achieve this, we propose a novel multi-frame
consistency loss that ensures consistent shape and appearance across multiple
frames of a subject's face, thus minimizing depth ambiguity. At test time we
can use an arbitrary number of frames, so that we can perform both monocular as
well as multi-frame reconstruction.Comment: CVPR 2019 (Oral). Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ,
Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19
Learning Deep Context-aware Features over Body and Latent Parts for Person Re-identification
Person Re-identification (ReID) is to identify the same person across
different cameras. It is a challenging task due to the large variations in
person pose, occlusion, background clutter, etc How to extract powerful
features is a fundamental problem in ReID and is still an open problem today.
In this paper, we design a Multi-Scale Context-Aware Network (MSCAN) to learn
powerful features over full body and body parts, which can well capture the
local context knowledge by stacking multi-scale convolutions in each layer.
Moreover, instead of using predefined rigid parts, we propose to learn and
localize deformable pedestrian parts using Spatial Transformer Networks (STN)
with novel spatial constraints. The learned body parts can release some
difficulties, eg pose variations and background clutters, in part-based
representation. Finally, we integrate the representation learning processes of
full body and body parts into a unified framework for person ReID through
multi-class person identification tasks. Extensive evaluations on current
challenging large-scale person ReID datasets, including the image-based
Market1501, CUHK03 and sequence-based MARS datasets, show that the proposed
method achieves the state-of-the-art results.Comment: Accepted by CVPR 201
Cognitive visual tracking and camera control
Cognitive visual tracking is the process of observing and understanding the behaviour of a moving person. This paper presents an efficient solution to extract, in real-time, high-level information from an observed scene, and generate the most appropriate commands for a set of pan-tilt-zoom (PTZ) cameras in a surveillance scenario. Such a high-level feedback control loop, which is the main novelty of our work, will serve to reduce uncertainties in the observed scene and to maximize the amount of information extracted from it. It is implemented with a distributed camera system using SQL tables as virtual communication channels, and Situation Graph Trees for knowledge representation, inference and high-level camera control. A set of experiments in a surveillance scenario show the effectiveness of our approach and its potential for real applications of cognitive vision
- ā¦