Accurate 3D human pose estimation from single images is possible with
sophisticated deep-net architectures that have been trained on very large
datasets. However, this still leaves open the problem of capturing motions for
which no such database exists. Manual annotation is tedious, slow, and
error-prone. In this paper, we propose to replace most of the annotations with
multiple views, used at training time only. Specifically, we train the
system to predict the same pose in all views. Such a consistency constraint is
necessary but not sufficient to predict accurate poses. We therefore complement
it with a supervised loss aiming to predict the correct pose in a small set of
labeled images, and with a regularization term that penalizes drift from
initial predictions. Furthermore, we propose a method to estimate camera pose
jointly with human pose, which lets us utilize multi-view footage where
calibration is difficult, e.g., for pan-tilt or moving handheld cameras. We
demonstrate the effectiveness of our approach on established benchmarks, as
well as on a new Ski dataset with rotating cameras and expert ski motion, for
which annotations are truly hard to obtain.

Comment: CVPR 2018, Ski-Pose PTZ-Camera Dataset availabl
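
The three training terms described above (multi-view consistency, supervision on a small labeled set, and a penalty on drift from initial predictions) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the squared-error form of each term, the weights `w_sup` and `w_reg`, and the assumption that camera-to-world rotations are given (rather than estimated jointly) are all hypothetical choices made for illustration.

```python
import numpy as np

def to_common_frame(poses_cam, rotations):
    """Rotate per-view poses (V, J, 3) into a shared frame.

    rotations: (V, 3, 3) camera-to-world matrices (assumed known here).
    """
    # out[v, k, :] = rotations[v] @ poses_cam[v, k, :]
    return np.einsum("vij,vkj->vki", rotations, poses_cam)

def squared_error(a, b):
    """Mean over all leading axes of the squared per-joint distance."""
    return float(np.mean(np.sum((a - b) ** 2, axis=-1)))

def consistency_loss(poses_common):
    """Penalize disagreement between views: deviation from the mean pose."""
    mean_pose = poses_common.mean(axis=0, keepdims=True)
    return squared_error(poses_common, mean_pose)

def total_loss(preds_cam, rotations, labeled_pred, labeled_gt,
               initial_cam, w_sup=1.0, w_reg=0.1):
    """Hypothetical weighted sum of the three terms named in the abstract."""
    l_cons = consistency_loss(to_common_frame(preds_cam, rotations))
    l_sup = squared_error(labeled_pred, labeled_gt)   # small labeled set
    l_reg = squared_error(preds_cam, initial_cam)     # drift from initial predictions
    return l_cons + w_sup * l_sup + w_reg * l_reg
```

Consistency alone is satisfied by any prediction that is identical across views, including a wrong one, which is why the sketch adds the supervised and regularization terms rather than minimizing `consistency_loss` by itself.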