We propose a method for multi-person detection and 2-D pose estimation that
achieves state-of-the-art results on the challenging COCO keypoints task. It is a
simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of boxes which are
likely to contain people; for this we use the Faster RCNN detector. In the
second stage, we estimate the keypoints of the person potentially contained in
each proposed bounding box. For each keypoint type we predict dense heatmaps
and offsets using a fully convolutional ResNet. To combine these outputs we
introduce a novel aggregation procedure to obtain highly localized keypoint
predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression
(NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based
confidence score estimation, instead of box-level scoring.
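
To make the idea of keypoint-based NMS and scoring concrete, the following is a minimal sketch and not the procedure used in this work. It assumes an OKS-style Gaussian similarity between poses, a greedy suppression loop, and a pose score equal to the mean keypoint confidence. The function names, the `kappa` falloff constant, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math


def pose_similarity(pose_a, pose_b, scale, kappa=0.1):
    """OKS-style similarity in [0, 1] between two poses (assumed measure).

    Each pose is a list of (x, y) keypoints in the same order; `scale` is a
    rough person size in pixels and `kappa` an illustrative falloff constant.
    """
    total = 0.0
    for (xa, ya), (xb, yb) in zip(pose_a, pose_b):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2.0 * (scale * kappa) ** 2))
    return total / len(pose_a)


def pose_score(keypoint_confidences):
    """Score a pose from its keypoints rather than from its box (assumed: mean)."""
    return sum(keypoint_confidences) / len(keypoint_confidences)


def keypoint_nms(poses, scores, scales, threshold=0.5):
    """Greedy NMS: keep a pose only if it is dissimilar to every kept pose."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(pose_similarity(poses[i], poses[j], scales[j]) < threshold
               for j in kept):
            kept.append(i)
    return kept


# Two near-duplicate detections of one person, plus a distant second person.
poses = [
    [(10.0, 10.0), (20.0, 20.0)],
    [(10.5, 10.0), (20.0, 20.5)],
    [(100.0, 100.0), (110.0, 110.0)],
]
scores = [pose_score([0.9, 0.8]), pose_score([0.6, 0.7]), pose_score([0.5, 0.5])]
scales = [50.0, 50.0, 50.0]
print(keypoint_nms(poses, scores, scales))  # indices of the poses that survive
```

In this toy input the second pose nearly coincides with the first and is suppressed, while the distant third pose is kept. Comparing keypoints directly lets two people with heavily overlapping boxes both survive as long as their poses differ, which box-level NMS cannot guarantee.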
Trained on COCO data alone, our final system achieves average precision of
0.649 on the COCO test-dev set and 0.643 on the test-standard set, outperforming
the winner of the 2016 COCO keypoints challenge and other recent state-of-the-art
methods.
Further, by using additional in-house labeled data we obtain an even higher
average precision of 0.685 on the test-dev set and 0.673 on the test-standard
set, more than 5% absolute improvement compared to the previous best performing
method on the same dataset.

Comment: Paper describing an improved version of the G-RMI entry to the 2016
COCO keypoints challenge (http://image-net.org/challenges/ilsvrc+coco2016).
Camera-ready version to appear in the Proceedings of CVPR 201