Epipolar Transformers
A common approach to localize 3D human joints in a synchronized and
calibrated multi-view setup consists of two-steps: (1) apply a 2D detector
separately on each view to localize joints in 2D, and (2) perform robust
triangulation on 2D detections from each view to acquire the 3D joint
locations. However, in step (1), the 2D detector must resolve challenging
cases such as occlusions and oblique viewing angles purely in 2D, without
leveraging any 3D information, even though these cases could potentially be
better resolved in 3D. Therefore, we propose the differentiable "epipolar transformer",
which enables the 2D detector to leverage 3D-aware features to improve 2D pose
estimation. The intuition is: given a 2D location p in the current view, we
would like to first find its corresponding point p' in a neighboring view, and
then combine the features at p' with the features at p, thus leading to a
3D-aware feature at p. Inspired by stereo matching, the epipolar transformer
leverages epipolar constraints and feature matching to approximate the features
at p'. Experiments on InterHand and Human3.6M show that our approach has
consistent improvements over the baselines. Specifically, in the condition
where no external data is used, our Human3.6M model trained with ResNet-50
backbone and image size 256 x 256 outperforms state-of-the-art by 4.23 mm and
achieves MPJPE 26.9 mm.
Comment: CVPR 2020
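The intuition above can be sketched numerically. The following is a minimal, hypothetical illustration (not the paper's implementation): given a pixel p in the reference view and a fundamental matrix F, we sample candidate locations along the epipolar line in the neighboring view, approximate the feature at p' as a similarity-weighted sum of the sampled features, and fuse it with the feature at p. The function name, the nearest-neighbor sampling, and the simple additive fusion are all assumptions made for illustration.

```python
import numpy as np

def epipolar_fuse(feat_ref, feat_src, p, F, num_samples=64):
    """Hypothetical sketch of epipolar feature fusion.

    feat_ref, feat_src: (C, H, W) feature maps from the reference and
        neighboring views.
    p: (x, y) pixel location in the reference view.
    F: 3x3 fundamental matrix mapping reference pixels to epipolar
        lines (a, b, c) with a*x + b*y + c = 0 in the neighboring view.
    """
    C, H, W = feat_src.shape
    # Epipolar line of p in the neighboring view.
    a, b, c = F @ np.array([p[0], p[1], 1.0])
    # Sample candidate points along the line inside the image
    # (assumes the line is not vertical, i.e. b != 0).
    xs = np.linspace(0, W - 1, num_samples)
    ys = -(a * xs + c) / b
    valid = (ys >= 0) & (ys <= H - 1)
    xs, ys = xs[valid], ys[valid]
    # Nearest-neighbor feature sampling at the candidates, shape (C, N).
    cand = feat_src[:, ys.round().astype(int), xs.round().astype(int)]
    # Feature matching: dot-product similarity, softmax-weighted sum
    # approximates the feature at the corresponding point p'.
    f_p = feat_ref[:, int(p[1]), int(p[0])]   # (C,)
    sim = f_p @ cand                          # (N,)
    w = np.exp(sim - sim.max())
    w /= w.sum()
    f_pprime = cand @ w                       # (C,)
    # 3D-aware feature: combine p and p' (here, a simple sum).
    return f_p + f_pprime
```

Because every step (line sampling, softmax weighting, summation) is a smooth or piecewise-smooth operation, a module like this can sit inside a 2D detector and be trained end to end, which is what "differentiable" refers to in the abstract.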
Multi-Person Absolute 3D Human Pose Estimation with Weak Depth Supervision
In 3D human pose estimation one of the biggest problems is the lack of large,
diverse datasets. This is especially true for multi-person 3D pose estimation,
where, to our knowledge, only machine-generated annotations are available for
training. To mitigate this issue, we introduce a network that can be
trained with additional RGB-D images in a weakly supervised fashion. Since
cheap depth sensors are widespread, videos with depth maps are widely
available, so our method can exploit a large, unannotated dataset. Our algorithm is a
monocular, multi-person, absolute pose estimator. We evaluate the algorithm on
several benchmarks, showing a consistent improvement in error rates. Also, our
model achieves state-of-the-art results on the MuPoTS-3D dataset by a
considerable margin.
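One plausible form of the weak depth supervision described above can be sketched as follows. This is an assumption-laden illustration, not the paper's actual loss: it samples the sensor depth map at predicted 2D joint locations, ignores invalid (zero-depth) pixels, and only penalizes predicted absolute joint depths that deviate from the surface measurement by more than a tolerance (since an RGB-D sensor observes the body surface, not the internal skeleton). The function name and the tolerance value are hypothetical.

```python
import numpy as np

def weak_depth_loss(pred_depths, depth_map, joints_2d, tolerance=0.3):
    """Hypothetical weak-supervision loss on absolute joint depth.

    pred_depths: (J,) predicted absolute depths in meters for J joints.
    depth_map: (H, W) sensor depth map in meters (0 marks invalid pixels).
    joints_2d: (J, 2) pixel coordinates (x, y) of the joints.
    """
    H, W = depth_map.shape
    xs = np.clip(joints_2d[:, 0].round().astype(int), 0, W - 1)
    ys = np.clip(joints_2d[:, 1].round().astype(int), 0, H - 1)
    sensor = depth_map[ys, xs]          # measured surface depth per joint
    valid = sensor > 0                  # skip holes in the depth map
    # Hinge-style penalty: free within the tolerance band, linear outside.
    err = np.maximum(np.abs(pred_depths - sensor) - tolerance, 0.0)[valid]
    return err.mean() if err.size else 0.0
```

A loss of this shape needs no human annotation, only the raw depth stream, which is what lets large unannotated RGB-D collections contribute supervision on absolute (root) depth.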