In this paper, we tackle the task of scene-aware 3D human motion forecasting,
which consists of predicting future human poses given a 3D scene and past
human motion. A key challenge of this task is to ensure consistency between the
human and the scene, accounting for human-scene interactions. Previous attempts
to do so model such interactions only implicitly, and thus tend to produce
artifacts such as "ghost motion" because of the lack of explicit constraints
between the local poses and the global motion. Here, by contrast, we propose to
explicitly model the human-scene contacts. To this end, we introduce
distance-based contact maps that capture the contact relationships between
every joint and every 3D scene point at each time instant. We then develop a
two-stage pipeline that first predicts the future contact maps from the past
ones and the scene point cloud, and then forecasts the future human poses by
conditioning them on the predicted contact maps. During training, we explicitly
encourage consistency between the global motion and the local poses via a prior
defined using the contact maps and future poses. Our approach outperforms the
state-of-the-art human motion forecasting and human synthesis methods on both
synthetic and real datasets. Our code is available at
https://github.com/wei-mao-2019/ContAwareMotionPred.
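To make the contact-map idea concrete, below is a minimal sketch of how such distance-based maps could be computed. The Gaussian kernel and the bandwidth `sigma` are illustrative assumptions, not necessarily the paper's exact mapping from distance to contact value.

```python
import numpy as np

def contact_maps(joints, scene_points, sigma=0.1):
    """Hypothetical distance-based contact maps.

    joints: (T, J, 3) joint positions over T frames;
    scene_points: (P, 3) scene point cloud.
    Returns (T, J, P) contact values in [0, 1], near 1 when a joint
    touches a scene point and decaying with distance.
    """
    # Pairwise joint-to-point distances per frame: (T, J, P).
    diff = joints[:, :, None, :] - scene_points[None, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Assumed Gaussian kernel turning distances into soft contact values.
    return np.exp(-dist**2 / (2 * sigma**2))
```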
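The two-stage pipeline can be summarized by the following interface sketch; `contact_predictor` and `pose_forecaster` are hypothetical stand-ins for the two networks, not the authors' actual API.

```python
def forecast(past_poses, past_contacts, scene_points,
             contact_predictor, pose_forecaster):
    # Stage 1: predict future contact maps from the past maps and the scene.
    future_contacts = contact_predictor(past_contacts, scene_points)
    # Stage 2: forecast future poses conditioned on the predicted maps.
    return pose_forecaster(past_poses, scene_points, future_contacts)
```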
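Likewise, a consistency prior of the kind described above could take the following form, assuming contact values act as soft weights on joint-to-point distances; this is a sketch under that assumption, not the paper's exact training objective.

```python
import numpy as np

def contact_consistency_loss(pred_joints, scene_points, pred_contacts):
    """Hypothetical contact-consistency prior.

    pred_joints: (T, J, 3) forecast poses in the scene frame;
    scene_points: (P, 3); pred_contacts: (T, J, P) predicted maps.
    Penalizes joint-to-point distance wherever contact is predicted,
    tying the global motion (joint locations in the scene) to the
    local poses (which joints touch what).
    """
    dist = np.linalg.norm(
        pred_joints[:, :, None, :] - scene_points[None, None, :, :], axis=-1)
    return float((pred_contacts * dist).mean())
```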