Existing volumetric methods for 3D human pose estimation are accurate
but computationally expensive and optimized for single-time-step
prediction. We present TEMPO, an efficient multi-view pose estimation model
that learns a robust spatiotemporal representation, improving pose accuracy
while also tracking and forecasting human pose. We significantly reduce
computation compared to the state-of-the-art by recurrently computing
per-person 2D pose features, fusing both spatial and temporal information into
a single representation. In doing so, our model is able to use spatiotemporal
context to predict more accurate human poses without sacrificing efficiency. We
further use this representation to track human poses over time as well as
predict future poses. Finally, we demonstrate that our model is able to
generalize across datasets without scene-specific fine-tuning. TEMPO achieves
10% lower MPJPE with a 33× improvement in FPS compared to TesseTrack
on the challenging CMU Panoptic Studio dataset.

Comment: Accepted at ICCV 2023
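
To make the recurrent spatiotemporal fusion concrete, the sketch below shows a ConvGRU-style update over per-person 2D pose feature maps: each time step's features are merged into a single running representation at constant per-frame cost. This is a minimal illustration, not the paper's implementation; the ConvGRU cell, channel counts, and feature shapes are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: fuses the current frame's 2D pose
    features with a running spatiotemporal state (illustrative only)."""
    def __init__(self, channels: int):
        super().__init__()
        # update (z) and reset (r) gates, computed jointly
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        # candidate state
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new fused representation

# One hidden state per tracked person, updated every time step, so the
# per-frame cost stays constant instead of growing with a temporal window.
cell = ConvGRUCell(channels=32)
h = torch.zeros(1, 32, 64, 64)                 # running representation
for t in range(5):
    feats_2d = torch.randn(1, 32, 64, 64)      # per-person 2D features at t
    h = cell(feats_2d, h)                      # fuse spatial + temporal info
```

A full system would decode poses, tracks, and forecasts from the fused state; the sketch only illustrates the constant-cost recurrent fusion the abstract describes.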