Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation

Constantin, Victor; Fua, Pascal; Honari, Sina; Rhodin, Helge; Salzmann, Mathieu

Authors: Victor Constantin, Pascal Fua, Sina Honari, Helge Rhodin, Mathieu Salzmann
Publication date: 25 March 2021
Abstract

In the presence of annotated data, deep human pose estimation networks yield impressive performance. Nevertheless, annotating new data is extremely time-consuming, particularly in real-world conditions. Here, we address this by leveraging contrastive self-supervised (CSS) learning to extract rich latent vectors from single-view videos. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and distant features, yields a rich latent space well suited for human pose estimation. Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.
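
To make the central idea concrete, the following is a minimal sketch, not the authors' implementation, of a contrastive loss applied only to the time-variant part of each latent vector. It assumes a generic InfoNCE-style loss with cosine similarity, which the abstract does not specify. The function names, the split point `n_variant`, the frame offsets `near` and `far`, and the temperature are all hypothetical choices made for illustration. The reconstruction term and the gradual-transition term mentioned in the abstract are omitted.

```python
import math

def split_latent(z, n_variant):
    """Split a latent vector into (time-variant, time-invariant) parts.
    The split point n_variant is a hypothetical choice."""
    return z[:n_variant], z[n_variant:]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (assumed here, not taken from the paper):
    -log( exp(s_pos/T) / (exp(s_pos/T) + sum_k exp(s_neg_k/T)) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def time_variant_css_loss(latents, t, near=1, far=3, n_variant=2):
    """Contrast frame t against a nearby frame (positive) and temporally
    distant frames (negatives), using only the time-variant features."""
    anchor, _ = split_latent(latents[t], n_variant)
    positive, _ = split_latent(latents[t + near], n_variant)
    negatives = [split_latent(z, n_variant)[0]
                 for i, z in enumerate(latents) if abs(i - t) >= far]
    return info_nce(anchor, positive, negatives)

# Toy per-frame latents: the first two dimensions change over time,
# the last two stay constant (standing in for time-invariant content).
latents = [[math.cos(0.5 * i), math.sin(0.5 * i), 1.0, -1.0] for i in range(8)]
loss = time_variant_css_loss(latents, t=0)
```

Because the loss reads only the first `n_variant` dimensions, changing the constant dimensions of the toy latents leaves it unchanged, which is the property the disentanglement is meant to provide in this sketch.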

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2012.01511

Last time updated on 02/03/2021