In the presence of annotated data, deep human pose estimation networks yield
impressive performance. Nevertheless, annotating new data is extremely
time-consuming, particularly in real-world conditions. Here, we address this by
leveraging contrastive self-supervised (CSS) learning to extract rich latent
vectors from single-view videos. Instead of simply treating the latent features
of nearby frames as positive pairs and those of temporally distant ones as
negative pairs, as other CSS approaches do, we explicitly disentangle each
latent vector into a time-variant component and a time-invariant one. We then
show that applying CSS only to the time-variant features, while also
reconstructing the input and encouraging a gradual transition between the
features of nearby and distant frames, yields a rich latent space that is
well suited for human pose
estimation. Our approach outperforms other unsupervised single-view methods and
matches the performance of multi-view techniques.
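
To make the training objective concrete, the following is a minimal sketch of a loss combining the ingredients named above: a split of each latent vector into time-invariant and time-variant parts, a contrastive term restricted to the time-variant part, and an input reconstruction term. All names (encoder, decoder, d_inv) and the InfoNCE-style formulation are illustrative assumptions, not the authors' exact implementation, and the gradual-transition term between nearby and distant features is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def disentangled_css_loss(encoder, decoder, frames, d_inv, temperature=0.1):
        """frames: (B, T, C, H, W) clips of temporally nearby video frames.
        d_inv: size of the time-invariant part of each latent vector (assumed)."""
        B, T = frames.shape[:2]
        z = encoder(frames.flatten(0, 1)).view(B, T, -1)   # (B, T, D) latent vectors
        z_inv, z_var = z[..., :d_inv], z[..., d_inv:]      # time-invariant / time-variant split

        # Reconstruction: the full latent vector must still explain the input frame.
        recon = decoder(z.flatten(0, 1)).view_as(frames)
        loss_rec = F.mse_loss(recon, frames)

        # Contrastive term on the time-variant part only: adjacent frames of the
        # same clip are positives, frames from other clips in the batch are negatives.
        anchor = F.normalize(z_var[:, 0], dim=-1)          # (B, D_var)
        positive = F.normalize(z_var[:, 1], dim=-1)
        logits = anchor @ positive.t() / temperature       # (B, B) similarity matrix
        loss_css = F.cross_entropy(logits, torch.arange(B, device=logits.device))

        # The time-invariant part should barely change within a clip.
        loss_inv = F.mse_loss(z_inv[:, 0], z_inv[:, 1])

        return loss_rec + loss_css + loss_inv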