More is Better: 3D Human Pose Estimation from Complementary Data Sources

Abstract

Computer Vision (CV) research plays a strategic role in many complex scenarios that are becoming fundamental components of our everyday life. From augmented/virtual reality (AR/VR) to human-robot interaction, a visual interpretation of the surrounding world is the first and most important step towards developing new advanced systems. As in other research areas, the boost in performance of Computer Vision algorithms is mainly attributable to the widespread use of deep neural networks. Rather than relying on handcrafted features, such approaches identify the features best suited to a specific task by learning them from a corpus of carefully annotated data. This important property comes at a price: these networks need very large data collections to learn from. Collecting data is a time-consuming and expensive operation, and it is much harder for some tasks than for others. To limit additional data collection, we therefore need to carefully design models that can extract as much information as possible from already available datasets, even those collected for neighboring domains. In this work I focus on exploring different solutions for an important research problem in Computer Vision, 3D human pose estimation, that is, the task of estimating the 3D skeletal representation of a person depicted in one or more images. This has been done for several configurations: monocular cameras, multi-view systems and egocentric perspectives. First, from a single external front-facing camera, a semi-supervised approach is used to regress the set of 3D joint positions of the depicted person. This is done by exploiting, in a novel manner, all of the available information at all levels of the network, while also allowing the model to be trained with partially labelled data.
Second, a multi-camera 3D human pose estimation system is introduced by designing a network that can be trained in a semi-supervised or even unsupervised manner in a multi-view setting. Unlike standard motion-capture algorithms, which demand a long and time-consuming configuration setup at the beginning of each capture session, this novel approach requires little to no initial system configuration. Finally, a novel architecture is developed to work in a very specific and significantly harder configuration: 3D human pose estimation using cameras embedded in a head-mounted display (HMD). Due to the limited data availability, the model needs to carefully extract information from the data to generalize properly to unseen images. This is particularly useful in AR/VR use cases, demonstrating the versatility of our network under various working conditions.
