Electrical and Electronic Engineering, Imperial College London
Abstract
This thesis investigates novel techniques for human pose estimation (HPE) from sparse depth/3D data, with the aim of developing a standalone, high-accuracy, low-latency HPE module suitable for deployment on systems with limited processing resources.
Building on existing work and motivated by the significant progress achieved in the relevant fields, two novel methods for estimating and tracking the human pose from sparse depth/3D data are proposed.
First, a real-time human pose estimation and tracking framework is developed. It builds upon an established approach based on human template tracking and uses the 3D Signed Distance Function (SDF) data representation. A series of complementary tracking features is introduced, specifically tackling the issues of free space violation, body part visibility and leg intersection, which are typically encountered under real-life monitoring conditions. The method is experimentally evaluated on a series of publicly available datasets, where it achieves state-of-the-art (SOA) performance, and it is also successfully used for human behavioural modelling on an autonomous robotic platform.
Due to inherent limitations of this tracking-based approach, such as the requirement for clearly segmented human/background data and the use of an out-of-the-box initialiser, a second, deep learning-based architecture is investigated. Specifically, a detection-based 3D-CNN architecture for 3D human pose estimation from 3D data is introduced, following the sequential network architecture paradigm. It uses a volumetric data representation and generates 3D heatmaps corresponding to potential locations of the human joints in the scene, achieving SOA accuracy. Additionally, a 3D body-part detector is incorporated, extending the architecture towards multi-person 3D pose estimation, the first such method for 3D data.
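As an illustration of the heatmap-based detection idea, the following minimal sketch reads a joint location off a single volumetric heatmap by taking the highest-scoring voxel and mapping its centre to metric coordinates. It is not the decoding procedure used in the thesis; the function name `decode_joint`, the flat (z, y, x) row-major layout, and the `voxel_size` and `origin` parameters are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def decode_joint(
    heatmap: Sequence[float],
    shape: Tuple[int, int, int],
    voxel_size: float,
    origin: Tuple[float, float, float],
) -> Tuple[Tuple[int, int, int], Tuple[float, float, float]]:
    """Return the peak voxel (x, y, z) and its centre in metric units.

    Assumptions (illustrative only): `heatmap` is a flat sequence of scores
    stored in (z, y, x) row-major order, `shape` is (depth, height, width),
    and `origin` is the metric position of the volume's minimum corner.
    """
    depth, height, width = shape
    if len(heatmap) != depth * height * width:
        raise ValueError("heatmap length does not match shape")

    # Index of the highest-scoring voxel in the flattened volume.
    flat_idx = max(range(len(heatmap)), key=lambda i: heatmap[i])

    # Unravel the flat index into (z, y, x) voxel coordinates.
    z, rem = divmod(flat_idx, height * width)
    y, x = divmod(rem, width)

    # Map the voxel centre to metric coordinates.
    ox, oy, oz = origin
    xyz = (
        ox + (x + 0.5) * voxel_size,
        oy + (y + 0.5) * voxel_size,
        oz + (z + 0.5) * voxel_size,
    )
    return (x, y, z), xyz
```

A network of this kind would produce one such heatmap per joint, so a decoder like this would be applied once per joint; handling several people in the scene requires more than a single argmax per heatmap.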
However, the 3D-CNN architecture comes at a steep computational cost, making it unsuitable for implementation on low-power systems. The final contribution of this thesis is therefore an investigation of computationally efficient 3D-CNN design guidelines, aimed at reducing the computational complexity of the developed model. The result of this investigation is a novel 3D-CNN architecture for multi-person pose estimation from 3D data, composed mainly of 3D depthwise residual bottleneck units, squeeze-and-excitation (SE) blocks and a decomposed strided input layer. This optimised version performs comparably to SOA methods on two public datasets, while requiring significantly fewer computational resources: it achieves a speedup of over 100x on a modern low-power mobile device and a reduction in model size of approximately 50x.
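To give a rough sense of why depthwise designs reduce model size, the sketch below compares the weight count of a standard 3D convolution with that of a depthwise 3D convolution followed by a 1x1x1 pointwise convolution. The layer sizes are hypothetical and are not taken from the thesis, biases are ignored, and the resulting ratio is not meant to reproduce the speedup or model size figures reported above.

```python
def standard_conv3d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 3D convolution with a cubic kernel (no bias)."""
    return kernel ** 3 * c_in * c_out


def depthwise_separable_conv3d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise 3D convolution plus a 1x1x1 pointwise convolution (no bias)."""
    depthwise = kernel ** 3 * c_in  # one cubic filter per input channel
    pointwise = c_in * c_out        # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3x3 kernel, 64 input and 64 output channels.
    std = standard_conv3d_params(3, 64, 64)
    sep = depthwise_separable_conv3d_params(3, 64, 64)
    print(f"standard: {std}, depthwise separable: {sep}, ratio: {std / sep:.1f}x")
```

For this hypothetical layer the separable variant needs roughly 19 times fewer weights. Savings at the level of a full network depend on how many layers are replaced and on the remaining components, such as SE blocks and the input layer.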