
    Ray3D: ray-based 3D human pose estimation for monocular absolute 3D localization

    In this paper, we propose Ray3D, a novel ray-based approach for monocular absolute 3D human pose estimation with a calibrated camera. Accurate and generalizable absolute 3D human pose estimation from monocular 2D pose input is an ill-posed problem. To address this challenge, we convert the input from pixel space to 3D normalized rays, which makes our approach robust to changes in the camera intrinsic parameters. To handle in-the-wild variations of the camera extrinsic parameters, Ray3D explicitly takes the extrinsic parameters as an input and jointly models the distribution between the 3D pose rays and the camera extrinsic parameters. This network design is the key to the strong generalizability of the Ray3D approach. To understand comprehensively how variations in the camera intrinsic and extrinsic parameters affect the accuracy of absolute 3D keypoint localization, we conduct in-depth, systematic experiments on three single-person 3D benchmarks as well as one synthetic benchmark. These experiments demonstrate that our method significantly outperforms existing state-of-the-art models. Our code and the synthetic dataset are available at https://github.com/YxZhxn/Ray3D. (Comment: Accepted by CVPR 2022.)
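    The pixel-to-ray conversion described above follows directly from the pinhole camera model. Below is a minimal Python sketch of that step, plus the rotation of the resulting ray into world coordinates using the extrinsics; the function names and the example intrinsic matrix are illustrative and not taken from the Ray3D codebase.

        import numpy as np

        def pixel_to_ray(uv, K):
            """Back-project a 2D keypoint (u, v) into a unit 3D ray.

            K is the 3x3 camera intrinsic matrix. Working with normalized
            rays instead of pixels removes the dependence on focal length
            and principal point, which is the intrinsic-robustness idea.
            """
            uv1 = np.array([uv[0], uv[1], 1.0])
            ray = np.linalg.inv(K) @ uv1
            return ray / np.linalg.norm(ray)

        def ray_to_world(ray, R, t):
            """Rotate a camera-space ray into world coordinates.

            Assuming the convention x_cam = R @ x_world + t, the ray
            direction maps through R.T, and the camera center -R.T @ t
            serves as the ray origin.
            """
            return R.T @ ray

        # Example: a joint detected at pixel (700, 400) under a toy camera.
        K = np.array([[1000.0, 0.0, 640.0],
                      [0.0, 1000.0, 360.0],
                      [0.0, 0.0, 1.0]])
        print(pixel_to_ray((700.0, 400.0), K))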

    Robust face alignment and partial face recognition

    Face alignment and face recognition are two fundamental problems in the facial analysis community. Face alignment forms the basis for accurate face recognition, age estimation, and facial expression recognition, while face recognition has been widely applied in practical scenarios such as access control systems, mass surveillance, and human-computer interaction. Work in these two fields falls into two main lines: holistic face alignment and recognition, and partial face alignment and recognition. Numerous holistic face alignment and recognition methods have been proposed, and recent state-of-the-art approaches have surpassed human recognition capability on the challenging LFW dataset. One of the major challenges in this area lies in designing robust holistic face alignment methods that can accurately detect landmarks on faces with large pose variations. By contrast, relatively few works address partial face alignment and recognition, and they have achieved limited success. In this thesis, we aim to advance holistic face alignment and to contribute to the field of partial face alignment and recognition. For holistic face alignment, we devise two deep-learning-based approaches that estimate facial landmark positions with great robustness and high accuracy. For partial face alignment and recognition, we present an approach based on robust feature set matching, which performs partial face alignment and recognition jointly in a single framework.

    For holistic face alignment, we focus on the facial landmark detection problem. Mainstream landmark detection approaches consist of a pose initialization stage and a pose update stage. The pose initialization step derives an initial pose for face alignment; since landmark detection is a highly non-convex problem, this initial pose largely determines the local basin in which the final solution lands. The pose update stage then locally refines the initial pose to achieve high alignment accuracy. Both steps are critical for robust and accurate face alignment. In our first work, to improve the robustness of the pose initialization step against large pose variations, we devise a Global Exemplar-based Deep Auto-encoder Network (GEDAN), whose top regression layer deploys several exemplars to assist pose estimation. For the pose update stage, we design a series of Localized Deep Auto-encoder Networks (LDANs). Specifically, the first layer of an LDAN consists of individual Local Auto-Encoders (LAEs), each of which extracts pose-related features from its corresponding local patch. The outputs of these LAEs are fed directly into their corresponding local regressors, and are also concatenated into a global feature vector that is further encoded by several layers of auto-encoders to preserve the global facial structure. By assembling GEDAN and several LDANs in a coarse-to-fine manner, our approach achieves superior alignment accuracy at real-time speed. We term this network ensemble Cascaded Deep Auto-encoder Networks (CDAN).

    While CDAN works well on near-upright faces, it cannot detect landmarks in arbitrarily rotated facial images. To this end, we leverage the strength of Convolutional Neural Networks (CNNs) and devise a Hierarchical CNN (HiCNN) cascade. HiCNN consists of a global CNN, a part-based CNN, and a patch-based CNN. The global CNN generates a preliminary four-landmark configuration from the low-resolution facial image. Based on this preliminary result, the part-based CNN estimates landmark positions from the corresponding facial parts at a larger resolution. Finally, the patch-based CNN refines the landmark positions using pose-indexed patches at the highest resolution. Extensive experiments on three benchmarks show that the proposed HiCNN accurately detects landmarks in facial images with arbitrary in-plane rotation, large scale variations, and random face shifts.

    Both CDAN and HiCNN are holistic face alignment methods; they may fail if the input is an arbitrary facial patch. In realistic scenarios, however, faces may be severely occluded or randomly cropped, resulting in partial faces, and it is desirable to automatically align these partial faces to holistic facial images and subsequently recognize them. To this end, we propose a new partial face recognition approach named Robust Point Set Matching (RPSM), based on feature set matching, which automatically aligns partial face patches to holistic gallery faces and is robust to occlusions and illumination changes. Given a gallery image and a probe face patch, we first detect keypoints and extract their local features. RPSM then matches the extracted local feature sets by minimizing their geometric and textural differences, and the similarity of two faces is computed as the distance between the two feature sets. The matching problem is formulated as a linear program; hence, an affine transformation constraint can easily be imposed to prevent unrealistic face warping. The proposed RPSM achieves superior results on both partial face alignment and partial face recognition on four public face datasets.

    DOCTOR OF PHILOSOPHY (EEE)
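    To make the feature-set-matching idea concrete, the following Python sketch matches two sets of keypoints by minimizing a weighted sum of textural (descriptor) and geometric (location) costs. For simplicity it uses the Hungarian algorithm in place of the thesis's linear-programming formulation with its affine-warp constraint, so it should be read as an illustration of the matching objective rather than as RPSM itself; all names and the weight alpha are assumptions.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def match_feature_sets(pts_a, feats_a, pts_b, feats_b, alpha=0.5):
            """Match two local-feature sets with a combined cost.

            pts_*:   (N, 2) keypoint locations, assumed roughly normalized.
            feats_*: (N, D) local descriptors (e.g. SIFT).
            Returns matched index pairs and the mean matching cost, usable
            as a dissimilarity between the two faces.
            """
            # Textural cost: pairwise descriptor distances.
            tex = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
            # Geometric cost: pairwise keypoint distances.
            geo = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
            cost = alpha * tex + (1.0 - alpha) * geo
            rows, cols = linear_sum_assignment(cost)  # one-to-one matching
            return list(zip(rows, cols)), float(cost[rows, cols].mean())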

    Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking

    The recent trend in multiple object tracking (MOT) is to leverage deep learning to boost tracking performance. In this paper, we propose a novel solution named TransSTAM, which leverages the Transformer to effectively model both the appearance features of each object and the spatial-temporal relationships among objects. TransSTAM consists of two major parts: (1) the encoder utilizes the powerful self-attention mechanism of the Transformer to learn discriminative features for each tracklet; (2) the decoder adopts the standard cross-attention mechanism to model the affinities between the tracklets and the detections, taking both spatial-temporal and appearance features into account. TransSTAM has two major advantages: (1) it is based solely on the encoder-decoder architecture and enjoys a compact network design, and is hence computationally efficient; (2) it can effectively learn spatial-temporal and appearance features within one model, hence achieving better tracking accuracy. The proposed method is evaluated on multiple public benchmarks, including MOT16, MOT17, and MOT20, and it achieves a clear improvement in both IDF1 and HOTA over previous state-of-the-art approaches on all benchmarks. Our code is available at https://github.com/icicle4/TranSTAM.
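    As a rough illustration of the encoder-decoder design described above, the PyTorch sketch below encodes tracklet embeddings with self-attention and lets detection embeddings attend to them with cross-attention, producing a tracklet-detection affinity matrix. The dimensions, layer counts, and the dot-product affinity head are assumptions for illustration, not the actual TransSTAM architecture.

        import torch
        import torch.nn as nn

        class ToyAffinityModel(nn.Module):
            """Toy encoder-decoder affinity model in the spirit of TransSTAM."""

            def __init__(self, dim=64, heads=4):
                super().__init__()
                layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=2)
                self.cross_attn = nn.MultiheadAttention(dim, heads,
                                                        batch_first=True)

            def forward(self, tracklets, detections):
                # tracklets:  (B, T, dim) appearance + spatial-temporal embeddings
                # detections: (B, N, dim) per-detection embeddings
                memory = self.encoder(tracklets)                        # self-attention
                fused, _ = self.cross_attn(detections, memory, memory)  # cross-attention
                return fused @ memory.transpose(1, 2)                   # (B, N, T) affinities

        model = ToyAffinityModel()
        affinity = model(torch.randn(2, 5, 64), torch.randn(2, 7, 64))
        print(affinity.shape)  # torch.Size([2, 7, 5])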