We propose a novel transformer-based framework that reconstructs two high-fidelity hands from multi-view RGB images. Unlike existing hand pose estimation methods, which typically train a deep network to regress hand model parameters from a single RGB image, we consider a more challenging problem setting in which we directly regress the absolute root poses of two hands with extended forearms at high resolution from egocentric views. As existing datasets
are either unsuitable for egocentric viewpoints or lack background variation, we create a large-scale synthetic dataset with diverse scenarios and collect a real dataset with a calibrated multi-camera setup to verify our proposed multi-view image feature fusion strategy. To make the reconstruction physically
plausible, we propose two strategies: (i) a coarse-to-fine spectral graph convolution decoder that smooths the meshes during upsampling and (ii) an optimisation-based refinement stage at inference to prevent self-penetration.
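As a concrete illustration of strategy (i), the minimal sketch below shows one common way such a coarse-to-fine spectral decoder can be realised: Chebyshev-polynomial graph convolutions over the hand-mesh Laplacian, followed by precomputed mesh-upsampling matrices that lift vertex features to the next resolution level. All class names, layer sizes, and the choice of Chebyshev filtering are our assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ChebGraphConv(nn.Module):
    """Spectral graph convolution with a truncated Chebyshev expansion.

    `L_hat` is the rescaled mesh Laplacian (2L / lambda_max - I) at one
    resolution level; `K` is the polynomial order of the filter.
    """

    def __init__(self, in_dim, out_dim, K=3):
        super().__init__()
        self.K = K
        self.weight = nn.Parameter(torch.randn(K, in_dim, out_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, L_hat):
        # x: (B, V, in_dim) vertex features; L_hat: (V, V) dense rescaled Laplacian
        Tx_prev, Tx_curr = x, torch.matmul(L_hat, x)
        out = torch.einsum("bvi,io->bvo", Tx_prev, self.weight[0])
        if self.K > 1:
            out = out + torch.einsum("bvi,io->bvo", Tx_curr, self.weight[1])
        for k in range(2, self.K):
            # Chebyshev recurrence: T_k(L) x = 2 L T_{k-1}(L) x - T_{k-2}(L) x
            Tx_next = 2.0 * torch.matmul(L_hat, Tx_curr) - Tx_prev
            out = out + torch.einsum("bvi,io->bvo", Tx_next, self.weight[k])
            Tx_prev, Tx_curr = Tx_curr, Tx_next
        return out + self.bias


class CoarseToFineDecoder(nn.Module):
    """Coarse-to-fine mesh decoder: filter vertex features at each mesh
    level with a spectral graph convolution, then lift them to the next,
    finer level with a fixed upsampling matrix.

    `laplacians[i]` is the (V_i, V_i) rescaled Laplacian of level i and
    `upsample_mats[i]` the (V_{i+1}, V_i) upsampling matrix, both
    precomputed from the hand-mesh topology (assumed inputs).
    """

    def __init__(self, feat_dims, laplacians, upsample_mats):
        super().__init__()
        self.laplacians = laplacians
        self.upsample_mats = upsample_mats
        self.convs = nn.ModuleList(
            ChebGraphConv(feat_dims[i], feat_dims[i + 1])
            for i in range(len(feat_dims) - 1)
        )
        self.to_xyz = nn.Linear(feat_dims[-1], 3)  # final per-vertex coordinates

    def forward(self, x):
        # x: (B, V_0, feat_dims[0]) coarse per-vertex features from the image encoder
        for conv, L_hat, U in zip(self.convs, self.laplacians, self.upsample_mats):
            x = torch.relu(conv(x, L_hat))  # filter features over the mesh graph
            x = torch.matmul(U, x)          # upsample to the next mesh resolution
        return self.to_xyz(x)               # (B, V_final, 3) vertex positions
```

Filtering with the mesh Laplacian couples each vertex with its neighbours, which is what encourages smooth surfaces as the mesh is progressively upsampled.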
Through extensive quantitative and qualitative evaluations, we show that our framework is able to produce realistic two-hand reconstructions, that models trained on our synthetic data generalise to real data, and that the approach supports real-time AR/VR applications.
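For strategy (ii), the abstract does not specify the refinement objective; a minimal sketch of one plausible formulation is given below, assuming each hand is approximated by sphere proxies and that a differentiable `decode_spheres` function (hypothetical) maps pose parameters to sphere centres and radii. The penalty pushes interpenetrating regions apart while a quadratic data term keeps the result close to the network's initial prediction.

```python
import torch


def penetration_penalty(l_centres, r_centres, l_radii, r_radii):
    """Penalty for overlapping sphere proxies attached to the two hands.
    Each overlapping sphere pair contributes its overlap depth to the loss."""
    dists = torch.cdist(l_centres, r_centres)          # (N_l, N_r) centre distances
    min_dists = l_radii[:, None] + r_radii[None, :]    # contact distance per pair
    return torch.clamp(min_dists - dists, min=0.0).sum()


def refine(pose_init, decode_spheres, steps=20, lr=1e-2, w_reg=1.0):
    """Optimisation-based refinement at inference: start from the network
    prediction `pose_init` and take a few gradient steps that resolve
    interpenetration. `decode_spheres` (assumed differentiable) maps pose
    parameters to sphere centres/radii for the left and right hand."""
    pose = pose_init.clone().requires_grad_(True)
    optim = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        (lc, l_rad), (rc, r_rad) = decode_spheres(pose)
        loss = (penetration_penalty(lc, rc, l_rad, r_rad)
                + w_reg * (pose - pose_init).pow(2).sum())  # stay near the initial estimate
        optim.zero_grad()
        loss.backward()
        optim.step()
    return pose.detach()
```

Since the refinement only needs to resolve local overlaps rather than re-estimate the pose, a handful of gradient steps typically suffices, which keeps this stage compatible with real-time use.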