This paper presents a method to learn a hand-object interaction prior for
reconstructing a 3D hand-object scene from a single RGB image. Both inference
and training-data generation for 3D hand-object scene reconstruction are
challenging due to the depth ambiguity of a single image and occlusions by the
hand and object. We turn this challenge into an opportunity by utilizing the
hand shape to constrain the possible relative configuration of the hand and
object geometry. We design a generalizable implicit function, HandNeRF, that
explicitly encodes the correlation of the 3D hand shape features and 2D object
features to predict the hand and object scene geometry. With experiments on
real-world datasets, we show that HandNeRF is able to reconstruct hand-object
scenes of novel grasp configurations more accurately than comparable methods.
Moreover, we demonstrate that object reconstruction from HandNeRF ensures more
accurate execution of a downstream task, such as grasping for robotic
hand-over.

Comment: 9 pages, 4 tables, 7 figures