We present a novel framework for reconstructing animatable human avatars from
multiple images, termed CanonicalFusion. Our central idea is to integrate
individual reconstruction results into a shared canonical space. To be
specific, we first predict Linear Blend Skinning (LBS) weight maps and depth
maps using a shared-encoder-dual-decoder network, enabling direct
canonicalization of the 3D mesh from the predicted depth maps. Here, instead of
predicting high-dimensional skinning weights, we infer compressed skinning
weights, i.e., a 3-dimensional vector, with the aid of pre-trained MLP networks.
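
For reference, a minimal sketch of the underlying formulation under standard
SMPL-style LBS assumptions (the decoder D_\psi and latent z_i below are
illustrative notation, not necessarily the paper's own symbols):

\[
\hat{v}_i \;=\; \sum_{j=1}^{J} w_{ij}\, G_j(\theta)\, \bar{v}_i,
\qquad
\mathbf{w}_i \;=\; D_\psi(\mathbf{z}_i), \quad \mathbf{z}_i \in \mathbb{R}^{3},
\]

where \bar{v}_i is a canonical vertex, G_j(\theta) is the rigid transform of
joint j for pose \theta, and the pre-trained MLP D_\psi decodes the
3-dimensional latent \mathbf{z}_i into the full J-dimensional skinning weight
vector \mathbf{w}_i.
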
We also introduce a forward skinning-based differentiable rendering scheme to
merge the reconstructed results from multiple images. This scheme refines the
initial mesh by reposing the canonical mesh via forward skinning and by
minimizing photometric and geometric errors between the rendered and
predicted results. Our optimization jointly considers vertex positions and
colors as well as the joint angles for each image, thereby mitigating the
negative effects of pose errors. We conduct extensive experiments to
demonstrate the effectiveness of our method and compare our CanonicalFusion
with state-of-the-art methods. Our source code is available at
https://github.com/jsshin98/CanonicalFusion.

Comment: ECCV 2024 Accepted (18 pages, 9 figures)
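
For illustration, below is a minimal PyTorch sketch of the forward
skinning-based refinement described above. It is a simplified stand-in, not the
released implementation: the kinematic chain is reduced to toy per-joint
rotations, the targets are random placeholders for the per-image predictions,
and the actual method additionally renders the reposed mesh with a
differentiable renderer to compare rendered and predicted images.

import torch

def axis_angle_to_rotation(aa):
    # aa: (J, 3) axis-angle vectors -> (J, 3, 3) rotations via the matrix exponential.
    J = aa.shape[0]
    K = torch.zeros(J, 3, 3)
    K[:, 0, 1], K[:, 0, 2] = -aa[:, 2], aa[:, 1]
    K[:, 1, 0], K[:, 1, 2] = aa[:, 2], -aa[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -aa[:, 1], aa[:, 0]
    return torch.linalg.matrix_exp(K)

def forward_skinning(verts, lbs_weights, transforms):
    # Repose canonical vertices with linear blend skinning.
    # verts: (V, 3), lbs_weights: (V, J), transforms: (J, 4, 4).
    verts_h = torch.cat([verts, torch.ones(verts.shape[0], 1)], dim=1)  # (V, 4)
    blended = torch.einsum('vj,jab->vab', lbs_weights, transforms)      # (V, 4, 4)
    return torch.einsum('vab,vb->va', blended, verts_h)[:, :3]

V, J, N = 6890, 24, 4                                    # vertices, joints, images (toy sizes)
canonical_verts = torch.randn(V, 3, requires_grad=True)  # optimized vertex positions
vertex_colors = torch.rand(V, 3, requires_grad=True)     # optimized vertex colors
pose_params = torch.zeros(N, J, 3, requires_grad=True)   # per-image joint angles (also optimized)
lbs_weights = torch.softmax(torch.randn(V, J), dim=1)    # stand-in for the predicted LBS weights

# Placeholders for per-image predictions; the real pipeline compares the
# reposed, rendered mesh against the predicted depth and color maps instead.
target_verts = [torch.randn(V, 3) for _ in range(N)]
target_colors = [torch.rand(V, 3) for _ in range(N)]

optim = torch.optim.Adam([canonical_verts, vertex_colors, pose_params], lr=1e-3)
for step in range(200):
    loss = 0.0
    for i in range(N):
        rot = axis_angle_to_rotation(pose_params[i])     # toy: no kinematic chain or translations
        T = torch.eye(4).repeat(J, 1, 1)
        T[:, :3, :3] = rot
        posed = forward_skinning(canonical_verts, lbs_weights, T)
        loss = loss + (posed - target_verts[i]).abs().mean()            # geometric term
        loss = loss + (vertex_colors - target_colors[i]).abs().mean()   # photometric term
    optim.zero_grad()
    loss.backward()
    optim.step()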