Recovering the skeletal shape of an animal from a monocular video is a
longstanding challenge. Prevailing animal reconstruction methods often adopt a
control-point driven animation model and optimize bone transforms individually
without considering skeletal topology, yielding unsatisfactory shape and
articulation. In contrast, humans can easily infer the articulation structure
of an unknown animal by associating it with a seen articulated character in
their memory. Inspired by this fact, we present CASA, a novel Category-Agnostic
Skeletal Animal reconstruction method consisting of two major components: a
video-to-shape retrieval process and a neural inverse graphics framework.
During inference, CASA first retrieves an articulated shape from a 3D character
asset bank whose renderings score highly against the input video, as measured
by a pretrained language-vision model. CASA then integrates the
retrieved character into an inverse graphics framework and jointly infers the
shape deformation, skeleton structure, and skinning weights through
optimization. Experiments validate the efficacy of CASA in both shape
reconstruction and articulation. We further demonstrate that the resulting
skeletal-animated characters can be used for re-animation.

Comment: Accepted to NeurIPS 202