In this paper, we focus on the challenges of modeling deformable 3D objects from casual videos. With the popularity of neural radiance fields (NeRF), many works extend NeRF to dynamic scenes by pairing a canonical NeRF with a deformation model that transforms 3D points between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) to achieve the canonical-observation transformation. However, the linearly weighted combination of rigid transformation matrices is not guaranteed to be rigid; in fact, unexpected scale and shear factors often appear. As a result, using LBS as the deformation model often leads to skin-collapsing artifacts for bending or twisting motions.
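To make the loss of rigidity concrete, consider a point attached with equal weights to two bones that twist by angles ±θ about a shared axis; this is a standard illustration of the problem rather than a result specific to our setting:

\[
\tfrac{1}{2} R_z(\theta) + \tfrac{1}{2} R_z(-\theta)
= \begin{bmatrix} \cos\theta & 0 & 0 \\ 0 & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]

The blended matrix shrinks the plane orthogonal to the twist axis by a factor of cos θ; at θ = 90° the skin collapses onto the joint axis, producing the well-known candy-wrapper artifact.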
To solve this problem, we propose neural dual quaternion blend skinning (NeuDBS) to achieve 3D point deformation, which performs rigid transformations without skin-collapsing artifacts.
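The rigidity guarantee comes from blending unit dual quaternions and renormalizing, as in classic dual quaternion blending. The following is a minimal NumPy sketch of that blending step under our own naming, with the neural prediction of skinning weights omitted:

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def quat_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def rigid_to_dual_quat(q_rot, t):
    """Encode a rotation quaternion q_rot and translation t as a dual quaternion."""
    q_dual = 0.5 * quat_mul(np.array([0.0, *t]), q_rot)
    return q_rot, q_dual

def dqb(weights, dual_quats):
    """Blend dual quaternions; renormalizing keeps the result a rigid transform."""
    reals = [dq[0] for dq in dual_quats]
    duals = [dq[1] for dq in dual_quats]
    # Flip antipodal quaternions (q and -q encode the same rotation) onto one hemisphere.
    signs = [1.0 if np.dot(reals[0], r) >= 0.0 else -1.0 for r in reals]
    real = sum(w * s * r for w, s, r in zip(weights, signs, reals))
    dual = sum(w * s * d for w, s, d in zip(weights, signs, duals))
    norm = np.linalg.norm(real)
    return real / norm, dual / norm  # unit dual quaternion: no scale or shear

def dq_apply(real, dual, x):
    """Apply the rigid transform encoded by a unit dual quaternion to a 3D point."""
    t = 2.0 * quat_mul(dual, quat_conj(real))[1:]
    x_rot = quat_mul(quat_mul(real, np.array([0.0, *x])), quat_conj(real))[1:]
    return x_rot + t
```

Because the blend is renormalized back onto the unit dual quaternion manifold, the interpolated transform stays rigid for any convex weights, which is exactly the property LBS lacks.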
To register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings, which encode 3D points in the canonical space, and 2D image features by solving an optimal transport problem.
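Such a soft assignment is typically computed with entropic-regularized optimal transport via Sinkhorn iterations. The sketch below is an illustrative NumPy version rather than our exact formulation: the cosine cost, uniform marginals, and all array shapes and names are assumptions made for the example.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=100):
    """Entropic-regularized OT between uniform marginals via Sinkhorn scaling."""
    n, m = cost.shape
    K = np.exp(-cost / epsilon)      # Gibbs kernel
    a = np.full(n, 1.0 / n)          # source marginal (one unit of mass per pixel)
    b = np.full(m, 1.0 / m)          # target marginal (canonical embeddings)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan with marginals (a, b)

# Hypothetical features: N pixels and M canonical points, D-dimensional each.
pixel_feats = np.random.randn(128, 16)
canon_feats = np.random.randn(256, 16)
pixel_feats /= np.linalg.norm(pixel_feats, axis=1, keepdims=True)
canon_feats /= np.linalg.norm(canon_feats, axis=1, keepdims=True)

cost = 1.0 - pixel_feats @ canon_feats.T   # cosine distance in [0, 2]
plan = sinkhorn(cost)
matches = plan.argmax(axis=1)              # hard correspondence per pixel, if needed
```

The entropic regularizer makes the plan dense and differentiable, so the matching can be trained end to end with the rest of the pipeline.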
In addition, we introduce a texture filtering approach for texture rendering that effectively minimizes the impact of noisy colors outside the target deformable objects. Extensive experiments on real and synthetic datasets show that our approach can reconstruct 3D models of humans and animals with better qualitative and quantitative performance than state-of-the-art methods.