Obtaining photorealistic reconstructions of objects from sparse views is
inherently ambiguous and can only be achieved by learning suitable
reconstruction priors. Earlier works on sparse rigid object reconstruction
successfully learned such priors from large datasets such as CO3D. In this
paper, we extend this approach to dynamic objects. We use cats and dogs as a
representative example and introduce Common Pets in 3D (CoP3D), a collection of
crowd-sourced videos showing around 4,200 distinct pets. CoP3D is one of the
first large-scale datasets for benchmarking non-rigid 3D reconstruction "in the
wild". We also propose Tracker-NeRF, a method for learning 4D reconstruction
from our dataset. At test time, given a small number of video frames of an
unseen object, Tracker-NeRF predicts the trajectories of its 3D points and
generates new views, interpolating viewpoint and time. Results on CoP3D reveal
significantly better non-rigid new-view synthesis performance than existing
baselines.