The introduction of neural radiance fields has greatly improved the
effectiveness of view synthesis for monocular videos. However, existing
algorithms face difficulties when dealing with uncontrolled or lengthy
scenarios, and require extensive training time specific to each new scenario.
To tackle these limitations, we propose DynPoint, an algorithm designed to
facilitate the rapid synthesis of novel views for unconstrained monocular
videos. Rather than encoding the entirety of the scenario information into a
latent representation, DynPoint concentrates on predicting the explicit 3D
correspondence between neighboring frames to realize information aggregation.
Specifically, this correspondence is predicted by estimating consistent depth
and scene flow across frames. The acquired correspondence is then used to
aggregate information from multiple reference frames to a target frame by
constructing hierarchical neural point
clouds. The resulting framework enables swift and accurate view synthesis for
desired views of target frames. Experimental results demonstrate that our
proposed method accelerates training considerably, typically by an order of
magnitude, while yielding outcomes comparable to prior approaches. Furthermore,
our method exhibits strong robustness in handling long-duration videos without
learning a canonical representation of video content.
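
To make the correspondence-based aggregation concrete, the following is a minimal sketch, not the authors' implementation, of how a reference frame's pixels can be lifted to 3D with an estimated depth map, displaced by a predicted scene flow, and re-projected into the target camera. All function and argument names (warp_reference_to_target, depth_ref, flow_ref_to_tgt, K, T_ref, T_tgt) are illustrative assumptions, written here in PyTorch.

```python
import torch

def warp_reference_to_target(depth_ref, flow_ref_to_tgt, K, T_ref, T_tgt):
    """Illustrative sketch: lift reference-frame pixels to 3D using depth,
    move them with a predicted scene flow, and express/project them in the
    target camera. Names and conventions are assumptions, not the paper's API.

    depth_ref:       (H, W) depth map of the reference frame
    flow_ref_to_tgt: (H, W, 3) scene flow from reference to target time,
                     expressed in world coordinates
    K:               (3, 3) camera intrinsics
    T_ref, T_tgt:    (4, 4) camera-to-world poses of reference and target
    """
    H, W = depth_ref.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()       # (H, W, 3)

    # Unproject pixels to reference-camera coordinates, then to world coordinates.
    cam_pts = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T).T            # (H*W, 3)
    cam_pts = cam_pts * depth_ref.reshape(-1, 1)
    cam_pts_h = torch.cat([cam_pts, torch.ones(H * W, 1)], dim=-1)      # homogeneous
    world_pts = (T_ref @ cam_pts_h.T).T[:, :3]                          # (H*W, 3)

    # Apply the predicted scene flow to reach the target time step.
    world_pts_tgt = world_pts + flow_ref_to_tgt.reshape(-1, 3)

    # Transform into the target camera frame and project to target pixels.
    world_h = torch.cat([world_pts_tgt, torch.ones(H * W, 1)], dim=-1)
    cam_tgt = (torch.linalg.inv(T_tgt) @ world_h.T).T[:, :3]
    proj = (K @ cam_tgt.T).T
    uv_tgt = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return cam_tgt.reshape(H, W, 3), uv_tgt.reshape(H, W, 2)
```

In a full pipeline, points gathered in this way from several reference frames would be merged into the hierarchical neural point cloud used to render the target view.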