We tackle the task of scalable unsupervised object-centric representation
learning for 3D scenes. Existing approaches struggle to generalize to larger
scenes because their learning processes rely on a fixed global coordinate
system. In contrast, we propose to
learn view-invariant 3D object representations in localized object coordinate
systems. To this end, we estimate object pose and appearance representations
separately and explicitly map object representations across views while
maintaining object identities. We adopt an amortized variational inference
pipeline that can process sequential input and scalably update object latent
distributions online. To handle large-scale scenes with varying numbers of
objects, we further introduce a Cognitive Map that registers and queries
objects on a per-scene global map, enabling scalable representation learning.
We explore the object-centric neural radiance field (NeRF) as our 3D
scene representation, jointly modeled within our unsupervised object-centric
learning framework. Experimental results on synthetic and real datasets show
that our method infers and maintains object-centric representations of 3D
scenes and outperforms previous models.