We aim to obtain an interpretable, expressive, and disentangled scene
representation that contains comprehensive structural and textural information
for each object. Previous scene representations learned by neural networks are
often uninterpretable, limited to a single object, or lacking 3D knowledge. In
this work, we propose 3D scene de-rendering networks (3D-SDN) to address the
above issues by integrating disentangled representations for semantics,
geometry, and appearance into a deep generative model. Our scene encoder
performs inverse graphics, translating a scene into a structured object-wise
representation. Our decoder has two components: a differentiable shape renderer
and a neural texture generator. The disentanglement of semantics, geometry, and
appearance supports 3D-aware scene manipulation, e.g., rotating and moving
objects freely while keeping their shape and texture consistent, and changing an
object's appearance without affecting its shape. Experiments demonstrate that our
editing scheme based on 3D-SDN is superior to its 2D counterpart.

Comment: NeurIPS 2018. Code: https://github.com/ysymyth/3D-SDN. Website: http://3dsdn.csail.mit.edu
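The released code above is the authoritative implementation. Purely as an illustration of the encode-edit-decode workflow the abstract describes, the sketch below encodes an image into per-object geometry and texture codes, edits one object's geometry code, and decodes. All module names, code dimensions, and the stand-in MLP used in place of the differentiable shape renderer and neural texture generator are assumptions for this sketch, not the paper's architecture.

```python
# Hypothetical sketch of a de-render -> edit -> re-render loop.
# Shapes, dims, and the MLP "renderer" are illustrative assumptions only.
import torch
import torch.nn as nn

N_OBJ, GEOM_DIM, TEX_DIM = 4, 16, 64  # assumed number of object slots / code sizes

class SceneEncoder(nn.Module):
    """De-renders an image into per-object geometry and texture codes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.geom_head = nn.Linear(128, N_OBJ * GEOM_DIM)  # pose / scale / shape
        self.tex_head = nn.Linear(128, N_OBJ * TEX_DIM)    # appearance

    def forward(self, img):
        h = self.features(img)
        geom = self.geom_head(h).view(-1, N_OBJ, GEOM_DIM)
        tex = self.tex_head(h).view(-1, N_OBJ, TEX_DIM)
        return geom, tex

class SceneDecoder(nn.Module):
    """Stand-in for the two-branch decoder (shape renderer + texture generator)."""
    def __init__(self, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.render = nn.Linear(N_OBJ * (GEOM_DIM + TEX_DIM),
                                3 * img_size * img_size)

    def forward(self, geom, tex):
        codes = torch.cat([geom, tex], dim=-1).flatten(1)
        return self.render(codes).view(-1, 3, self.img_size, self.img_size)

enc, dec = SceneEncoder(), SceneDecoder()
img = torch.rand(1, 3, 64, 64)

with torch.no_grad():
    geom, tex = enc(img)
    # Move object 0 by editing only its geometry code (assumed pose dims);
    # the texture code is untouched, so its appearance should persist.
    geom[:, 0, :3] += torch.tensor([0.5, 0.0, 0.1])
    edited = dec(geom, tex)

print(edited.shape)  # torch.Size([1, 3, 64, 64])
```

Editing the geometry slot of one object while leaving its texture code fixed is what the claimed disentanglement enables: the object can be rotated or moved while its appearance stays consistent, and vice versa for appearance edits.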