We introduce TIDEE, an embodied agent that tidies up a disordered scene based
on learned commonsense object placement and room arrangement priors. TIDEE
explores a home environment, detects objects that are out of their natural
place, infers plausible object contexts for them, localizes such contexts in
the current scene, and repositions the objects. Commonsense priors are encoded
in three modules: i) visuo-semantic detectors that identify out-of-place objects,
ii) an associative neural graph memory of objects and spatial relations that
proposes plausible semantic receptacles and surfaces for repositioning objects,
and iii) a visual search network that guides the agent's exploration to
efficiently localize the receptacle of interest in the current scene and
reposition the object. We test TIDEE on tidying up disorganized scenes in the
AI2THOR simulation environment. TIDEE carries out the task directly from pixel
and raw depth input without ever having observed the same room beforehand,
relying only on priors learned from a separate set of training houses. Human
evaluations on the resulting room reorganizations show TIDEE outperforms
ablative versions of the model that do not use one or more of the commonsense
priors. On a related room rearrangement benchmark that allows the agent to view
the goal state prior to rearrangement, a simplified version of our model
outperforms a top-performing method by a large margin. Code and
data are available at the project website: https://tidee-agent.github.io/
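
For concreteness, the following is a minimal, hypothetical Python sketch of the
perceive-infer-search-reposition loop that the three modules implement. All names
(OutOfPlaceDetector, GraphMemory, VisualSearchPolicy, StubAgent, tidy_step) are
illustrative assumptions for this sketch, not TIDEE's actual code or API, and the
learned components are replaced by hard-coded stand-ins.

from dataclasses import dataclass, field

@dataclass
class Detection:
    category: str          # object class, e.g. "mug"
    out_of_place: bool     # module i)'s verdict

class OutOfPlaceDetector:
    """Module i): visuo-semantic detector that flags out-of-place objects."""
    def detect(self, rgb, depth):
        # Stand-in for a learned detector over pixels and raw depth.
        return [Detection("mug", out_of_place=True),
                Detection("sofa", out_of_place=False)]

class GraphMemory:
    """Module ii): associative graph memory proposing plausible receptacles."""
    PRIORS = {"mug": "countertop", "book": "shelf"}  # toy placement priors
    def propose_receptacle(self, category):
        return self.PRIORS.get(category, "table")

class VisualSearchPolicy:
    """Module iii): guides exploration toward the likely receptacle location."""
    def next_waypoint(self, receptacle):
        # Stand-in for a network scoring candidate search locations.
        return {"countertop": (3.0, 1.5)}.get(receptacle, (0.0, 0.0))

@dataclass
class StubAgent:
    """Toy environment interface so the sketch runs end to end."""
    log: list = field(default_factory=list)
    visible: set = field(default_factory=set)

    def observe(self):
        return None, None  # would return RGB and depth frames

    def pick_up(self, category):
        self.log.append(f"pick_up({category})")

    def go_to(self, waypoint):
        self.log.append(f"go_to{waypoint}")
        self.visible.add("countertop")  # pretend navigation reveals the target

    def sees(self, receptacle):
        return receptacle in self.visible

    def place(self, category, receptacle):
        self.log.append(f"place({category} -> {receptacle})")

def tidy_step(agent, detector, memory, search):
    rgb, depth = agent.observe()
    for det in detector.detect(rgb, depth):
        if not det.out_of_place:
            continue
        agent.pick_up(det.category)                       # grab misplaced object
        target = memory.propose_receptacle(det.category)  # module ii)
        while not agent.sees(target):                     # module iii) search loop
            agent.go_to(search.next_waypoint(target))
        agent.place(det.category, target)                 # reposition the object

agent = StubAgent()
tidy_step(agent, OutOfPlaceDetector(), GraphMemory(), VisualSearchPolicy())
print(agent.log)  # ['pick_up(mug)', 'go_to(3.0, 1.5)', 'place(mug -> countertop)']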