We are interested in image manipulation via natural language text -- a task
that is useful for multiple AI applications but requires complex reasoning over
multi-modal spaces. We extend recently proposed Neuro Symbolic Concept Learning
(NSCL), which has been quite effective for the task of Visual Question
Answering (VQA), for the task of image manipulation. Our system referred to as
NeuroSIM can perform complex multi-hop reasoning over multi-object scenes and
only requires weak supervision in the form of annotated data for VQA. NeuroSIM
parses an instruction into a symbolic program, based on a Domain Specific
Language (DSL) comprising of object attributes and manipulation operations,
that guides its execution. We create a new dataset for the task, and extensive
experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA
baselines that make use of supervised data for manipulation.Comment: EMNLP 2023 (long paper, main conference