In this paper, we propose a table and image generation task to verify how the
knowledge about entities acquired from natural language is retained in Vision &
Language (V & L) models. This task consists of two parts: the first is to
generate a table containing knowledge about an entity and its related image,
and the second is to generate an image from an entity with a caption and a
table containing related knowledge of the entity. In both tasks, the model must
know the entities used to perform the generation properly. We created the
Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000
infoboxes in English Wikipedia articles to perform the proposed tasks. We
evaluated the performance on the tasks with respect to the above research
question using the V & L model OFA, which has achieved state-of-the-art results
in multiple tasks. Experimental results show that OFA forgets part of its
entity knowledge by pre-training as a complement to improve the performance of
image related tasks.Comment: Accepted at ACL 202