Despite the impressive performance achieved by pre-trained
language-and-vision models in downstream tasks, it remains an open question
whether this reflects a proper understanding of image-text interaction. In this
work, we explore to what extent they handle basic linguistic constructions --
active-passive voice, coordination, and relative clauses -- that even preschool
children can typically master. We present BLA, a novel, automatically
constructed benchmark to evaluate multimodal models on these Basic Language
Abilities. We show that different types of Transformer-based systems, such as
CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting,
in line with previous findings. Our experiments, in particular, show that most
of the tested models only marginally benefit when fine-tuned or prompted with
construction-specific samples. Yet, the generative BLIP2 shows promising
trends, especially in an in-context learning setting. This opens the door to
using BLA not only as an evaluation benchmark but also as a means of improving
models' basic language abilities.

Comment: This is the camera-ready version of the paper that will be published
in the Proceedings of EMNLP 2023 (Singapore, 6-10 December 2023).