Calixto, I.

Frank, A.

Gatt, A.

Parcalabescu, L.

English

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.

International Migration, Integration and Social Cohesion online publications

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

https://pure.uva.nl/ws/files/99339949/2021.mmsr_1.4.pdf
