Grounding language in vision is an active field of research seeking to
construct cognitively plausible word and sentence representations by
incorporating perceptual knowledge from vision into text-based representations.
Despite many attempts at language grounding, achieving an optimal equilibrium
between textual representations of the language and our embodied experiences
remains an open field. Some common concerns are the following. Is visual
grounding advantageous for abstract words, or is its effectiveness restricted
to concrete words? What is the optimal way of bridging the gap between text and
vision? To what extent is perceptual knowledge from images advantageous for
acquiring high-quality embeddings? Leveraging the current advances in machine
learning and natural language processing, the present study addresses these
questions by proposing a simple yet very effective computational grounding
model for pre-trained word embeddings. Our model effectively balances the
interplay between language and vision by aligning textual embeddings with
visual information while simultaneously preserving the distributional
statistics that characterize word usage in text corpora. By applying a learned
alignment, we are able to indirectly ground unseen words including abstract
words. A series of evaluations on a range of behavioural datasets shows that
visual grounding is beneficial not only for concrete words but also for
abstract words, lending support to the indirect theory of abstract concepts.
Moreover, our approach offers advantages for contextualized embeddings, such as
those generated by BERT, but only when trained on corpora of modest,
cognitively plausible sizes. Code and grounded embeddings for English are
available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2