Evidence of Human-Like Visual-Linguistic Integration in Multimodal Large Language Models During Predictive Language Processing
The advanced language processing abilities of large language models (LLMs)
have stimulated debate over their capacity to replicate human-like cognitive
processes. One differentiating factor between language processing in LLMs and
humans is that language input in humans is often grounded in several perceptual
modalities, whereas most LLMs process solely text-based information. Multimodal
grounding allows humans to integrate, e.g., visual context with linguistic
information and thereby place constraints on the space of upcoming words,
reducing cognitive load and improving comprehension. Recent multimodal LLMs
(mLLMs) combine a visual-linguistic embedding space with a transformer-type
attention mechanism for next-word prediction. Here we ask whether predictive
language processing based on multimodal input in mLLMs aligns with humans.
Two hundred participants watched short audio-visual clips and estimated the
predictability of an upcoming verb or noun. The same clips were processed by
the mLLM CLIP, with predictability scores based on comparing image and text
feature vectors. Eye-tracking was used to estimate what visual features
participants attended to, and CLIP's visual attention weights were recorded. We
find that alignment of predictability scores was driven by multimodality of
CLIP (no alignment for a unimodal state-of-the-art LLM) and by the attention
mechanism (no alignment when attention weights were perturbed or when the
same input was fed to a multimodal model without attention). We further find a
significant spatial overlap between CLIP's visual attention weights and human
eye-tracking data. Results suggest that comparable processes of integrating
multimodal information, guided by attention to relevant visual features,
support predictive language processing in mLLMs and humans.
Comment: 13 pages, 4 figures, submitted to journa
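
The abstract states that CLIP's predictability scores were based on comparing image and text feature vectors, without giving the exact formula. The following is a minimal sketch of one common way to make such a comparison: cosine similarity between an image vector and each candidate word's text vector, normalized with a softmax. The softmax step, the `temperature` parameter, the function names, and the toy three-dimensional vectors are all assumptions made for illustration and are not taken from the paper; real CLIP features have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predictability_scores(image_vec, text_vecs, temperature=1.0):
    """Turn image-text similarities into scores that sum to 1.

    Assumption: a softmax over candidate words; the source only says
    the scores compare image and text feature vectors.
    """
    sims = {w: cosine_similarity(image_vec, v) for w, v in text_vecs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {w: math.exp((s - top) / temperature) for w, s in sims.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical toy features standing in for one video frame and
# three candidate upcoming verbs.
image_features = [0.9, 0.1, 0.3]
candidate_features = {
    "pour": [0.8, 0.2, 0.4],
    "throw": [0.1, 0.9, 0.2],
    "read": [0.2, 0.1, 0.9],
}

scores = predictability_scores(image_features, candidate_features)
best = max(scores, key=scores.get)  # candidate most similar to the image
```

Under this sketch, the candidate whose text vector points in nearly the same direction as the image vector receives the highest score, which is the sense in which visual context constrains the space of upcoming words.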