Can language models handle recursively nested grammatical structures? A case study on comparing models and humans
How should we compare the capabilities of language models (LMs) and humans? I
draw inspiration from comparative psychology to highlight some challenges. In
particular, I consider a case study: processing of recursively nested
grammatical structures. Prior work suggests that LMs cannot handle these
structures as reliably as humans can. However, the humans were provided with
instructions and training, while the LMs were evaluated zero-shot. I therefore
match the evaluation more closely. Providing large LMs with a simple prompt --
substantially less content than the human training -- allows the LMs to
consistently outperform the human results, and even to extrapolate to more
deeply nested conditions than were tested with humans. Further, reanalyzing the
prior human data suggests that the humans may not perform above chance at the
difficult structures initially. Thus, large LMs may indeed process recursively
nested grammatical structures as reliably as humans. This case study highlights
how discrepancies in the evaluation can confound comparisons of language models
and humans. I therefore reflect on the broader challenge of comparing human and
model capabilities, and highlight an important difference between evaluating
cognitive models and foundation models.
Know your audience: specializing grounded language models with listener subtraction
Effective communication requires adapting to the idiosyncrasies of each
communicative context--such as the common ground shared with each partner.
Humans demonstrate this ability to specialize to their audience in many
contexts, such as the popular game Dixit. We take inspiration from Dixit to
formulate a multi-agent image reference game where a (trained) speaker model is
rewarded for describing a target image such that one (pretrained) listener
model can correctly identify it among distractors, but another listener cannot.
To adapt, the speaker must exploit differences in the knowledge it shares with
the different listeners. We show that finetuning an attention-based adapter
between a CLIP vision encoder and a large language model in this contrastive,
multi-agent setting gives rise to context-dependent natural language
specialization from rewards only, without direct supervision. Through
controlled experiments, we show that training a speaker with two listeners that
perceive differently, using our method, allows the speaker to adapt to the
idiosyncrasies of the listeners. Furthermore, we show zero-shot transfer of the
specialization to real-world data. Our experiments demonstrate a method for
specializing grounded language models without direct supervision and highlight
the interesting research challenges posed by complex multi-agent communication.
Comment: 28 pages, 9 figures
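The reward structure described in this abstract (one listener should identify the target, the other should not) can be illustrated with a minimal sketch. The abstract does not give the exact reward formula, so the difference of probabilities below is an assumption. The function names and the use of softmax over listener scores are also hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Convert raw listener scores over candidate images into probabilities."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def subtraction_reward(scores_a, scores_b, target_index):
    """Hypothetical speaker reward: high when listener A picks the target
    and listener B does not. This is an assumed form, not the paper's."""
    p_a = softmax(scores_a)[target_index]
    p_b = softmax(scores_b)[target_index]
    return p_a - p_b


# Listener A strongly prefers the target (index 0); listener B prefers a distractor.
reward = subtraction_reward([5.0, 0.0], [0.0, 5.0], target_index=0)
```

Under this assumed form, a description that both listeners understand equally well earns a reward near zero, so the speaker gains only by exploiting what the listeners do not share.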
Evaluating Spatial Understanding of Large Language Models
Large language models (LLMs) show remarkable capabilities across a variety of
tasks. Despite the models only seeing text in training, several recent studies
suggest that LLM representations implicitly capture aspects of the underlying
grounded concepts. Here, we explore LLM representations of a particularly
salient kind of grounded knowledge -- spatial relationships. We design
natural-language navigation tasks and evaluate the ability of LLMs, in
particular GPT-3.5-turbo, GPT-4, and Llama2 series models, to represent and
reason about spatial structures, and compare these abilities to human
performance on the same tasks. These tasks reveal substantial variability in
LLM performance across different spatial structures, including square,
hexagonal, and triangular grids, rings, and trees. We also discover that,
similar to humans, LLMs utilize object names as landmarks for maintaining
spatial maps. Finally, in extensive error analysis, we find that LLMs' mistakes
reflect both spatial and non-spatial factors. These findings suggest that LLMs
appear to capture certain aspects of spatial structure implicitly, but room for
improvement remains.
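To make the kind of task concrete, here is a minimal sketch of how a ground-truth answer for a square-grid navigation question could be computed. The grid contents, move vocabulary, and function name are all hypothetical and are not taken from the paper. The sketch assumes each cell holds a named object that serves as a landmark.

```python
# Hypothetical move vocabulary: (row change, column change) per step.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def final_object(grid, start, steps):
    """Follow the steps from start and return the object at the final cell."""
    row, col = start
    for step in steps:
        d_row, d_col = MOVES[step]
        row, col = row + d_row, col + d_col
        # Reject paths that leave the grid instead of wrapping around.
        if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
            raise ValueError(f"step {step!r} leaves the grid")
    return grid[row][col]


# A 3x3 grid of made-up object names used as landmarks.
grid = [
    ["apple", "lamp", "drum"],
    ["kite", "sofa", "ring"],
    ["boat", "coin", "fern"],
]

answer = final_object(grid, (0, 0), ["right", "down", "down"])
```

A model's free-text answer to the matching natural-language prompt could then be compared against `answer`. Other structures named in the abstract, such as hexagonal grids, rings, and trees, would need a different neighbor definition.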