Psychometric Predictive Power of Large Language Models
Next-word probabilities from language models have been shown to successfully
simulate human reading behavior. Building on this, we show that, interestingly,
instruction-tuned large language models (LLMs) yield worse psychometric
predictive power (PPP) for human reading behavior than base LLMs with
equivalent perplexities. In other words, instruction tuning, which helps LLMs
provide human-preferred responses, does not always make them human-like from
the computational psycholinguistics perspective. In addition, we explore
prompting methodologies for simulating human reading behavior with LLMs, showing
that prompts reflecting a particular linguistic hypothesis lead LLMs to exhibit
better PPP, though it remains worse than that of base LLMs. These results
highlight that recent instruction tuning and prompting do not offer better
estimates than direct probability measurements from base LLMs in cognitive modeling.
Comment: 8 pages
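To make the "direct probability measurements" concrete, here is a minimal sketch (not code from the paper; the model choice "gpt2" and the helper name are illustrative assumptions) of computing per-word surprisal, the quantity whose fit to human reading times underlies PPP:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Example base LM; any causal LM from the Hugging Face hub could stand in.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def surprisals(text):
        """Return (token, surprisal in bits) for every token after the first."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability assigned to each token given its preceding context
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        picked = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
        tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:])
        return [(t, -lp.item() / math.log(2)) for t, lp in zip(tokens, picked)]

    print(surprisals("The horse raced past the barn fell."))

In PPP evaluations, surprisals like these are typically entered as a predictor in a regression of eye-tracking or self-paced reading times.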
Context Limitations Make Neural Language Models More Human-Like
Language models (LMs) have been used in cognitive modeling as well as
engineering studies -- they compute information-theoretic complexity metrics
that simulate humans' cognitive load during reading. This study highlights a
limitation of modern neural LMs as the model of choice for this purpose: there
is a discrepancy between their context access capacity and that of humans.
Our results showed that constraining the LMs' context access improved their
simulation of human reading behavior. We also showed that LM-human gaps in
context access were associated with specific syntactic constructions;
incorporating syntactic biases into LMs' context access might enhance their
cognitive plausibility.
Comment: Accepted by EMNLP 2022 (main, long paper)
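A minimal sketch of the kind of constraint described here (an assumed setup, not the paper's implementation; "gpt2" and the window sizes are placeholders): recompute surprisal while exposing the model to only the k most recent tokens:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model choice
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def truncated_surprisal(ids, i, k):
        """Surprisal (bits) of token ids[i] given only the k tokens before it."""
        context = ids[max(0, i - k):i]
        with torch.no_grad():
            logits = model(torch.tensor([context])).logits[0, -1]
        return -torch.log_softmax(logits, dim=-1)[ids[i]].item() / math.log(2)

    ids = tokenizer("After the storm passed, the crew repaired the torn sails.").input_ids
    full_context = [truncated_surprisal(ids, i, k=1024) for i in range(1, len(ids))]
    two_tokens   = [truncated_surprisal(ids, i, k=2)    for i in range(1, len(ids))]

Comparing how well the full_context and two_tokens surprisals fit human reading times is one way to test whether limited context access makes the model more human-like.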
Transformer Language Models Handle Word Frequency in Prediction Head
The prediction head is a crucial component of Transformer language models.
Despite its direct impact on prediction, this component has often been
overlooked in analyzing Transformers. In this study, we investigate the inner
workings of the prediction head, specifically focusing on bias parameters. Our
experiments with BERT and GPT-2 models reveal that the biases in their word
prediction heads play a significant role in the models' ability to reflect word
frequency in a corpus, aligning with the logit adjustment method commonly used
in long-tailed learning. We also quantify the effect of controlling the biases
in practical auto-regressive text generation scenarios; under a particular
setting, more diverse text can be generated without compromising text quality.
Comment: 11 pages, 12 figures, accepted to ACL 2023 Findings (short paper)
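A minimal sketch of this kind of inspection (an assumed setup, not the paper's code; bert-base-uncased and the toy corpus are illustrative stand-ins): read out the prediction-head bias and correlate it with corpus log-frequency, in the spirit of logit adjustment:

    import torch
    from collections import Counter
    from transformers import AutoTokenizer, BertForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    bias = model.cls.predictions.bias.detach()  # one scalar per vocabulary item

    # Toy stand-in corpus; a real analysis would use a large reference corpus.
    corpus = ["the cat sat on the mat", "the dog chased the cat over the mat"]
    counts = Counter(t for s in corpus for t in tokenizer.tokenize(s))
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(t) for t in counts])
    logfreq = torch.log(torch.tensor([float(c) for c in counts.values()]))

    # Correlation between head bias and log-frequency over the observed tokens;
    # subtracting the bias from the logits at decoding time is one way to
    # "control" it in generation.
    corr = torch.corrcoef(torch.stack([bias[ids], logfreq]))[0, 1]
    print(f"bias vs. log-frequency correlation: {corr:.3f}")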
Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism
Large language models (LLMs) take advantage of step-by-step reasoning
instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their
ability to perform CoT-style reasoning robustly is of interest from a probing
perspective. In this study, we inspect the step-by-step reasoning ability of
LLMs with a focus on negation, which is a core linguistic phenomenon that is
difficult to process. In particular, we introduce several controlled settings
(e.g., reasoning in case of fictional entities) to evaluate the logical
reasoning abilities of the models. We observed that dozens of modern LLMs were
not robust against lexical negation (e.g., plausible -> implausible) when
performing CoT-style reasoning, and the results highlight unique limitations in
each LLM family.
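A minimal sketch of this kind of controlled probe (the premise, question pair, and the `ask` callable are all hypothetical; the paper's actual benchmark differs):

    # Minimal probe: does a CoT-prompted model flip its answer correctly when
    # "plausible" is lexically negated to "implausible"?
    PREMISE = "All mammals are animals. A whale is a mammal."
    PAIRS = [
        ("Is the statement 'a whale is an animal' plausible?", "yes"),
        ("Is the statement 'a whale is an animal' implausible?", "no"),  # negated form
    ]

    def make_prompt(question):
        # Step-by-step (CoT-style) instruction appended to the query.
        return f"{PREMISE}\n{question}\nLet's think step by step. Answer yes or no."

    def robust_to_negation(ask):
        """`ask` is any callable sending a prompt to an LLM and returning its text."""
        answers = [ask(make_prompt(q)).strip().lower() for q, _ in PAIRS]
        # Naive string match, for illustration only.
        return all(ans.startswith(gold) for ans, (_, gold) in zip(answers, PAIRS))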
Lower Perplexity is Not Always Human-Like
In computational psycholinguistics, various language models have been
evaluated against human reading behavior (e.g., eye movement) to build
human-like computational models. However, most previous efforts have focused
almost exclusively on English, despite the recent trend toward linguistic
universality in the broader community. To fill this gap, this paper
investigates whether the established results in computational psycholinguistics
can be generalized across languages. Specifically, we re-examine an established
generalization -- the lower perplexity a language model has, the more
human-like the language model is -- in Japanese, a language whose typological
structure differs markedly from English. Our experiments demonstrate that this established
generalization exhibits a surprising lack of universality; namely, lower
perplexity is not always human-like. Moreover, this discrepancy between English
and Japanese is further explored from the perspective of (non-)uniform
information density. Overall, our results suggest that a cross-lingual
evaluation will be necessary to construct human-like computational models.
Comment: Accepted by ACL 2021
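To make the perplexity/human-likeness distinction concrete, here is a minimal sketch (toy synthetic data; real studies in this line of work use mixed-effects regressions over eye-tracking corpora) of scoring how much surprisal improves a reading-time regression beyond a baseline:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 200
    length = rng.integers(2, 12, n).astype(float)    # word length, a baseline predictor
    surprisal = rng.gamma(2.0, 2.0, n)               # per-word surprisal from some LM
    rt = 180 + 10 * length + 12 * surprisal + rng.normal(0, 20, n)  # toy reading times

    base = LinearRegression().fit(length[:, None], rt)
    X = np.column_stack([length, surprisal])
    full = LinearRegression().fit(X, rt)

    # A PPP-style score: fit gained by adding surprisal over the baseline.
    gain = full.score(X, rt) - base.score(length[:, None], rt)
    print(f"fit gain from surprisal: {gain:.3f}")

A model's perplexity and the fit gain it produces here need not move together, which is exactly the dissociation this paper reports when moving from English to Japanese.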