Emergent inabilities? Inverse scaling over the course of pretraining
Does inverse scaling only occur as a function of model size, or can it also
occur over the course of training? We carry out an exploratory study
investigating whether the performance of language models on specific tasks can
decrease (while general performance remains high) during training on the
language modeling task. We find eight tasks on which Pythia 12B (Biderman et al.,
2023) shows decreased performance over the course of training. Five of these
tasks (TruthfulQA-MC1, TruthfulQA-MC2, Hindsight Neglect, Memo Trap, and
Pattern Match Suppression) additionally show a consistent relationship whereby
larger language models show a greater decrease in performance the more they are
trained, despite showing standard (positive) scaling overall. This highlights
the importance of testing performance on all relevant benchmarks any time
models are trained on additional data, even if their overall performance
improves.
Comment: Accepted to Findings of EMNLP 2023
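As a rough illustration of how such checkpoint-level evaluation can be run, the sketch below scores a single item across Pythia's public training checkpoints, which EleutherAI exposes as HuggingFace revisions. The prompt, the answer choice, and the small pythia-70m stand-in are illustrative assumptions, not the paper's actual benchmark items or evaluation harness.

```python
# Minimal sketch: score one completion across Pythia training checkpoints.
# EleutherAI publishes intermediate checkpoints as HuggingFace revisions
# (e.g. "step1000"); pythia-70m stands in for the paper's 12B model, and the
# prompt/choice below are illustrative, not items from the actual benchmarks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"  # swap in "EleutherAI/pythia-12b" to replicate
CHECKPOINTS = ["step1000", "step36000", "step143000"]  # early, mid, final

def choice_logprob(model, tokenizer, prompt, choice):
    """Summed log-probability of `choice` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(log_probs[0, i - 1, full_ids[0, i]].item()
               for i in range(prompt_ids.shape[1], full_ids.shape[1]))

tokenizer = AutoTokenizer.from_pretrained(MODEL)
for step in CHECKPOINTS:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=step).eval()
    score = choice_logprob(model, tokenizer,
                           "Absence makes the heart grow", " fonder")
    print(step, round(score, 3))
```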
Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?
Some languages allow arguments to be omitted in certain contexts. Yet human
language comprehenders reliably infer the intended referents of these zero
pronouns, in part because they construct expectations about which referents are
more likely. We ask whether Neural Language Models also extract the same
expectations. We test whether 12 contemporary language models display
expectations that reflect human behavior when exposed to sentences with zero
pronouns from five behavioral experiments conducted in Italian by Carminati
(2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the
human behavior from all the experiments, with others successfully modeling some
of the results. This finding suggests that human expectations about coreference
can be derived from exposure to language, and also points to features of
language models that allow them to better reflect human behavior.
Comment: Accepted at COLING 2022
Can Peanuts Fall in Love with Distributional Semantics?
The context in which a sentence appears can drastically alter our
expectations about upcoming words - for example, following a short story
involving an anthropomorphic peanut, experimental participants are more likely
to expect the sentence 'the peanut was in love' than 'the peanut was salted',
as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This rapid and
dynamic updating of comprehenders' expectations about the kind of events that a
peanut may take part in based on context has been explained using the construct
of Situation Models - updated mental representations of key elements of an
event under discussion, in this case, the peanut protagonist. However, recent
work showing that N400 amplitude can be predicted based on distributional
information alone raises the question whether situation models are in fact
necessary for the kinds of contextual effects observed in previous work. To
investigate this question, we attempt to model the results of Nieuwland and van
Berkum (2006) using six computational language models and three sets of word
vectors, none of which have explicit situation models or semantic grounding. We
find that the effect found by Nieuwland and van Berkum (2006) can be fully
modeled by two language models and two sets of word vectors, with others
showing a reduced effect. Thus, at least some processing effects normally
explained through situation models may not in fact require explicit situation
models.
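As a concrete illustration of this modeling approach, the sketch below compares the log-probability a causal language model assigns to the two continuations after a supportive story context. GPT-2 and the abbreviated context are stand-ins; the paper's six models and the original Nieuwland and van Berkum (2006) materials are not reproduced here.

```python
# Sketch: compare the log-probability a causal LM assigns to two continuations
# of the same story context. GPT-2 and the abbreviated context are stand-ins
# for the paper's models and the original experimental materials.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(context, continuation):
    """Summed log-probability of `continuation` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    return sum(log_probs[0, i - 1, full_ids[0, i]].item()
               for i in range(ctx_ids.shape[1], full_ids.shape[1]))

story = ("A woman saw a dancing peanut who had a big smile on his face. "
         "The peanut was singing about a girl he had just met.")
print("in love:", continuation_logprob(story, " The peanut was in love"))
print("salted: ", continuation_logprob(story, " The peanut was salted"))
```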
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models
Abstract grammatical knowledge - of parts of speech and grammatical patterns
- is key to the capacity for linguistic generalization in humans. But how
abstract is grammatical knowledge in large language models? In the human
literature, compelling evidence for grammatical abstraction comes from
structural priming. A sentence that shares the same grammatical structure as a
preceding sentence is processed and produced more readily. Because confounds
exist when using stimuli in a single language, evidence of abstraction is even
more compelling from crosslingual structural priming, where use of a syntactic
structure in one language primes an analogous structure in another language. We
measure crosslingual structural priming in large language models, comparing
model behavior to human experimental results from eight crosslingual
experiments covering six languages, and four monolingual structural priming
experiments in three non-English languages. We find evidence for abstract
monolingual and crosslingual grammatical representations in the models that
function similarly to those found in humans. These results demonstrate that
grammatical representations in multilingual language models are not only
similar across languages but can also causally influence text produced in
different languages.
Comment: Accepted at EMNLP 2023
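One common way to quantify structural priming in a language model, sketched below under stated assumptions rather than as the paper's exact procedure, is to compare the probability of a target sentence after a structurally matching prime with its probability after a structurally different prime. The dative sentence pairs and the small XGLM-564M stand-in are illustrative.

```python
# Sketch: quantify structural priming as the boost in a target sentence's
# log-probability after a same-structure prime versus a different-structure
# prime. XGLM-564M stands in for the paper's larger multilingual models, and
# the dative sentence pairs are illustrative examples, not the actual stimuli.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M").eval()

def logprob_given(prime, target):
    """Summed log-probability of `target` following `prime` (assumes the
    prime's tokenization is a prefix of the combined tokenization)."""
    prime_ids = tokenizer(prime, return_tensors="pt").input_ids
    full_ids = tokenizer(prime + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    return sum(log_probs[0, i - 1, full_ids[0, i]].item()
               for i in range(prime_ids.shape[1], full_ids.shape[1]))

do_prime = "The teacher gave the student a book."     # double-object prime
po_prime = "The teacher gave a book to the student."  # prepositional-object prime
target = "The chef handed the waiter a plate."        # double-object target

effect = logprob_given(do_prime, target) - logprob_given(po_prime, target)
print(f"Priming effect (log-probability difference): {effect:.3f}")  # > 0 = primed
```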
So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements
More predictable words are easier to process - they are read faster and
elicit smaller neural signals associated with processing difficulty, most
notably, the N400 component of the event-related brain potential. Thus, it has
been argued that prediction of upcoming words is a key component of language
comprehension, and that studying the amplitude of the N400 is a valuable way to
investigate the predictions we make. In this study, we investigate whether the
linguistic predictions of computational language models or humans better
reflect the way in which natural language stimuli modulate the amplitude of the
N400. One important difference in the linguistic predictions of humans versus
computational language models is that while language models base their
predictions exclusively on the preceding linguistic context, humans may rely on
other factors. We find that the predictions of three top-of-the-line
contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more
closely than human predictions. This suggests that the predictive processes
underlying the N400 may be more sensitive to the surface-level statistics of
language than previously thought.
Different kinds of cognitive plausibility: why are transformers better than RNNs at predicting N400 amplitude?
Despite being designed for performance rather than cognitive plausibility, transformer language models have been found to be better at predicting metrics used to assess human language comprehension than language models with other architectures, such as recurrent neural networks. Based on how well they predict the N400, a neural signal associated with processing difficulty, we propose and provide evidence for one possible explanation: their predictions are affected by the preceding context in a way analogous to the effect of semantic facilitation in humans.
Distributional Semantics Still Can't Account for Affordances
Can we know a word by the company it keeps? Aspects of meaning that concern physical interactions might be particularly difficult to learn from language alone. Glenberg & Robertson (2000) found that although human comprehenders were sensitive to the distinction between afforded and nonafforded actions, distributional semantic models were not. We tested whether technological advances have made distributional models more sensitive to affordances by replicating their experiment with modern Neural Language Models (NLMs). We found that only one NLM (GPT-3) was sensitive to the affordedness of actions. Moreover, GPT-3 accounted for only one third of the effect of affordedness on human sensibility judgments. These results imply that people use processes that go beyond distributional statistics to understand linguistic expressions, and that NLP systems may need to be augmented with such capabilities.
Strong Prediction: Language Model Surprisal Explains Multiple N400 Effects
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400 response to a stimulus reflects the extent to which the stimulus was predicted, the extent to which the stimulus is semantically similar to its preceding context, or both. We use state-of-the-art machine learning tools to investigate which of these three accounts is best supported by the evidence. GPT-3, a neural language model trained to compute the conditional probability of any word based on the words that precede it, was used to operationalize contextual predictability. In particular, we used an information-theoretic construct known as surprisal (the negative logarithm of the conditional probability). Contextual semantic similarity was operationalized by using two high-quality co-occurrence-derived vector-based meaning representations for words: GloVe and fastText. The cosine between the vector representation of the sentence frame and final word was used to derive contextual cosine similarity estimates. A series of regression models were constructed, where these variables, along with cloze probability and plausibility ratings, were used to predict single-trial N400 amplitudes recorded from healthy adults as they read sentences whose final word varied in its predictability, plausibility, and semantic relationship to the likeliest sentence completion. Statistical model comparison indicated GPT-3 surprisal provided the best account of N400 amplitude and suggested that apparently disparate N400 effects of expectancy, plausibility, and contextual semantic similarity can be reduced to variation in the predictability of words. The results are argued to support predictive coding in the human language network.
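The two predictors at the heart of this comparison can be stated compactly: surprisal is the negative logarithm of a word's conditional probability given its context, and contextual semantic similarity is the cosine between a vector for the sentence frame and the vector for the final word. The sketch below illustrates both constructs, with GPT-2 standing in for GPT-3 and random placeholder vectors standing in for GloVe/fastText embeddings.

```python
# Sketch of the abstract's two key constructs: surprisal from a causal LM
# (GPT-2 standing in for GPT-3) and contextual cosine similarity. The random
# 300-d vectors are placeholders for real GloVe/fastText word embeddings.
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisal(context, word):
    """Surprisal in bits: -log2 P(word | context)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + word, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    logp_nats = sum(log_probs[0, i - 1, full_ids[0, i]].item()
                    for i in range(ctx_ids.shape[1], full_ids.shape[1]))
    return -logp_nats / np.log(2)  # convert nats to bits

def contextual_cosine(context_vectors, word_vector):
    """Cosine between the averaged context word vectors and the final word."""
    frame = np.mean(context_vectors, axis=0)
    return (frame @ word_vector /
            (np.linalg.norm(frame) * np.linalg.norm(word_vector)))

print(surprisal("The children went outside to", " play"))

rng = np.random.default_rng(0)  # placeholder embeddings, not real GloVe
ctx_vecs, w_vec = rng.normal(size=(5, 300)), rng.normal(size=300)
print(contextual_cosine(ctx_vecs, w_vec))
```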