Emergent inabilities? Inverse scaling over the course of pretraining
Does inverse scaling only occur as a function of model size, or can it also
occur over the course of training? We carry out an exploratory study
investigating whether the performance of language models on specific tasks can
decrease (while general performance remains high) during training on the
language modeling task. We find eight tasks on which Pythia 12B (Biderman et al.,
2023) shows decreased performance over the course of training. Five of these
tasks (TruthfulQA-MC1, TruthfulQA-MC2, Hindsight Neglect, Memo Trap, and
Pattern Match Suppression) additionally show a consistent relationship whereby
larger language models show a greater decrease in performance the more they are
trained, despite showing standard (positive) scaling overall. This highlights
the importance of testing performance on all relevant benchmarks any time
models are trained on additional data, even if their overall performance
improves.
Comment: Accepted to Findings of EMNLP 2023
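As a rough illustration of how such checkpoint-level evaluation can be run, the sketch below scores a single item across Pythia's public training checkpoints, which EleutherAI exposes as HuggingFace revisions. The prompt, the answer choice, and the small pythia-70m stand-in are illustrative assumptions, not the paper's actual benchmark items or evaluation harness.

```python
# Minimal sketch: score one completion across Pythia training checkpoints.
# EleutherAI publishes intermediate checkpoints as HuggingFace revisions
# (e.g. "step1000"); pythia-70m stands in for the paper's 12B model, and the
# prompt/choice below are illustrative, not items from the actual benchmarks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"  # swap in "EleutherAI/pythia-12b" to replicate
CHECKPOINTS = ["step1000", "step36000", "step143000"]  # early, mid, final

def choice_logprob(model, tokenizer, prompt, choice):
    """Summed log-probability of `choice` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(log_probs[0, i - 1, full_ids[0, i]].item()
               for i in range(prompt_ids.shape[1], full_ids.shape[1]))

tokenizer = AutoTokenizer.from_pretrained(MODEL)
for step in CHECKPOINTS:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=step).eval()
    score = choice_logprob(model, tokenizer,
                           "Absence makes the heart grow", " fonder")
    print(step, round(score, 3))
```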
Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?
Some languages allow arguments to be omitted in certain contexts. Yet human
language comprehenders reliably infer the intended referents of these zero
pronouns, in part because they construct expectations about which referents are
more likely. We ask whether Neural Language Models also extract the same
expectations. We test whether 12 contemporary language models display
expectations that reflect human behavior when exposed to sentences with zero
pronouns from five behavioral experiments conducted in Italian by Carminati
(2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the
human behavior from all the experiments, with others successfully modeling some
of the results. This finding suggests that human expectations about coreference
can be derived from exposure to language, and also points to features of
language models that allow them to better reflect human behavior.
Comment: Accepted at COLING 2022
Can Peanuts Fall in Love with Distributional Semantics?
The context in which a sentence appears can drastically alter our
expectations about upcoming words - for example, following a short story
involving an anthropomorphic peanut, experimental participants are more likely
to expect the sentence 'the peanut was in love' than 'the peanut was salted',
as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This rapid and
dynamic updating of comprehenders' expectations about the kind of events that a
peanut may take part in based on context has been explained using the construct
of Situation Models - updated mental representations of key elements of an
event under discussion, in this case, the peanut protagonist. However, recent
work showing that N400 amplitude can be predicted based on distributional
information alone raises the question whether situation models are in fact
necessary for the kinds of contextual effects observed in previous work. To
investigate this question, we attempt to model the results of Nieuwland and van
Berkum (2006) using six computational language models and three sets of word
vectors, none of which have explicit situation models or semantic grounding. We
find that the effect found by Nieuwland and van Berkum (2006) can be fully
modeled by two language models and two sets of word vectors, with others
showing a reduced effect. Thus, at least some processing effects normally
explained through situation models may not in fact require explicit situation
models.
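As a concrete illustration of this modeling approach, the sketch below compares the log-probability a causal language model assigns to the two continuations after a supportive story context. GPT-2 and the abbreviated context are stand-ins; the paper's six models and the original Nieuwland and van Berkum (2006) materials are not reproduced here.

```python
# Sketch: compare the log-probability a causal LM assigns to two continuations
# of the same story context. GPT-2 and the abbreviated context are stand-ins
# for the paper's models and the original experimental materials.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(context, continuation):
    """Summed log-probability of `continuation` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    return sum(log_probs[0, i - 1, full_ids[0, i]].item()
               for i in range(ctx_ids.shape[1], full_ids.shape[1]))

story = ("A woman saw a dancing peanut who had a big smile on his face. "
         "The peanut was singing about a girl he had just met.")
print("in love:", continuation_logprob(story, " The peanut was in love"))
print("salted: ", continuation_logprob(story, " The peanut was salted"))
```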
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models
Abstract grammatical knowledge - of parts of speech and grammatical patterns
- is key to the capacity for linguistic generalization in humans. But how
abstract is grammatical knowledge in large language models? In the human
literature, compelling evidence for grammatical abstraction comes from
structural priming. A sentence that shares the same grammatical structure as a
preceding sentence is processed and produced more readily. Because confounds
exist when using stimuli in a single language, evidence of abstraction is even
more compelling from crosslingual structural priming, where use of a syntactic
structure in one language primes an analogous structure in another language. We
measure crosslingual structural priming in large language models, comparing
model behavior to human experimental results from eight crosslingual
experiments covering six languages, and four monolingual structural priming
experiments in three non-English languages. We find evidence for abstract
monolingual and crosslingual grammatical representations in the models that
function similarly to those found in humans. These results demonstrate that
grammatical representations in multilingual language models are not only
similar across languages but can also causally influence text produced in
different languages.
Comment: Accepted at EMNLP 2023
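One common way to quantify structural priming in a language model, sketched below under stated assumptions rather than as the paper's exact procedure, is to compare the probability of a target sentence after a structurally matching prime with its probability after a structurally different prime. The dative sentence pairs and the small XGLM-564M stand-in are illustrative.

```python
# Sketch: quantify structural priming as the boost in a target sentence's
# log-probability after a same-structure prime versus a different-structure
# prime. XGLM-564M stands in for the paper's larger multilingual models, and
# the dative sentence pairs are illustrative examples, not the actual stimuli.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M").eval()

def logprob_given(prime, target):
    """Summed log-probability of `target` following `prime` (assumes the
    prime's tokenization is a prefix of the combined tokenization)."""
    prime_ids = tokenizer(prime, return_tensors="pt").input_ids
    full_ids = tokenizer(prime + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    return sum(log_probs[0, i - 1, full_ids[0, i]].item()
               for i in range(prime_ids.shape[1], full_ids.shape[1]))

do_prime = "The teacher gave the student a book."     # double-object prime
po_prime = "The teacher gave a book to the student."  # prepositional-object prime
target = "The chef handed the waiter a plate."        # double-object target

effect = logprob_given(do_prime, target) - logprob_given(po_prime, target)
print(f"Priming effect (log-probability difference): {effect:.3f}")  # > 0 = primed
```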
So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements
More predictable words are easier to process - they are read faster and
elicit smaller neural signals associated with processing difficulty, most
notably, the N400 component of the event-related brain potential. Thus, it has
been argued that prediction of upcoming words is a key component of language
comprehension, and that studying the amplitude of the N400 is a valuable way to
investigate the predictions we make. In this study, we investigate whether the
linguistic predictions of computational language models or humans better
reflect the way in which natural language stimuli modulate the amplitude of the
N400. One important difference in the linguistic predictions of humans versus
computational language models is that while language models base their
predictions exclusively on the preceding linguistic context, humans may rely on
other factors. We find that the predictions of three top-of-the-line
contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more
closely than human predictions. This suggests that the predictive processes
underlying the N400 may be more sensitive to the surface-level statistics of
language than previously thought.
Different kinds of cognitive plausibility: why are transformers better than RNNs at predicting N400 amplitude?
Despite being designed for performance rather than cognitive plausibility, transformer language models have been found to be better at predicting metrics used to assess human language comprehension than language models with other architectures, such as recurrent neural networks. Based on how well they predict the N400, a neural signal associated with processing difficulty, we propose and provide evidence for one possible explanation: their predictions are affected by the preceding context in a way analogous to the effect of semantic facilitation in humans.
Distributional Semantics Still Can't Account for Affordances
Can we know a word by the company it keeps? Aspects of meaning that concern physical interactions might be particularly difficult to learn from language alone. Glenberg & Robertson (2000) found that although human comprehenders were sensitive to the distinction between afforded and nonafforded actions, distributional semantic models were not. We tested whether technological advances have made distributional models more sensitive to affordances by replicating their experiment with modern Neural Language Models (NLMs). We found that only one NLM (GPT-3) was sensitive to the affordedness of actions. Moreover, GPT-3 accounted for only one third of the effect of affordedness on human sensibility judgments. These results imply that people use processes that go beyond distributional statistics to understand linguistic expressions, and that NLP systems may need to be augmented with such capabilities.
Strong Prediction: Language Model Surprisal Explains Multiple N400 Effects
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400 response to a stimulus reflects the extent to which the stimulus was predicted, the extent to which the stimulus is semantically similar to its preceding context, or both. We use state-of-the-art machine learning tools to investigate which of these three accounts is best supported by the evidence. GPT-3, a neural language model trained to compute the conditional probability of any word based on the words that precede it, was used to operationalize contextual predictability. In particular, we used an information-theoretic construct known as surprisal (the negative logarithm of the conditional probability). Contextual semantic similarity was operationalized by using two high-quality co-occurrence-derived vector-based meaning representations for words: GloVe and fastText. The cosine between the vector representation of the sentence frame and final word was used to derive contextual cosine similarity estimates. A series of regression models were constructed, where these variables, along with cloze probability and plausibility ratings, were used to predict single-trial N400 amplitudes recorded from healthy adults as they read sentences whose final word varied in its predictability, plausibility, and semantic relationship to the likeliest sentence completion. Statistical model comparison indicated GPT-3 surprisal provided the best account of N400 amplitude and suggested that apparently disparate N400 effects of expectancy, plausibility, and contextual semantic similarity can be reduced to variation in the predictability of words. The results are argued to support predictive coding in the human language network.
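The two predictors at the heart of this comparison can be stated compactly: surprisal is the negative logarithm of a word's conditional probability given its context, and contextual semantic similarity is the cosine between a vector for the sentence frame and the vector for the final word. The sketch below illustrates both constructs, with GPT-2 standing in for GPT-3 and random placeholder vectors standing in for GloVe/fastText embeddings.

```python
# Sketch of the abstract's two key constructs: surprisal from a causal LM
# (GPT-2 standing in for GPT-3) and contextual cosine similarity. The random
# 300-d vectors are placeholders for real GloVe/fastText word embeddings.
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisal(context, word):
    """Surprisal in bits: -log2 P(word | context)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + word, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    logp_nats = sum(log_probs[0, i - 1, full_ids[0, i]].item()
                    for i in range(ctx_ids.shape[1], full_ids.shape[1]))
    return -logp_nats / np.log(2)  # convert nats to bits

def contextual_cosine(context_vectors, word_vector):
    """Cosine between the averaged context word vectors and the final word."""
    frame = np.mean(context_vectors, axis=0)
    return (frame @ word_vector /
            (np.linalg.norm(frame) * np.linalg.norm(word_vector)))

print(surprisal("The children went outside to", " play"))

rng = np.random.default_rng(0)  # placeholder embeddings, not real GloVe
ctx_vecs, w_vec = rng.normal(size=(5, 300)), rng.normal(size=300)
print(contextual_cosine(ctx_vecs, w_vec))
```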