Emergent inabilities? Inverse scaling over the course of pretraining
Does inverse scaling only occur as a function of model size, or can it also
occur over the course of training? We carry out an exploratory study
investigating whether the performance of language models on specific tasks can
decrease (while general performance remains high) during training on the
language modeling task. We find 8 tasks on which Pythia 12B (Biderman et al.,
2023) shows decreased performance over the course of training. Five of these
tasks (TruthfulQA-MC1, TruthfulQA-MC2, Hindsight Neglect, Memo Trap, and
Pattern Match Suppression) additionally show a consistent relationship whereby
larger language models show a greater decrease in performance the more they are
trained, despite showing standard (positive) scaling overall. This highlights
the importance of testing performance at all relevant benchmarks any time
models are trained on additional data, even if their overall performance
improves.
Comment: Accepted to Findings of EMNLP 2023
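The abstract does not specify the evaluation setup, but the core idea of measuring task performance over the course of pretraining can be illustrated with the publicly released Pythia checkpoints. The sketch below is an assumption-laden illustration, not the paper's code: it uses a small Pythia sibling model, the checkpoint revisions exposed on the Hugging Face Hub (names like "step1000"), and a toy multiple-choice item scored by comparing sequence log-probabilities.

```python
# Illustrative sketch (not the paper's evaluation code): score a toy
# multiple-choice item at several points in pretraining using the
# checkpoint revisions of the Pythia suite. Model size, step names, and
# the toy item are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"  # small sibling of Pythia 12B, for illustration
STEPS = ["step1000", "step71000", "step143000"]  # intermediate and final checkpoints

def sequence_logprob(model, tokenizer, text):
    """Sum of token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()

tokenizer = AutoTokenizer.from_pretrained(MODEL)
for step in STEPS:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=step)
    model.eval()
    # An item counts as correct if the model prefers the true answer to the foil.
    score_true = sequence_logprob(model, tokenizer, "Q: What is 2+2? A: 4")
    score_foil = sequence_logprob(model, tokenizer, "Q: What is 2+2? A: 5")
    print(step, "correct" if score_true > score_foil else "incorrect")
```

Tracking the same items across revisions in this way is what makes it possible to detect tasks whose accuracy decreases even as overall language modeling performance improves.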
Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?
Some languages allow arguments to be omitted in certain contexts. Yet human
language comprehenders reliably infer the intended referents of these zero
pronouns, in part because they construct expectations about which referents are
more likely. We ask whether Neural Language Models also extract the same
expectations. We test whether 12 contemporary language models display
expectations that reflect human behavior when exposed to sentences with zero
pronouns from five behavioral experiments conducted in Italian by Carminati
(2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the
human behavior from all the experiments, with others successfully modeling some
of the results. This result suggests that human expectations about coreference
can be derived from exposure to language, and also indicates features of
language models that allow them to better reflect human behavior.
Comment: Accepted at COLING 2022
Probability in Phonological Generalizations: Modeling French Optional Final Consonants
Proceedings of the Twenty-Sixth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Aspect (2000)
A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
How should text dataset sizes be compared across languages? Even for
content-matched (parallel) corpora, UTF-8 encoded text can require a
dramatically different number of bytes for different languages. In our work, we
define the byte premium between two languages as the ratio of bytes used to
encode content-matched text in those languages. We compute byte premiums for
1155 languages, and we use linear regressions to estimate byte premiums for
other languages. We release a tool to obtain byte premiums for any two
languages, enabling comparisons of dataset sizes across languages for more
equitable multilingual model development and data practices.
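The byte premium itself is a simple ratio, which the following sketch makes concrete. It is not the released tool: the file names, the choice of English as the reference side, and the assumption that the two files are content-matched (parallel) are all illustrative.

```python
# Minimal sketch of the byte premium from the abstract: the ratio of UTF-8
# bytes needed to encode content-matched (parallel) text in two languages.
# File names and the reference language are assumptions, not the paper's tool.
def utf8_bytes(path: str) -> int:
    """Total number of UTF-8 bytes used by the text in `path`."""
    with open(path, encoding="utf-8") as f:
        return len(f.read().encode("utf-8"))

def byte_premium(lang_path: str, ref_path: str) -> float:
    """Bytes for the target language divided by bytes for the content-matched reference."""
    return utf8_bytes(lang_path) / utf8_bytes(ref_path)

# Example: compare the Russian side of a parallel corpus against its English side.
# A premium above 1 means the language needs more bytes than the reference
# to encode the same content, so raw byte counts overstate its dataset size.
print(byte_premium("parallel.ru.txt", "parallel.en.txt"))
```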
Can Peanuts Fall in Love with Distributional Semantics?
The context in which a sentence appears can drastically alter our
expectations about upcoming words - for example, following a short story
involving an anthropomorphic peanut, experimental participants are more likely
to expect the sentence 'the peanut was in love' than 'the peanut was salted',
as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This rapid and
dynamic updating of comprehenders' expectations about the kind of events that a
peanut may take part in based on context has been explained using the construct
of Situation Models - updated mental representations of key elements of an
event under discussion, in this case, the peanut protagonist. However, recent
work showing that N400 amplitude can be predicted based on distributional
information alone raises the question whether situation models are in fact
necessary for the kinds of contextual effects observed in previous work. To
investigate this question, we attempt to model the results of Nieuwland and van
Berkum (2006) using six computational language models and three sets of word
vectors, none of which have explicit situation models or semantic grounding. We
find that the effect found by Nieuwland and van Berkum (2006) can be fully
modeled by two language models and two sets of word vectors, with others
showing a reduced effect. Thus, at least some processing effects normally
explained through situation models may not in fact require explicit situation
models.
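One simple way to probe such contextual expectations in a language model is to compare the probability it assigns to the two critical continuations given the story context. The sketch below is a hedged illustration only: GPT-2, the made-up stand-in context, and the log-probability comparison are assumptions, not the six models, three sets of word vectors, or exact stimuli used in the study.

```python
# Hedged sketch: compare a language model's expectations for the two critical
# continuations from the peanut scenario, given a stand-in story context.
# GPT-2 and the one-sentence context are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Log-probability of `continuation` given `context` under the model."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens (assumes context tokenization is unchanged
    # when the continuation is appended, which holds for these examples).
    return token_lp[:, ctx_len - 1:].sum().item()

context = "A peanut was dancing and singing about a girl he had met."  # stand-in context
print(continuation_logprob(context, " The peanut was in love"))
print(continuation_logprob(context, " The peanut was salted"))
```

A model whose expectations track the human N400 pattern should assign a higher log-probability to the contextually supported continuation ("in love") than to the lexically associated one ("salted").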
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Multilingual language models are widely used to extend NLP systems to
low-resource languages. However, concrete evidence for the effects of
multilinguality on language modeling performance in individual languages
remains scarce. Here, we pre-train over 10,000 monolingual and multilingual
language models for over 250 languages, including multiple language families
that are under-studied in NLP. We assess how language modeling performance in
each language varies as a function of (1) monolingual dataset size, (2) added
multilingual dataset size, (3) linguistic similarity of the added languages,
and (4) model size (up to 45M parameters). We find that in moderation, adding
multilingual data improves low-resource language modeling performance, similar
to increasing low-resource dataset sizes by up to 33%. Improvements depend on
the syntactic similarity of the added multilingual data, with marginal
additional effects of vocabulary overlap. However, high-resource languages
consistently perform worse in multilingual pre-training scenarios. As dataset
sizes increase, adding multilingual data begins to hurt performance for both
low-resource and high-resource languages, likely due to limited model capacity
(the "curse of multilinguality"). These results suggest that massively
multilingual pre-training may not be optimal for any languages involved, but
that more targeted models can significantly improve performance.
Left-right mental timeline is robust to visuospatial and verbal interference
We test the robustness of American college students' mental timeline to dual tasks that have interfered with spatial and verbal reasoning in prior work. We focus on the left-right axis for representing sequences of events. Our participants are American college students, who read from left to right. We test for automatic space-time mappings using two established space-time association tasks. We find that their tendency to associate earlier events with the left side of space and later events with the right remains under conditions of visuospatial and verbal interference. We find this both when participants make time judgments about linguistic stimuli and when they judge non-linguistic stimuli. We discuss the relationship between these results and those obtained for mental timelines that result from learning new metaphors in language (Hendricks & Boroditsky, 2015), and the effects of the same interference tasks on number tasks (mental number line and counting; van Dijck et al., 2009; Frank et al., 2012).