Does GPT-4 Pass the Turing Test?
We evaluated GPT-4 in a public online Turing Test. The best-performing GPT-4
prompt passed in 41% of games, outperforming baselines set by ELIZA (27%) and
GPT-3.5 (14%), but falling short of chance and the baseline set by human
participants (63%). Participants' decisions were based mainly on linguistic
style (35%) and socio-emotional traits (27%), supporting the idea that
intelligence is not sufficient to pass the Turing Test. Participants'
demographics, including education and familiarity with LLMs, did not predict
detection rate, suggesting that even those who understand systems deeply and
interact with them frequently may be susceptible to deception. Despite known
limitations as a test of intelligence, we argue that the Turing Test continues
to be relevant as an assessment of naturalistic communication and deception. AI
models with the ability to masquerade as humans could have widespread societal
consequences, and we analyse the effectiveness of different strategies and
criteria for judging humanlikeness.
Comment: 25 pages, 21 figure
When Little Things Mean a Lot: On the Inefficiency of Item Pricing Laws
We study item-pricing laws (IPLs), which require that each item in a store be individually marked with a price sticker, and examine and quantify their costs and benefits. On the cost side, we argue that item-pricing laws increase retailers' costs, forcing them to raise prices. We test this prediction using data on retail prices from large supermarket chains in the Tri-State area of New York, New Jersey, and Connecticut. The Tri-States offer a unique setting, a natural experiment, for studying item-pricing laws: the states vary in their use of such laws but otherwise offer similar markets, with chains operating in close proximity to each other in a relatively homogeneous socioeconomic environment. We use two datasets, one emphasizing breadth of coverage across products and the other across stores. We find consistent evidence across products, product categories, stores, chains, states, and sampling periods that prices at stores facing item-pricing laws are higher than prices at stores not facing them, by about 25¢ or 9.6% per item.
We also have data from supermarket chains that would be subject to item-pricing laws but are exempted from the requirement because they use costly electronic shelf label (ESL) systems. Using these data as a control, we find that ESL store prices fall between the item-pricing-law and non-item-pricing-law store prices: they are lower than the item-pricing-law store prices by about 15¢ per item on average, but higher than the non-item-pricing-law store prices by about 10¢ per item on average.
On the benefit side, we study the frequency and magnitude of supermarket pricing errors, which item-pricing laws are supposed to prevent. We quantify the benefits of the IPLs by conservatively assuming that they successfully accomplish their mission of preventing all pricing mistakes.
Comparing the costs of item-pricing laws to their benefits, we find that the item-pricing law costs are at least an order of magnitude higher than the benefits.
Keywords: Item Pricing Laws, Costs of Item Pricing Laws, Benefits of Item Pricing Laws, Cost of Price Adjustment, Pricing Accuracy, Electronic Shelf Label System, Pricing Regulation, Cost of Pricing, Supermarket Chains
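The price differences reported in this abstract fit together arithmetically: the two gaps measured against the ESL control stores should sum to the overall IPL premium, and the 25¢/9.6% figures jointly imply an average item price. A back-of-the-envelope check (using only the rounded figures quoted above, not the paper's underlying data):

```python
# All figures in cents per item, taken from the abstract's rounded estimates.
IPL_PREMIUM = 25          # IPL stores vs. non-IPL stores
ESL_BELOW_IPL = 15        # ESL stores are ~15c below IPL stores
ESL_ABOVE_NON_IPL = 10    # ...and ~10c above non-IPL stores

# The two ESL gaps should sum to the overall IPL premium.
assert ESL_BELOW_IPL + ESL_ABOVE_NON_IPL == IPL_PREMIUM

# Equating the 25c premium with 9.6% per item implies an average
# item price of roughly 25 / 0.096 = 260c, i.e. about $2.60.
implied_avg_price = IPL_PREMIUM / 0.096
print(f"implied average item price: {implied_avg_price:.0f}c")
```

This only checks internal consistency of the rounded numbers; the paper's own estimates come from regression on the store-level price data.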
Emergent inabilities? Inverse scaling over the course of pretraining
Does inverse scaling only occur as a function of model size, or can it also
occur over the course of training? We carry out an exploratory study
investigating whether the performance of language models on specific tasks can
decrease (while general performance remains high) during training on the
language modeling task. We find 8 tasks on which Pythia 12B (Biderman et al.,
2023) shows decreased performance over the course of training. Five of these
tasks (TruthfulQA-MC1, TruthfulQA-MC2, Hindsight Neglect, Memo Trap, and
Pattern Match Suppression) additionally show a consistent relationship whereby
larger language models show a greater decrease in performance the more they are
trained, despite showing standard (positive) scaling overall. This highlights
the importance of testing performance on all relevant benchmarks any time
models are trained on additional data, even if their overall performance
improves.
Comment: Accepted to Findings of EMNLP 202
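The monitoring the abstract recommends can be sketched as a simple check over per-checkpoint scores: flag any task whose accuracy at the final checkpoint has dropped below an earlier peak, even while other tasks show standard positive scaling. The task names and accuracy trajectories below are invented for illustration, not the paper's measurements:

```python
def inverse_scaling_tasks(history, tolerance=0.01):
    """history: {task: [accuracy at each successive checkpoint]}.
    Returns tasks whose final accuracy sits more than `tolerance`
    below that task's peak accuracy during training."""
    flagged = []
    for task, accs in history.items():
        if max(accs) - accs[-1] > tolerance:
            flagged.append(task)
    return flagged

# Hypothetical trajectories across four checkpoints.
history = {
    "lambada":   [0.45, 0.55, 0.62, 0.66],  # standard (positive) scaling
    "memo_trap": [0.70, 0.62, 0.55, 0.50],  # degrades throughout training
    "hindsight": [0.52, 0.58, 0.49, 0.47],  # peaks early, then degrades
}
print(inverse_scaling_tasks(history))  # ['memo_trap', 'hindsight']
```

A real evaluation harness would score saved checkpoints (e.g. the public Pythia revisions) on each benchmark to build `history`; the detection step itself is this simple.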
Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?
Some languages allow arguments to be omitted in certain contexts. Yet human
language comprehenders reliably infer the intended referents of these zero
pronouns, in part because they construct expectations about which referents are
more likely. We ask whether Neural Language Models also extract the same
expectations. We test whether 12 contemporary language models display
expectations that reflect human behavior when exposed to sentences with zero
pronouns from five behavioral experiments conducted in Italian by Carminati
(2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the
human behavior from all the experiments, with others successfully modeling some
of the results. This result suggests that human expectations about coreference
can be derived from exposure to language, and also indicates features of
language models that allow them to better reflect human behavior.
Comment: Accepted at COLING 202
Can Peanuts Fall in Love with Distributional Semantics?
The context in which a sentence appears can drastically alter our
expectations about upcoming words - for example, following a short story
involving an anthropomorphic peanut, experimental participants are more likely
to expect the sentence 'the peanut was in love' than 'the peanut was salted',
as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This rapid and
dynamic updating of comprehenders' expectations about the kind of events that a
peanut may take part in based on context has been explained using the construct
of Situation Models - updated mental representations of key elements of an
event under discussion, in this case, the peanut protagonist. However, recent
work showing that N400 amplitude can be predicted based on distributional
information alone raises the question of whether situation models are in fact
necessary for the kinds of contextual effects observed in previous work. To
investigate this question, we attempt to model the results of Nieuwland and van
Berkum (2006) using six computational language models and three sets of word
vectors, none of which have explicit situation models or semantic grounding. We
find that the effect found by Nieuwland and van Berkum (2006) can be fully
modeled by two language models and two sets of word vectors, with others
showing a reduced effect. Thus, at least some processing effects normally
explained through situation models may not in fact require explicit situation
models.
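The distributional approach being tested can be illustrated in miniature: score each candidate continuation by the cosine similarity between its word vector and an average of the context's word vectors, with no situation model anywhere. The 3-dimensional vectors below are invented toy values (real studies use pretrained embeddings); the point is only the scoring mechanism:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: a context about an anthropomorphic peanut (dancing, smiling)
# and two candidate continuations.
vectors = {
    "dance":  [0.90, 0.10, 0.00],
    "smile":  [0.80, 0.30, 0.10],
    "love":   [0.85, 0.20, 0.10],   # animate-friendly continuation
    "salted": [0.10, 0.20, 0.90],   # food-related continuation
}

# Average the context words into a single context vector.
context = [sum(xs) / 2 for xs in zip(vectors["dance"], vectors["smile"])]

for word in ("love", "salted"):
    print(word, round(cosine(context, vectors[word]), 3))
```

With an animate context, "love" scores higher than "salted" purely from vector geometry, mirroring the direction of the N400 effect without any explicit representation of the peanut protagonist.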
Probability in Phonological Generalizations: Modeling French Optional Final Consonants
Proceedings of the Twenty-Sixth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Aspect (2000)
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Multilingual language models are widely used to extend NLP systems to
low-resource languages. However, concrete evidence for the effects of
multilinguality on language modeling performance in individual languages
remains scarce. Here, we pre-train over 10,000 monolingual and multilingual
language models for over 250 languages, including multiple language families
that are under-studied in NLP. We assess how language modeling performance in
each language varies as a function of (1) monolingual dataset size, (2) added
multilingual dataset size, (3) linguistic similarity of the added languages,
and (4) model size (up to 45M parameters). We find that in moderation, adding
multilingual data improves low-resource language modeling performance, similar
to increasing low-resource dataset sizes by up to 33%. Improvements depend on
the syntactic similarity of the added multilingual data, with marginal
additional effects of vocabulary overlap. However, high-resource languages
consistently perform worse in multilingual pre-training scenarios. As dataset
sizes increase, adding multilingual data begins to hurt performance for both
low-resource and high-resource languages, likely due to limited model capacity
(the "curse of multilinguality"). These results suggest that massively
multilingual pre-training may not be optimal for any languages involved, but
that more targeted models can significantly improve performance.