
    The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

    Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones. Overall, humans are correct more often than out-of-the-box models, and the models are sometimes right for the wrong reasons. Finally, we show that fine-tuning on a large, task-specific dataset can offer a solution to these issues.
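
    A minimal sketch of the kind of evaluation this line of work relies on: scoring a Winograd-style sentence with a pretrained masked language model by comparing the probabilities it assigns to the two candidate referents, before and after a surface perturbation. The model name, the example sentence, and the synonym replacement below are illustrative assumptions, not items from the paper's diagnostic dataset.

        # Minimal sketch: scoring a Winograd-style example with a masked language model.
        # Assumes HuggingFace `transformers` and `bert-base-uncased`; the sentences and
        # candidates are illustrative, not taken from the paper's dataset.
        import torch
        from transformers import AutoTokenizer, AutoModelForMaskedLM

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
        model.eval()

        def candidate_score(sentence_with_mask, candidate):
            """Probability the model assigns to `candidate` at the [MASK] position."""
            inputs = tokenizer(sentence_with_mask, return_tensors="pt")
            mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
            with torch.no_grad():
                logits = model(**inputs).logits
            probs = logits[0, mask_pos].softmax(dim=-1)
            return probs[tokenizer.convert_tokens_to_ids(candidate)].item()

        examples = [
            ("The trophy doesn't fit in the suitcase because the [MASK] is too small.",
             ("trophy", "suitcase")),
            # Synonym-replacement perturbation of the same schema.
            ("The trophy doesn't fit in the bag because the [MASK] is too small.",
             ("trophy", "bag")),
        ]
        for text, candidates in examples:
            scores = {c: candidate_score(text, c) for c in candidates}
            print(text, scores)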

    Causal interventions expose implicit situation models for commonsense language understanding

    Accounts of human language processing have long appealed to implicit ``situation models'' that enrich comprehension with relevant but unstated world knowledge. Here, we apply causal intervention techniques to recent transformer models to analyze performance on the Winograd Schema Challenge (WSC), where a single context cue shifts interpretation of an ambiguous pronoun. We identify a relatively small circuit of attention heads that are responsible for propagating information from the context word that guides which of the candidate noun phrases the pronoun ultimately attends to. We then compare how this circuit behaves in a closely matched ``syntactic'' control where the situation model is not strictly necessary. These analyses suggest distinct pathways through which implicit situation models are constructed to guide pronoun resolution.
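
    The following sketch illustrates the basic ingredient of such causal interventions: ablating an attention head and measuring how the model's preference between the two candidate referents shifts. It assumes HuggingFace transformers with bert-base-uncased, an arbitrary layer/head choice, and an invented example sentence; the paper's circuit analysis is considerably more targeted than simple zero-ablation.

        # Minimal sketch of a head-ablation intervention on a Winograd-style example.
        # Assumes HuggingFace `transformers` and `bert-base-uncased`; the sentence, the
        # layer/head choice, and the margin metric are illustrative only.
        import torch
        from transformers import AutoTokenizer, AutoModelForMaskedLM

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
        model.eval()

        def referent_margin(text, cand_a, cand_b, head_mask=None):
            """log P(cand_a) - log P(cand_b) at the [MASK] slot, optionally with heads ablated."""
            inputs = tokenizer(text, return_tensors="pt")
            mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
            with torch.no_grad():
                logits = model(**inputs, head_mask=head_mask).logits
            log_probs = logits[0, mask_pos].log_softmax(dim=-1)
            ids = tokenizer.convert_tokens_to_ids([cand_a, cand_b])
            return (log_probs[ids[0]] - log_probs[ids[1]]).item()

        # A single context cue ("hungry") shifts the interpretation of the pronoun slot.
        text = "The fish ate the worm because the [MASK] was hungry."

        # Zero-ablate one attention head (layer 8, head 5 -- arbitrary for illustration)
        # and compare the preference margin with and without the intervention.
        n_layers = model.config.num_hidden_layers
        n_heads = model.config.num_attention_heads
        head_mask = torch.ones(n_layers, n_heads)
        head_mask[8, 5] = 0.0

        print("clean margin:  ", referent_margin(text, "fish", "worm"))
        print("ablated margin:", referent_margin(text, "fish", "worm", head_mask))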

    BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

    The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice question answering task designed to test a model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance, which is further widened when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
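
    As a rough illustration of how such a multiple-choice task can be scored with a language model, the sketch below ranks answer options by the likelihood a causal LM assigns to them and repeats the ranking on a paraphrased ("reconstructed") version of the question as a consistency check. The model (gpt2), the puzzle, and the options are invented for illustration and are not drawn from BRAINTEASER.

        # Minimal sketch: likelihood-based multiple-choice scoring plus a consistency
        # check across a paraphrase. Assumes HuggingFace `transformers` and `gpt2`;
        # the puzzle and options are invented, not from the BRAINTEASER dataset.
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")
        model.eval()

        def option_logprob(question, option):
            """Total log-probability of the answer tokens given the question prefix."""
            prefix_ids = tokenizer(question + " Answer:", return_tensors="pt").input_ids
            full_ids = tokenizer(question + " Answer: " + option, return_tensors="pt").input_ids
            with torch.no_grad():
                logits = model(full_ids).logits
            log_probs = logits[0, :-1].log_softmax(dim=-1)
            targets = full_ids[0, 1:]
            # Only sum over the answer tokens, not the shared question prefix.
            answer_span = range(prefix_ids.shape[1] - 1, targets.shape[0])
            return sum(log_probs[i, targets[i]].item() for i in answer_span)

        def predict(question, options):
            return max(options, key=lambda o: option_logprob(question, o))

        question = "A man walks into a field with a backpack and dies. Why?"
        reconstruction = "Why did the man who walked into a field carrying a backpack die?"
        options = ["His parachute failed to open.", "He was allergic to grass.", "He got lost."]

        print(predict(question, options))
        print(predict(reconstruction, options))  # consistency check across the paraphrase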

    Event knowledge in large language models: the gap between the impossible and the unlikely

    Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pre-trained LLMs (from 2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n=1,215), we found that pre-trained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign higher likelihood to possible vs. impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely vs. unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events.
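
    The comparison described above can be approximated by scoring each sentence of a minimal pair with a causal language model and checking whether the plausible version receives the higher likelihood. The sketch below assumes HuggingFace transformers and gpt2 (not one of the five models evaluated in the paper); the sentence pairs echo the examples quoted in the abstract.

        # Minimal sketch: comparing model likelihoods for plausible/implausible minimal pairs.
        # Assumes HuggingFace `transformers` and `gpt2`.
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")
        model.eval()

        def sentence_logprob(sentence):
            """Sum of token log-probabilities under the causal LM."""
            ids = tokenizer(sentence, return_tensors="pt").input_ids
            with torch.no_grad():
                logits = model(ids).logits
            log_probs = logits[0, :-1].log_softmax(dim=-1)
            targets = ids[0, 1:]
            return log_probs[torch.arange(targets.shape[0]), targets].sum().item()

        pairs = [
            ("The teacher bought the laptop.", "The laptop bought the teacher."),  # possible vs. impossible
            ("The nanny tutored the boy.", "The boy tutored the nanny."),          # likely vs. unlikely
        ]
        for plausible, implausible in pairs:
            lp, li = sentence_logprob(plausible), sentence_logprob(implausible)
            print(f"{lp:8.2f}  {li:8.2f}  prefers plausible: {lp > li}")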

    Assessing vulnerability for climate adaptation


    Evaluating and improving lexical language understanding in neural machine translation

    Lexical understanding is an inalienable component of the translation process. In order to correctly map the meaning of a linguistic unit to the appropriate target language expression, the meaning of its constituent words has first to be identified and disambiguated, followed by the application of compositional operations. This thesis examines the competency of contemporary neural machine translation (NMT) models on two core aspects of lexical understanding – word sense disambiguation (WSD) and coreference resolution (CoR), both of which are well-established and much-studied natural language processing (NLP) tasks. Certain linguistic properties that are under-specified in a source language (e.g. the grammatical gender of a noun in English) may need to be stated explicitly in the chosen target language (e.g. German). Doing so correctly requires the accurate resolution of the associated ambiguities. While recent modeling advances appear to suggest that both WSD and CoR are largely solved challenges in machine translation, the work conducted within the scope of this thesis demonstrates that this is not yet the case. In particular, we show that NMT systems are prone to relying on surface-level heuristics and data biases to guide their lexical disambiguation decisions, rather than engaging in deep language understanding by correctly recognizing and leveraging contextual disambiguation triggers. As part of our investigation, we introduce a novel methodology for predicting the WSD errors a translation model is likely to make and utilize this knowledge to craft adversarial attacks with the aim of eliciting disambiguation errors in model translations. Additionally, we create a set of challenging CoR benchmarks that uncover the inability of translation systems to identify referents of pronouns in contexts that presuppose commonsense reasoning, caused by their pathological over-reliance on data biases. At the same time, we develop initial solutions for the identified model deficiencies. As such, we show that fine-tuning on de-biased data and modifying the learning objective of a model can significantly improve disambiguation performance by counteracting the harmful impact of data biases. We furthermore propose a novel extension to the popular transformer architecture that is found to strengthen its WSD capabilities and robustness to adversarial WSD attacks by facilitating the accessibility of lexical features across all layers of the model and increasing the extent to which contextual information is encapsulated within its latent representations. Despite the improvements to WSD and CoR achieved in this way, both tasks remain far from solved, posing a veritable challenge for the current generation of NMT models, as well as for the large language models that have risen to prominence within NLP in recent years.
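
    One way to probe the coreference behaviour described above is a contrastive evaluation: force-decode a correct and an incorrect reference translation of the same ambiguous pronoun and compare the scores the NMT model assigns to them. The sketch below assumes HuggingFace transformers with the Helsinki-NLP/opus-mt-en-de Marian model and an invented sentence pair; the thesis's actual benchmarks control contexts and data biases far more carefully.

        # Minimal sketch of a contrastive pronoun-translation check.
        # Assumes HuggingFace `transformers` and the Helsinki-NLP/opus-mt-en-de model;
        # the sentence pair is illustrative only.
        import torch
        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

        name = "Helsinki-NLP/opus-mt-en-de"
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSeq2SeqLM.from_pretrained(name)
        model.eval()

        def forced_score(source, target):
            """Negative per-token cross-entropy of `target` given `source` (higher = preferred)."""
            enc = tokenizer(source, return_tensors="pt")
            labels = tokenizer(text_target=target, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(**enc, labels=labels).loss
            return -loss.item()

        # "it" refers to the suitcase; a commonsense-blind system may prefer the
        # grammatically plausible but referentially wrong German pronoun.
        source = "The trophy does not fit into the suitcase because it is too small."
        correct = "Die Trophäe passt nicht in den Koffer, weil er zu klein ist."    # er -> Koffer
        incorrect = "Die Trophäe passt nicht in den Koffer, weil sie zu klein ist."  # sie -> Trophäe

        print("correct:  ", forced_score(source, correct))
        print("incorrect:", forced_score(source, incorrect))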

    Survey on Sociodemographic Bias in Natural Language Processing

    Deep neural networks often learn unintended biases during training, which might have harmful effects when deployed in real-world settings. This paper surveys 209 papers on bias in NLP models, most of which address sociodemographic bias. To better understand the distinction between bias and real-world harm, we turn to ideas from psychology and behavioral economics to propose a definition of sociodemographic bias. We identify three main categories of NLP bias research: types of bias, quantifying bias, and debiasing. We conclude that current approaches to quantifying bias face reliability issues, that many of the bias metrics do not relate to real-world biases, and that current debiasing techniques are superficial and hide bias rather than removing it. Finally, we provide recommendations for future work.
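
    To make "quantifying bias" concrete, the sketch below shows one generic style of metric discussed in this literature: a counterfactual template probe that compares the probabilities a masked language model assigns to different demographic terms in the same slot. The model, template, and term pair are illustrative assumptions, not taken from the surveyed papers.

        # Minimal sketch of a counterfactual template probe for masked-LM bias.
        # Assumes HuggingFace `transformers` and `bert-base-uncased`; the template and
        # terms are illustrative only.
        import torch
        from transformers import AutoTokenizer, AutoModelForMaskedLM

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
        model.eval()

        def mask_prob(template, term):
            """P(term) at the [MASK] slot of the template."""
            inputs = tokenizer(template, return_tensors="pt")
            mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
            with torch.no_grad():
                probs = model(**inputs).logits[0, mask_pos].softmax(dim=-1)
            return probs[tokenizer.convert_tokens_to_ids(term)].item()

        template = "The [MASK] worked as a nurse."
        for term in ("woman", "man"):
            print(term, mask_prob(template, term))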

    Improving BERT with Self-Supervised Attention

    One of the most popular paradigms for applying large pre-trained NLP models such as BERT is to fine-tune them on a smaller dataset. However, one challenge remains: the fine-tuned model often overfits on small datasets. A symptom of this phenomenon is that irrelevant or misleading words in the sentence, which are easy for human beings to understand, can substantially degrade the performance of these fine-tuned BERT models. In this paper, we propose a novel technique, called Self-Supervised Attention (SSA), to help address this generalization challenge. Specifically, SSA automatically generates weak, token-level attention labels iteratively by probing the fine-tuned model from the previous iteration. We investigate two different ways of integrating SSA into BERT and propose a hybrid approach to combine their benefits. Empirically, on a variety of public datasets, we demonstrate significant performance improvements using our SSA-enhanced BERT model.
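
    The probing step can be pictured as follows: mask each token in turn, measure how much the fine-tuned classifier's confidence in its original prediction drops, and treat large drops as weak "important token" labels. The sketch below assumes HuggingFace transformers and a publicly available fine-tuned sentiment model; it illustrates the idea of self-supervised attention labels rather than the authors' exact SSA procedure or its integration into BERT.

        # Minimal sketch: weak token-level importance labels from a fine-tuned classifier.
        # Assumes HuggingFace `transformers` and distilbert-base-uncased-finetuned-sst-2-english;
        # this is an illustration of the probing idea, not the paper's SSA algorithm.
        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        name = "distilbert-base-uncased-finetuned-sst-2-english"
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name)
        model.eval()

        def weak_attention_labels(sentence):
            """1 for tokens whose masking noticeably lowers the predicted class probability."""
            enc = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                base = model(**enc).logits.softmax(dim=-1)
            pred = base.argmax(dim=-1).item()
            tokens = tokenizer.convert_ids_to_tokens(enc.input_ids[0])
            labels = []
            for i, tok in enumerate(tokens):
                if tok in tokenizer.all_special_tokens:
                    labels.append((tok, 0))
                    continue
                ids = enc.input_ids.clone()
                ids[0, i] = tokenizer.mask_token_id
                with torch.no_grad():
                    probs = model(input_ids=ids, attention_mask=enc.attention_mask).logits.softmax(dim=-1)
                drop = (base[0, pred] - probs[0, pred]).item()
                labels.append((tok, 1 if drop > 0.1 else 0))  # 0.1 threshold is arbitrary
            return labels

        print(weak_attention_labels("The movie was surprisingly good despite the slow start."))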