23 research outputs found

    Entity Tracking in Language Models

    Keeping track of how the states of entities change as a text or dialog unfolds is a key prerequisite to discourse understanding. Yet, there have been few systematic investigations into the ability of large language models (LLMs) to track discourse entities. In this work, we present a task probing to what extent a language model can infer the final state of an entity given an English description of the initial state and a series of state-changing operations. We use this task to first investigate whether Flan-T5, GPT-3, and GPT-3.5 can track the state of entities, and find that only GPT-3.5 models, which have been pretrained on large amounts of code, exhibit this ability. We then investigate whether smaller models pretrained primarily on text can learn to track entities, through finetuning T5 on several training/evaluation splits. While performance degrades for more complex splits, we find that even when evaluated on a different set of entities from training or on longer operation sequences, a finetuned model can perform non-trivial entity tracking. Taken together, these results suggest that language models can learn to track entities, but that pretraining on text corpora alone does not make this capacity surface. Comment: ACL 2023 Camera-ready.
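    As a rough illustration of the probing task described above, the sketch below builds an English description of an initial state plus a sequence of state-changing operations and computes the ground-truth final state to compare against a model's answer. The box names, operations, and prompt wording are hypothetical, not the paper's released data or exact prompt format.

    # Minimal sketch of the entity-tracking probe (hypothetical boxes, operations, and wording).
    initial_state = {"Box 1": ["the apple"], "Box 2": ["the key", "the coin"], "Box 3": []}
    operations = [
        ("move", "the key", "Box 2", "Box 3"),   # move the key from Box 2 to Box 3
        ("remove", "the apple", "Box 1", None),  # remove the apple from Box 1
    ]

    def apply_operations(state, ops):
        """Replay each operation to obtain the ground-truth final state."""
        state = {box: list(items) for box, items in state.items()}
        for op, item, src, dst in ops:
            state[src].remove(item)
            if op == "move":
                state[dst].append(item)
        return state

    def build_prompt(state, ops, query_box):
        """Render the initial state and operations as the English description a model sees."""
        lines = [f"{box} contains {', '.join(items) if items else 'nothing'}."
                 for box, items in state.items()]
        for op, item, src, dst in ops:
            lines.append(f"Move {item} from {src} to {dst}." if op == "move"
                         else f"Remove {item} from {src}.")
        lines.append(f"What does {query_box} contain now?")
        return "\n".join(lines)

    prompt = build_prompt(initial_state, operations, "Box 3")
    expected = apply_operations(initial_state, operations)["Box 3"]
    print(prompt)
    print("Expected answer:", expected)  # ['the key']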

    Abstraction via exemplars? A representational case study on lexical category inference in BERT

    Exemplar-based accounts are often considered to be in direct opposition to pure linguistic abstraction in explaining language learners' ability to generalize to novel expressions. However, the recent success of neural network language models on linguistically sensitive tasks suggests that perhaps abstractions can arise via the encoding of exemplars. We provide empirical evidence for this claim by adapting an existing experiment that studies how an LM (BERT) generalizes the usage of novel tokens belonging to lexical categories such as Noun/Verb/Adjective/Adverb from exposure to only a single instance of their usage. We analyze the representational behavior of the novel tokens in these experiments, and find that BERT's capacity to generalize to unseen expressions involving these novel tokens corresponds to the movement of the novel-token representations toward regions of known category exemplars in two-dimensional space. Our results suggest that learners' encoding of exemplars can indeed give rise to abstraction-like behavior. Comment: 2-page abstract, to appear in BUCLD4.
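    A minimal sketch of the kind of representational analysis described above: embed a novel (nonce) token in context, embed known exemplars of two lexical categories, project everything into two dimensions, and check which category's exemplars the novel token lands nearest. The model checkpoint, layer choice, example sentences, and nearest-centroid criterion are illustrative assumptions, not the paper's exact setup.

    # Sketch: compare a novel token's contextual embedding to known category exemplars in 2-D.
    import numpy as np
    import torch
    from sklearn.decomposition import PCA
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentence, word):
        """Mean-pool the final-layer vectors of the subword pieces of `word` in `sentence`."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(word_ids) + 1):   # locate the word's pieces
            if ids[i:i + len(word_ids)] == word_ids:
                return hidden[i:i + len(word_ids)].mean(0).numpy()
        raise ValueError(f"{word!r} not found in {sentence!r}")

    # Known exemplars of two lexical categories (illustrative sentences).
    nouns = [embed("the cat slept on the mat", "cat"),
             embed("a dog chased the ball", "dog")]
    verbs = [embed("the cat slept on the mat", "slept"),
             embed("a dog chased the ball", "chased")]
    # A novel (nonce) token seen once in a noun position.
    novel = embed("the wug slept on the mat", "wug")

    # Project into two dimensions and compare distances to the category centroids.
    points = PCA(n_components=2).fit_transform(np.vstack(nouns + verbs + [novel]))
    noun_centroid, verb_centroid = points[:2].mean(0), points[2:4].mean(0)
    novel_2d = points[-1]
    closer = ("noun" if np.linalg.norm(novel_2d - noun_centroid)
              < np.linalg.norm(novel_2d - verb_centroid) else "verb")
    print("Novel token is closer to the", closer, "exemplars")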

    Entity tracking in language models

    https://doi.org/10.18653/v1/2023.acl-long.213 (published version)

    Compositional Linguistic Generalization in Artificial Neural Networks

    Compositionality---the principle that the meaning of a complex expression is built from the meanings of its parts---is considered a central property of human language. This dissertation focuses on compositional generalization, a key benefit of compositionality that enables the production and comprehension of novel expressions. Specifically, this dissertation develops a test for compositional generalization for sequence-to-sequence artificial neural networks (ANNs). Before doing so, I start by developing a test for grammatical category abstraction: an important precondition to compositional generalization, because category membership determines the applicability of compositional rules. Then, I construct a test for compositional generalization based on human generalization patterns discussed in existing linguistic and developmental studies. The test takes the form of semantic parsing (translation from natural language expressions to semantic representations) where the training and generalization sets have systematic gaps that can be filled by composing known parts. The generalization cases fall into two broad categories, lexical and structural, depending on whether they require generalization to novel combinations of known lexical items and known structures, or generalization to entirely novel structures. The ANNs evaluated on this test exhibit limited degrees of compositional generalization, implying that the inductive biases of the ANNs and of human learners differ substantially. An error analysis reveals that all ANNs tested frequently make generalizations that violate faithfulness constraints (e.g., Emma saw Lina ↝ see'(Emma', Audrey') instead of see'(Emma', Lina')). Adding a glossing task (word-by-word translation)---a task that requires maximally faithful input-output mappings---as an auxiliary objective to the Transformer model (Vaswani et al. 2017) greatly improves generalization, demonstrating that a faithfulness bias can be injected through the auxiliary training approach. However, the improvement is limited to lexical generalization; all models struggle with assigning appropriate semantic representations to novel structures regardless of auxiliary training. This difficulty of structural generalization leaves open questions for both ANN and human learners. I discuss promising directions for improving structural generalization in ANNs, and furthermore propose an artificial language learning study for human subjects, analogous to the tests posed to ANNs, that will lead to a more detailed characterization of the patterns of structural generalization in human learners.
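    As a rough sketch of the auxiliary-glossing setup described above, the example below pairs one sentence both with a semantic-parsing target and with a word-by-word glossing target, so that a single sequence-to-sequence model can be trained on the mixture. The logical-form notation, gloss format, and task prefixes are illustrative assumptions rather than the dissertation's exact data format.

    # Sketch of multi-task data for semantic parsing plus an auxiliary glossing objective.
    sentence = "Emma saw Lina"

    # Main task: semantic parsing (sentence -> semantic representation).
    parse_example = {
        "source": "parse: " + sentence,
        "target": "see ( agent = Emma , theme = Lina )",
    }

    # Auxiliary task: glossing, a word-by-word translation that demands a maximally
    # faithful input-output mapping and so discourages unfaithful outputs such as
    # see(Emma, Audrey) for the sentence above.
    gloss_example = {
        "source": "gloss: " + sentence,
        "target": "Emma ; see ; Lina",
    }

    # One encoder-decoder is trained on both tasks; a training batch interleaves them.
    for example in [parse_example, gloss_example]:
        print(example["source"], "->", example["target"])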

    Inverse scaling can become U-shaped

    https://doi.org/10.18653/v1/2023.emnlp-main.963 (published version)

    Reconstruction probing

    https://doi.org/10.18653/v1/2023.findings-acl.523 (published version)

    (QA)2: Question answering with questionable assumptions

    https://doi.org/10.18653/v1/2023.acl-long.472 (published version)

    SLOG: a structural generalization benchmark for semantic parsing

    https://doi.org/10.18653/v1/2023.emnlp-main.194 (published version)

    LAMBADA: Backward Chaining for Automated Reasoning in Natural Language

    Remarkable progress has been made on automated reasoning with knowledge specified as unstructured natural text, by using the power of large language models (LMs) coupled with methods such as Chain-of-Thought prompting and Selection-Inference. These techniques search for proofs in the forward direction, from axioms to the conclusion, an approach that suffers from a combinatorial explosion of the search space and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e., from the intended conclusion to the set of axioms that support it) is significantly more efficient at proof finding. We import this intuition into the LM setting and develop a Backward Chaining algorithm, which we call LAMBADA, that decomposes reasoning into four sub-modules, each of which can be simply implemented by few-shot prompted LM inference. We show that LAMBADA achieves massive accuracy boosts over state-of-the-art forward reasoning methods on two challenging logical reasoning datasets, particularly when deep and accurate proof chains are required. Comment: 16 pages.
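    A high-level sketch of a backward-chaining loop of the kind described above, with each step stubbed out as a few-shot prompted LM call. The call_lm helper, the prompt strings, and the way the four sub-modules are folded into three illustrative functions are placeholders, not the paper's released implementation.

    # Sketch of backward chaining with LM-implemented sub-modules (placeholder prompts).
    def call_lm(prompt: str) -> str:
        """Placeholder for a few-shot prompted language-model call."""
        raise NotImplementedError

    def fact_check(goal: str, facts: list[str]) -> bool:
        """Ask the LM whether the goal is directly supported by a stated fact."""
        return call_lm(f"Facts: {facts}\nIs the statement '{goal}' directly stated?") == "yes"

    def select_rules(goal: str, rules: list[str]) -> list[str]:
        """Ask the LM which rules have a consequent that could establish the goal."""
        return [r for r in rules if call_lm(f"Could the rule '{r}' conclude '{goal}'?") == "yes"]

    def decompose(goal: str, rule: str) -> list[str]:
        """Ask the LM which subgoals (the rule's antecedents) must hold to apply the rule."""
        return call_lm(f"To prove '{goal}' with the rule '{rule}', what must hold?").split(";")

    def prove(goal: str, facts: list[str], rules: list[str], depth: int = 5) -> bool:
        """Backward chaining: start from the intended conclusion and recurse toward the facts."""
        if fact_check(goal, facts):
            return True
        if depth == 0:
            return False
        for rule in select_rules(goal, rules):
            subgoals = [g.strip() for g in decompose(goal, rule)]
            if all(prove(g, facts, rules, depth - 1) for g in subgoals):
                return True
        return False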