
    In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search

    Since large language models have approached human-level performance on many tasks, it has become increasingly hard for researchers to find tasks that still challenge the models. Failure cases usually come from the long-tail distribution: data to which an oracle language model would assign a probability at the lower end of its distribution. Current methodologies such as prompt engineering or crowdsourcing are insufficient for creating long-tail examples because humans are constrained by cognitive bias. We propose a Logic-Induced-Knowledge-Search (LINK) framework for systematically generating long-tail knowledge statements. Grounded by a symbolic rule, we search for long-tail values for each variable of the rule by first prompting an LLM, then verifying the correctness of the values with a critic, and lastly pushing toward the long-tail distribution with a reranker. With this framework we construct a dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and 50K knowledge statements spanning four domains. Human annotation finds that 84% of the statements in LINT are factually correct. In contrast, ChatGPT and GPT4 struggle to directly generate long-tail statements under the guidance of logic rules, getting only 56% and 78% of their statements correct, respectively. Moreover, their "long-tail" generations in fact fall into the higher-likelihood range, and thus are not truly long-tail. Our findings suggest that LINK is effective for generating data in the long-tail distribution while enforcing quality. LINT can be useful for systematically evaluating LLMs' capabilities in the long-tail distribution. We challenge the models with a simple entailment classification task using samples from LINT, and find that ChatGPT and GPT4's capability to identify incorrect knowledge drops by ~3% in the long-tail distribution compared to the head distribution.
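    The search loop the abstract describes (propose values with an LLM, verify with a critic, rerank toward low likelihood) can be summarized in a few lines. The sketch below is our own illustration under assumed interfaces; `generate`, `critic`, and `likelihood` are hypothetical callables, not the authors' actual API.

```python
# Minimal sketch of a LINK-style generate -> verify -> rerank loop.
# All interfaces here are illustrative assumptions, not the paper's code.
from typing import Callable, List

def link_search(
    rule_prompt: str,
    generate: Callable[[str, int], List[str]],  # LLM proposing candidate values
    critic: Callable[[str], bool],              # verifies factual correctness
    likelihood: Callable[[str], float],         # oracle-style LM probability
    n_candidates: int = 100,
    n_keep: int = 10,
) -> List[str]:
    candidates = generate(rule_prompt, n_candidates)
    verified = [c for c in candidates if critic(c)]
    # Rerank ascending by likelihood so the lowest-probability (most long-tail)
    # statements come first, mirroring the abstract's "reranker" step.
    verified.sort(key=likelihood)
    return verified[:n_keep]
```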

    A Systematic Investigation of Commonsense Knowledge in Large Language Models

    Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge -- a critical component of many NLP applications. We conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of large pre-trained LMs, where we: (i) carefully control for the LMs' ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in performance that arise from factors that are not related to commonsense knowledge. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, using larger models or few-shot evaluation is insufficient to achieve human-level commonsense performance. Comment: Accepted to EMNLP 202
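    One common way to run the kind of zero-shot evaluation described above is to score each multiple-choice option by its LM likelihood, length-normalized as a simple control for surface cues such as option length. The sketch below assumes the Hugging Face transformers API and GPT-2 purely for illustration; it is not the paper's exact protocol.

```python
# Hedged sketch: length-normalized zero-shot multiple-choice scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(context: str, option: str) -> float:
    """Average per-token log-probability of `option` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # next-token predictions
    targets = full_ids[0, 1:]
    # Score only the option's tokens (this assumes the context tokenization is
    # a prefix of the full tokenization, a simplification at BPE boundaries).
    idx = torch.arange(ctx_len - 1, targets.shape[0])
    return logprobs[idx, targets[ctx_len - 1:]].mean().item()

def pick_answer(context: str, options: list[str]) -> str:
    # Mean log-prob normalizes for length: higher wins.
    return max(options, key=lambda o: option_logprob(context, o))

print(pick_answer("A hammer is used to ", ["drive nails.", "boil water."]))
```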

    Critical Role of Leucine-Valine Change in Distinct Low pH Requirements for Membrane Fusion between Two Related Retrovirus Envelopes

    Many viruses use a pH-dependent pathway for fusion with the host cell membrane, the mechanism of which is still poorly understood. Here we report that a subtle leucine (Leu)-valine (Val) change at position 501 in the envelope glycoproteins (Envs) of two related retroviruses, jaagsiekte sheep retrovirus (JSRV) and enzootic nasal tumor virus (ENTV), is responsible for their distinct low-pH requirements for membrane fusion and infection. The Leu and Val residues are predicted to reside within the C-terminal heptad repeat (HR2) region of the JSRV and ENTV Envs, proximal to the hairpin turn of the putative six-helix bundle (6HB). Substitution of the JSRV Leu with a Val blocked Env-mediated membrane fusion at pH 5.0, whereas replacement of the ENTV Val with a Leu rendered the ENTV Env capable of fusing at pH 5.0. The Leu-Val change has no apparent effect on the stability of the native Env but appears to stabilize an intermediate induced by receptor binding. These results are consistent with the existence of at least two metastable conformations of these viral glycoproteins: the native prefusion conformation and a receptor-induced metastable intermediate. Collectively, this work represents an interesting, perhaps unique, example in which a simple Leu-Val change has a critical impact on pH-dependent virus fusion and entry.

    Editing Commonsense Knowledge in GPT

    Memory editing methods for updating encyclopedic knowledge in transformers have received increasing attention for their efficacy, specificity, and generalization advantages. However, it remains unclear whether such methods can be adapted for the more nuanced domain of commonsense knowledge. We propose MEMIT_CSK, an adaptation of MEMIT to edit commonsense mistakes in GPT-2 Large and XL. We extend editing to various token locations and employ a robust layer-selection strategy. Models edited by MEMIT_CSK outperform the fine-tuning baselines by 10.97% and 10.73% F1 on subsets of PEP3k and 20Q. We further propose a novel evaluation dataset, MEMIT-CSK-PROBE, that contains unaffected neighborhood, affected neighborhood, affected paraphrase, and affected reasoning challenges. MEMIT_CSK demonstrates favorable semantic generalization, outperforming the fine-tuning baselines by 13.72% and 5.57% in overall score on MEMIT-CSK-PROBE. These results suggest a compelling future direction: incorporating context-specific user feedback concerning commonsense in GPT by direct model editing, rectifying and customizing model behavior via human-in-the-loop systems. Comment: Code and data are available at https://github.com/anshitag/memit_cs
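    To make the editing idea concrete, the toy sketch below shows the rank-one weight update at the heart of MEMIT-style locate-and-edit methods: nudge a projection matrix so a chosen key vector maps to a new value. This is our simplified illustration, not the MEMIT_CSK algorithm itself, which batches many edits across layers and uses covariance statistics over keys.

```python
# Toy rank-one edit: make W map key k to a new value v_star, changing W
# only along the k direction. Not the paper's method, just the core idea.
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v_star: torch.Tensor) -> torch.Tensor:
    """Return W' with W' @ k == v_star while leaving inputs orthogonal to k unchanged."""
    v_old = W @ k
    return W + torch.outer(v_star - v_old, k) / (k @ k)

W = torch.randn(8, 4)       # stand-in for an MLP projection
k = torch.randn(4)          # "key": hidden state at the edited token position
v_star = torch.randn(8)     # desired output ("value") for that key
W_edited = rank_one_edit(W, k, v_star)
assert torch.allclose(W_edited @ k, v_star, atol=1e-4)
```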

    UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations

    Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexpected outcome, this task requires reasoning abductively to generate a natural language explanation that makes the unexpected outcome more likely in the context. To this end, we curate and release a new English-language corpus called UNcommonsense. We characterize the differences between the performance of human explainers and the best-performing large language models, finding that model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity. Finally, we experiment with several online imitation learning algorithms to train open and accessible language models on this task. Compared with the vanilla supervised fine-tuning approach, these methods consistently reduce lose rates on both common and uncommonsense abductive reasoning, as judged by human evaluators.
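    As a concrete picture of the task format, a minimal abductive prompt pairs a context with an unexpected outcome and asks for an explanation that bridges the two. The template below is our own assumption for illustration, not the corpus's actual wording.

```python
# Hypothetical prompt template for uncommonsense abductive reasoning.
def abductive_prompt(context: str, outcome: str) -> str:
    return (
        f"Context: {context}\n"
        f"Unexpected outcome: {outcome}\n"
        "Write a brief explanation that makes this outcome likely given the context:\n"
    )

print(abductive_prompt(
    "Maya trained for months to run her first marathon.",
    "She finished the race in last place, smiling the whole way.",
))
```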

    Faith and Fate: Limits of Transformers on Compositionality

    Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet, these models simultaneously show failures on surprisingly trivial problems. This raises the question: are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify Transformers, we investigate the limits of these models across three representative compositional tasks -- multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing those steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that Transformers solve compositional tasks by reducing multi-step compositional reasoning to linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how Transformers' performance will rapidly decay with increased task complexity. Comment: 10 pages + appendix (21 pages)
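    The computation-graph framing can be made concrete for multi-digit multiplication: each single-digit product, shift, and addition becomes a node, and graph size tracks task complexity. The construction below is a simplified sketch of ours, not the paper's exact formulation.

```python
# Build a toy computation DAG for grade-school multiplication and count its
# primitive steps, a simple proxy for the complexity measure described above.
from dataclasses import dataclass, field

@dataclass
class Graph:
    nodes: list = field(default_factory=list)  # (op, inputs) tuples

    def add(self, op, *inputs):
        self.nodes.append((op, inputs))
        return len(self.nodes) - 1             # node id

def multiplication_graph(x: int, y: int) -> Graph:
    g = Graph()
    xd = [int(d) for d in str(x)]
    yd = [int(d) for d in str(y)]
    partials = []
    for i, a in enumerate(reversed(xd)):
        for j, b in enumerate(reversed(yd)):
            n = g.add("mul_digits", a, b)              # one single-digit product
            partials.append(g.add("shift", n, i + j))  # scale by 10**(i+j)
    acc = partials[0]
    for p in partials[1:]:                             # fold partials into a sum
        acc = g.add("add", acc, p)
    return g

g = multiplication_graph(347, 89)
print(f"{len(g.nodes)} primitive steps")  # grows ~ len(x) * len(y)
```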