In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search
Since large language models have approached human-level performance on many
tasks, it has become increasingly hard for researchers to find tasks that are
still challenging for the models. Failure cases usually come from the long-tail
distribution: data to which an oracle language model would assign a probability
at the lower end of its distribution. Current methodologies such as prompt
engineering and crowdsourcing are insufficient for creating long-tail examples
because humans are constrained by cognitive biases. We propose a
Logic-Induced-Knowledge-Search (LINK) framework for systematically generating
long-tail knowledge statements. Grounded by a symbolic rule, we search for
long-tail values for each variable of the rule by first prompting an LLM, then
verifying the correctness of the values with a critic, and lastly pushing for
the long-tail distribution with a reranker. With this framework we construct a
dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and
50K knowledge statements spanning four domains. Human annotations find
that 84% of the statements in LINT are factually correct. In contrast, ChatGPT
and GPT4 struggle with directly generating long-tail statements under the
guidance of logic rules, getting only 56% and 78% of their statements
correct, respectively. Moreover, their "long-tail" generations in fact fall into the higher
likelihood range, and thus are not really long-tail. Our findings suggest that
LINK is effective for generating data in the long-tail distribution while
enforcing quality. LINT can be useful for systematically evaluating LLMs'
capabilities in the long-tail distribution. We challenge the models with a
simple entailment classification task using samples from LINT. We find that
ChatGPT's and GPT4's capability to identify incorrect knowledge drops by ~3% in
the long-tail distribution compared to the head distribution.
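To make the search procedure concrete, the sketch below implements a generic
generate-verify-rerank loop of the kind LINK describes. The helper names
(propose_values, critic_accepts, statement_logprob) are hypothetical stand-ins
for the paper's LLM, critic, and reranker calls, and using raw log-probability
as the long-tail signal is an assumption, not the paper's exact implementation.

```python
# Illustrative sketch of a LINK-style generate-verify-rerank loop.
# All three helpers are hypothetical stand-ins for LLM calls, stubbed here.

def propose_values(rule: str, variable: str, n: int = 20) -> list[str]:
    """Prompt an LLM for candidate groundings of one rule variable (stub)."""
    return [f"{variable}_candidate_{i}" for i in range(n)]

def critic_accepts(rule: str, binding: dict[str, str]) -> bool:
    """Ask a critic model whether the grounded statement is correct (stub)."""
    return True

def statement_logprob(statement: str) -> float:
    """Log-probability of the statement under a scoring LM (stub)."""
    return -0.1 * len(statement)  # placeholder signal

def link_search(rule: str, variables: list[str], keep: int = 5):
    bindings = [{}]
    for var in variables:
        expanded = []
        for b in bindings:
            for value in propose_values(rule, var):
                cand = {**b, var: value}
                if critic_accepts(rule, cand):   # verification step
                    expanded.append(cand)
        # Rerank toward the long tail: keep the lowest-likelihood bindings.
        expanded.sort(key=lambda b: statement_logprob(str(b)))
        bindings = expanded[:keep]
    return bindings

print(link_search("HasPart(x, y) -> UsedFor(x, z)", ["x", "y", "z"]))
```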
A Systematic Investigation of Commonsense Knowledge in Large Language Models
Language models (LMs) trained on large amounts of data have shown impressive
performance on many NLP tasks in zero-shot and few-shot settings. Here we
aim to better understand the extent to which such models learn commonsense
knowledge -- a critical component of many NLP applications. We conduct a
systematic and rigorous zero-shot and few-shot commonsense evaluation of large
pre-trained LMs, where we: (i) carefully control for the LMs' ability to
exploit potential surface cues and annotation artefacts, and (ii) account for
variations in performance that arise from factors that are not related to
commonsense knowledge. Our findings highlight the limitations of pre-trained
LMs in acquiring commonsense knowledge without task-specific supervision;
furthermore, neither using larger models nor few-shot evaluation is sufficient
to achieve human-level commonsense performance.
Comment: Accepted to EMNLP 202
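For illustration, a common zero-shot recipe behind such evaluations scores each
answer choice by its log-likelihood under the LM and picks the highest-scoring
one; averaging over tokens is one simple control for length-based surface cues.
The snippet below is a generic sketch using Hugging Face transformers and
GPT-2, not the paper's exact protocol.

```python
# Generic zero-shot multiple-choice scoring with an LM (illustrative recipe,
# not the paper's exact protocol).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def choice_score(prompt: str, choice: str) -> float:
    """Length-normalized log-likelihood of `choice` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the logits predicts token i + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    choice_ids = full_ids[0, prompt_len:]
    token_lps = logprobs[prompt_len - 1:, :].gather(
        1, choice_ids.unsqueeze(1)).squeeze(1)
    return token_lps.mean().item()  # mean, not sum, to control for length

question = "You can use a fork to"
choices = [" eat spaghetti.", " unlock a door."]
print(max(choices, key=lambda c: choice_score(question, c)))
```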
Critical Role of Leucine-Valine Change in Distinct Low pH Requirements for Membrane Fusion between Two Related Retrovirus Envelopes
Many viruses use a pH-dependent pathway for fusion with the host cell membrane, the mechanism of which is still poorly understood. Here we report that a subtle leucine (Leu)-valine (Val) change at position 501 in the envelope glycoproteins (Envs) of two related retroviruses, jaagsiekte sheep retrovirus (JSRV) and enzootic nasal tumor virus (ENTV), is responsible for their distinct low pH requirements for membrane fusion and infection. The Leu and Val residues are predicted to reside within the C-terminal heptad repeat (HR2) region of the JSRV and ENTV Envs, particularly proximal to the hairpin turn of the putative six-helix bundle (6HB). Substitution of the JSRV Leu with a Val blocked Env-mediated membrane fusion at pH 5.0, whereas replacement of the ENTV Val with a Leu rendered the ENTV Env capable of fusing at pH 5.0. The Leu-Val change has no apparent effect on the stability of the native Env but appears to stabilize an intermediate induced by receptor binding. These results are consistent with the existence of at least two metastable conformations of these viral glycoproteins: the native prefusion conformation and a receptor-induced metastable intermediate. Collectively, this work represents an interesting, perhaps unique, example whereby a simple Leu-Val change has a critical impact on pH-dependent virus fusion and entry.
Editing Commonsense Knowledge in GPT
Memory editing methods for updating encyclopedic knowledge in transformers
have received increasing attention for their efficacy, specificity, and
generalization advantages. However, it remains unclear if such methods can be
adapted for the more nuanced domain of commonsense knowledge. We propose
MEMIT_CSK, an adaptation of MEMIT to edit commonsense mistakes in GPT-2
Large and XL. We extend editing to various token locations and employ a robust
layer selection strategy. Models edited by MEMIT_CSK outperform the
fine-tuning baselines by 10.97% and 10.73% F1 scores on subsets of PEP3k and
20Q. We further propose a novel evaluation dataset, MEMIT-CSK-PROBE, that
contains unaffected neighborhood, affected neighborhood, affected paraphrase,
and affected reasoning challenges. MEMIT_CSK demonstrates favorable
semantic generalization, outperforming fine-tuning baselines by 13.72% and
5.57% overall scores on MEMIT-CSK-PROBE. These results suggest a compelling
future direction of incorporating context-specific user feedback concerning
commonsense in GPT by direct model editing, rectifying and customizing model
behaviors via human-in-the-loop systems.
Comment: Code and data are available at https://github.com/anshitag/memit_cs
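For intuition about how such editing works, methods in the ROME/MEMIT family
modify an MLP weight matrix with a low-rank update so that a chosen key vector
(the subject representation) maps to a new value vector (the edited fact). The
toy below applies a single rank-one edit assuming an identity key covariance;
the actual methods estimate this covariance from data, select layers causally,
and batch many edits, so treat it purely as a sketch.

```python
# Toy rank-one weight edit in the spirit of ROME/MEMIT (illustrative only:
# real methods use an estimated key covariance and edit multiple layers).
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val = 8, 8
W = rng.normal(size=(d_val, d_key))   # an MLP projection inside the LM

k = rng.normal(size=d_key)            # key: representation of the edited subject
v_target = rng.normal(size=d_val)     # value: representation of the new fact

# Rank-one update (identity covariance): afterwards, W_new @ k == v_target.
W_new = W + np.outer(v_target - W @ k, k) / (k @ k)

print(np.allclose(W_new @ k, v_target))        # True: edited fact is stored
k_other = rng.normal(size=d_key)
# Effect on an unrelated key scales with its overlap with k.
print(np.linalg.norm((W_new - W) @ k_other))
```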
UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations
Language technologies that accurately model the dynamics of events must
perform commonsense reasoning. Existing work evaluating commonsense reasoning
focuses on making inferences about common, everyday situations. To instead
investigate the ability to model unusual, unexpected, and unlikely situations,
we explore the task of uncommonsense abductive reasoning. Given a piece of
context with an unexpected outcome, this task requires reasoning abductively to
generate a natural language explanation that makes the unexpected outcome more
likely in the context. To this end, we curate and release a new English
language corpus called UNcommonsense. We characterize the differences between
the performance of human explainers and the best performing large language
models, finding that model-enhanced human-written explanations achieve the
highest quality by trading off between specificity and diversity. Finally, we
experiment with several online imitation learning algorithms to train open and
accessible language models on this task. When compared with the vanilla
supervised fine-tuning approach, these methods consistently reduce lose rates
on both common and uncommonsense abductive reasoning, as judged by human
evaluators.
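To make the task format concrete, an uncommonsense abductive instance pairs a
context with an unlikely outcome and asks for an explanation bridging the two.
The template below is an illustrative assumption, not the corpus's actual
wording.

```python
# Illustrative prompt format for uncommonsense abductive reasoning
# (the wording is an assumption, not the UNcommonsense corpus's template).
TEMPLATE = (
    "Context: {context}\n"
    "Unexpected outcome: {outcome}\n"
    "Write an explanation that makes the outcome plausible given the context:\n"
)

example = TEMPLATE.format(
    context="Maria studied all week for her driving test.",
    outcome="She arrived at the test center and immediately drove home.",
)
print(example)
```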
Faith and Fate: Limits of Transformers on Compositionality
Transformer large language models (LLMs) have sparked admiration for their
exceptional performance on tasks that demand intricate multi-step reasoning.
Yet, these models simultaneously show failures on surprisingly trivial
problems. This raises the question: Are these errors incidental, or do they
signal more substantial limitations? In an attempt to demystify Transformers,
we investigate the limits of these models across three representative
compositional tasks -- multi-digit multiplication, logic grid puzzles, and a
classic dynamic programming problem. These tasks require breaking problems down
into sub-steps and synthesizing these steps into a precise answer. We formulate
compositional tasks as computation graphs to systematically quantify the level
of complexity, and break down reasoning steps into intermediate sub-procedures.
Our empirical findings suggest that Transformers solve compositional tasks by
reducing multi-step compositional reasoning into linearized subgraph matching,
without necessarily developing systematic problem-solving skills. To round off
our empirical study, we provide theoretical arguments on abstract multi-step
reasoning problems that highlight how Transformers' performance will rapidly
decay with increased task complexity.
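As an illustration of casting a compositional task as a computation graph, the
sketch below builds such a graph for grade-school multi-digit multiplication
(digit products, column sums, and carries as nodes) and reports node count and
depth as complexity measures. It is a generic reconstruction of the idea, not
the paper's code.

```python
# Sketch: multi-digit multiplication as a computation graph, with graph
# size and depth as complexity measures (generic reconstruction of the idea).
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.depth = 1 + max((p.depth for p in parents), default=0)

def multiply_graph(x: int, y: int):
    xs = [Node(int(d)) for d in str(x)[::-1]]    # input digits, least first
    ys = [Node(int(d)) for d in str(y)[::-1]]
    nodes = list(xs) + list(ys)
    partials = {}                                # column index -> product nodes
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            n = Node(a.value * b.value, (a, b))  # one digit product per pair
            nodes.append(n)
            partials.setdefault(i + j, []).append(n)
    carry = Node(0)
    for col in sorted(partials):
        s = Node(sum(n.value for n in partials[col]) + carry.value,
                 tuple(partials[col]) + (carry,))   # column sum + carry-in
        digit = Node(s.value % 10, (s,))
        carry = Node(s.value // 10, (s,))
        nodes += [s, digit, carry]
    # (final carry digits omitted for brevity)
    return nodes, max(n.depth for n in nodes)

nodes, depth = multiply_graph(37, 48)
print(f"graph size: {len(nodes)} nodes, depth: {depth}")
```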