Can Language Models perform Abductive Commonsense Reasoning?
Abductive reasoning is the task of inferring the most plausible hypothesis
given a set of observations. In the literature, the community has approached
this challenge by classifying or generating a likely hypothesis that does
not contradict a past observation and a future observation. Some of the most
well-known benchmarks that tackle this problem are aNLI and aNLG (pronounced
alpha-NLI and alpha-NLG). In this report, I review some of the
methodologies that have been applied to this challenge, re-implement the
baseline models, and analyze some of the weaknesses of current approaches.
The code and the re-implemented results are available at this link.
Comment: 6 page
The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code
Causal reasoning, the ability to identify cause-and-effect relationships, is
crucial in human thinking. Although large language models (LLMs) succeed in
many NLP tasks, it is still challenging for them to conduct complex causal
reasoning such as abductive reasoning and counterfactual reasoning. Given that
programming code may express causal relations more often and more explicitly,
with conditional statements like "if", we want to explore whether Code-LLMs
acquire better causal reasoning abilities. Our experiments show that, compared
to text-only LLMs, Code-LLMs with code prompts are significantly better at
causal reasoning. We further intervene on the prompts from different aspects
and discover that the programming structure is crucial in code prompt design,
while Code-LLMs are robust to format perturbations.
Comment: Findings of ACL 2023. Code and data are available at
https://github.com/xxxiaol/magic-i
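To make the idea of a "code prompt" concrete, here is a minimal sketch that renders an abductive-reasoning instance as Python-like code with conditional statements. The layout, the function name `build_code_prompt`, and the placeholder calls `hypothesis_N()` and `ending()` are assumptions made for illustration; they are not the prompt format used in the paper.

```python
from typing import List


def build_code_prompt(premise: str, ending: str, hypotheses: List[str]) -> str:
    """Render one instance as a code-style prompt (hypothetical layout).

    Each candidate hypothesis becomes the condition of an if/elif branch
    whose body leads to the observed ending.
    """
    lines = [
        "def main():",
        f"    # Premise: {premise}",
        f"    # Ending: {ending}",
    ]
    for i, hyp in enumerate(hypotheses):
        keyword = "if" if i == 0 else "elif"
        # The hypothesis text is kept as a comment next to its branch.
        lines.append(f"    {keyword} hypothesis_{i + 1}():  # {hyp}")
        lines.append("        ending()")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_code_prompt(
        "Tom left his umbrella at home.",
        "Tom arrived at work soaked.",
        ["It rained on his commute.", "Tom won the lottery."],
    ))
```

The resulting string would be passed to a code model as context; how the model's completion is scored is left open here.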
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
Leaderboards have eased model development for many NLP datasets by
standardizing their evaluation and delegating it to an independent external
repository. Their adoption, however, is so far limited to tasks that can be
reliably evaluated in an automatic manner. This work introduces GENIE, an
extensible human evaluation leaderboard, which brings the ease of leaderboards
to text generation tasks. GENIE automatically posts leaderboard submissions to
crowdsourcing platforms asking human annotators to evaluate them on various
axes (e.g., correctness, conciseness, fluency) and compares their answers to
various automatic metrics. We introduce several datasets in English to GENIE,
representing four core challenges in text generation: machine translation,
summarization, commonsense reasoning, and machine comprehension. We provide
formal granular evaluation metrics and identify areas for future research. We
make GENIE publicly available and hope that it will spur progress in language
generation models as well as their automatic and manual evaluation.
Toward Unified Controllable Text Generation via Regular Expression Instruction
Controllable text generation is a fundamental aspect of natural language
generation, with numerous methods proposed for different constraint types.
However, these approaches often require significant architectural or decoding
modifications, making them challenging to apply to additional constraints or
resolve different constraint combinations. To address this, our paper
introduces Regular Expression Instruction (REI), which utilizes an
instruction-based mechanism to fully exploit regular expressions' advantages to
uniformly model diverse constraints. Specifically, our REI supports all popular
fine-grained controllable generation constraints, i.e., lexical, positional,
and length, as well as their complex combinations, via regular expression-style
instructions. Our method only requires fine-tuning on medium-scale language
models or few-shot, in-context learning on large language models, and requires
no further adjustment when applied to various constraint combinations.
Experiments demonstrate that our straightforward approach yields high success
rates and adaptability to various constraints while maintaining competitiveness
in automatic metrics and outperforming most previous baselines.
Comment: Accepted on IJCNLP-AACL 202
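As an illustration of how a single regular expression can express lexical, positional, and length constraints together, the sketch below builds a pattern with Python's `re` module and checks candidate outputs against it. The function `constraint_pattern` and its parameters are hypothetical; this shows the general idea of regex-encoded constraints, not the instruction syntax defined by REI.

```python
import re
from typing import Optional, Sequence


def constraint_pattern(
    keywords: Sequence[str] = (),
    prefix: Optional[str] = None,
    max_words: Optional[int] = None,
) -> "re.Pattern":
    """Combine three constraint types into one compiled pattern (illustrative)."""
    parts = ["^"]
    # Lexical: one lookahead per keyword, so keywords may appear in any order.
    for kw in keywords:
        parts.append(rf"(?=.*\b{re.escape(kw)}\b)")
    # Length: at most max_words whitespace-separated tokens.
    if max_words is not None:
        parts.append(rf"(?=\s*\S+(?:\s+\S+){{0,{max_words - 1}}}\s*$)")
    # Positional: the text must start with the given prefix.
    if prefix is not None:
        parts.append(re.escape(prefix))
    parts.append(".*$")
    return re.compile("".join(parts), flags=re.DOTALL)


def satisfies(text: str, pattern: "re.Pattern") -> bool:
    """Return True if the text meets every constraint in the pattern."""
    return pattern.match(text) is not None


if __name__ == "__main__":
    pat = constraint_pattern(keywords=["dog", "park"], prefix="The", max_words=8)
    for candidate in ["The dog ran through the park.", "A dog ran through the park."]:
        print(candidate, satisfies(candidate, pat))
```

Because each constraint is an independent lookahead, new combinations need no extra checking code, which mirrors the adaptability the abstract describes.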
Embarrassingly Simple Performance Prediction for Abductive Natural Language Inference
The task of abductive natural language inference (αNLI), to decide
which hypothesis is the more likely explanation for a set of observations, is a
particularly difficult type of NLI. Instead of just determining a causal
relationship, it requires common sense to also evaluate how reasonable an
explanation is. All recent competitive systems build on top of contextualized
representations and make use of transformer architectures for learning an NLI
model. When somebody is faced with a particular NLI task, they need to select
the best model that is available. This is a time-consuming and resource-intensive
endeavour. To solve this practical problem, we propose a simple method for
predicting the performance without actually fine-tuning the model. We do this
by comparing how well the pre-trained models perform on the αNLI task when
just comparing sentence embeddings with cosine similarity against the
performance that is achieved when training a classifier on top of these
embeddings. We show that the accuracy of the cosine similarity approach
correlates strongly with the accuracy of the classification approach with a
Pearson correlation coefficient of 0.65. Since the similarity computation is
orders of magnitude faster to compute on a given dataset (less than a minute
vs. hours), our method can lead to significant time savings in the process of
model selection.
Comment: accepted at NAACL 202
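The two ingredients of this approach, a cosine-similarity decision rule and a Pearson correlation between two lists of accuracies, can be sketched in a few lines of standard-library Python. The sketch assumes that the chosen hypothesis is the one whose embedding is most similar to the embedding of the observations. That decision rule, the function names, and the toy vectors are all assumptions for illustration; real sentence embeddings would come from a pre-trained encoder.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_hypothesis(obs_vec: Sequence[float],
                    hyp_vecs: Sequence[Sequence[float]]) -> int:
    """Index of the hypothesis embedding closest to the observation embedding."""
    sims = [cosine(obs_vec, h) for h in hyp_vecs]
    return max(range(len(sims)), key=sims.__getitem__)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Toy 2-d vectors standing in for sentence embeddings (not real data).
    observation = [1.0, 0.0]
    hypotheses = [[0.9, 0.1], [0.0, 1.0]]
    print(pick_hypothesis(observation, hypotheses))
```

In a model-selection setting, `pearson` would be applied to per-model accuracies from the similarity rule and from the trained classifier; no such accuracies are given in the abstract beyond the reported coefficient, so none are reproduced here.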