38 research outputs found
The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated
Pre-trained language models trained on large-scale data have learned serious
levels of social biases. Consequently, various methods have been proposed to
debias pre-trained models. Debiasing methods need to mitigate only
discriminatory bias information from the pre-trained models, while retaining
information that is useful for the downstream tasks. In previous research,
whether useful information is retained has been confirmed by the performance of
downstream tasks in debiased pre-trained models. On the other hand, it is not
clear whether these benchmarks consist of data pertaining to social biases and
are appropriate for investigating the impact of debiasing. For example in
gender-related social biases, data containing female words (e.g. ``she, female,
woman''), male words (e.g. ``he, male, man''), and stereotypical words (e.g.
``nurse, doctor, professor'') are considered to be the most affected by
debiasing. If there is not much data containing these words in a benchmark
dataset for a target task, there is the possibility of erroneously evaluating
the effects of debiasing. In this study, we compare the impact of debiasing on
performance across multiple downstream tasks using a wide-range of benchmark
datasets that containing female, male, and stereotypical words. Experiments
show that the effects of debiasing are consistently \emph{underestimated}
across all tasks. Moreover, the effects of debiasing could be reliably
evaluated by separately considering instances containing female, male, and
stereotypical words than all of the instances in a benchmark dataset.Comment: IJCNLP-AACL 202
Neural Natural Language Inference Models Enhanced with External Knowledge
Modeling natural language inference is a very challenging task. With the
availability of large annotated data, it has recently become feasible to train
complex models such as neural-network-based inference models, which have shown
to achieve the state-of-the-art performance. Although there exist relatively
large annotated data, can machines learn all knowledge needed to perform
natural language inference (NLI) from these data? If not, how can
neural-network-based NLI models benefit from external knowledge and how to
build NLI models to leverage it? In this paper, we enrich the state-of-the-art
neural natural language inference models with external knowledge. We
demonstrate that the proposed models improve neural NLI models to achieve the
state-of-the-art performance on the SNLI and MultiNLI datasets.Comment: Accepted by ACL 201
Task-oriented Memory-efficient Pruning-Adapter
The Outstanding performance and growing size of Large Language Models has led
to increased attention in parameter efficient learning. The two predominant
approaches are Adapters and Pruning. Adapters are to freeze the model and give
it a new weight matrix on the side, which can significantly reduce the time and
memory of training, but the cost is that the evaluation and testing will
increase the time and memory consumption. Pruning is to cut off some weight and
re-distribute the remaining weight, which sacrifices the complexity of training
at the cost of extremely high memory and training time, making the cost of
evaluation and testing relatively low. So efficiency of training and inference
can't be obtained in the same time. In this work, we propose a task-oriented
Pruning-Adapter method that achieve a high memory efficiency of training and
memory, and speeds up training time and ensures no significant decrease in
accuracy in GLUE tasks, achieving training and inference efficiency at the same
time
Can Large Language Models Infer and Disagree Like Humans?
Large Language Models (LLMs) have shown stellar achievements in solving a
broad range of tasks. When generating text, it is common to sample tokens from
these models: whether LLMs closely align with the human disagreement
distribution has not been well-studied, especially within the scope of Natural
Language Inference (NLI). In this paper, we evaluate the performance and
alignment of LLM distribution with humans using two different techniques: Monte
Carlo Reconstruction (MCR) and Log Probability Reconstruction (LPR). As a
result, we show LLMs exhibit limited ability in solving NLI tasks and
simultaneously fail to capture human disagreement distribution, raising
concerns about their natural language understanding (NLU) ability and their
representativeness of human users
Improving Input-label Mapping with Demonstration Replay for In-context Learning
In-context learning (ICL) is an emerging capability of large autoregressive
language models where a few input-label demonstrations are appended to the
input to enhance the model's understanding of downstream NLP tasks, without
directly adjusting the model parameters. The effectiveness of ICL can be
attributed to the strong language modeling capabilities of large language
models (LLMs), which enable them to learn the mapping between input and labels
based on in-context demonstrations. Despite achieving promising results, the
causal nature of language modeling in ICL restricts the attention to be
backward only, i.e., a token only attends to its previous tokens, failing to
capture the full input-label information and limiting the model's performance.
In this paper, we propose a novel ICL method called Repeated Demonstration with
Sliding Causal Attention, (RdSca). Specifically, we duplicate later
demonstrations and concatenate them to the front, allowing the model to
`observe' the later information even under the causal restriction. Besides, we
introduce sliding causal attention, which customizes causal attention to avoid
information leakage. Experimental results show that our method significantly
improves the input-label mapping in ICL demonstrations. We also conduct an
in-depth analysis of how to customize the causal attention without training,
which has been an unexplored area in previous research
Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning
Large Language models (LLMs) possess the capability to engage In-context
Learning (ICL) by leveraging a few demonstrations pertaining to a new
downstream task as conditions. However, this particular learning paradigm
suffers from high instability stemming from substantial variances induced by
factors such as the input distribution of selected examples, their ordering,
and prompt formats. In this work, we demonstrate that even when all these
factors are held constant, the random selection of examples still results in
high variance. Consequently, we aim to explore the informative ability of data
examples by quantifying the Information Gain (IG) obtained in prediction after
observing a given example candidate. Then we propose to sample those with
maximum IG. Additionally, we identify the presence of template bias, which can
lead to unfair evaluations of IG during the sampling process. To mitigate this
bias, we introduce Calibration Before Sampling strategy. The experimental
results illustrate that our proposed method can yield an average relative
improvement of 14.3% across six classification tasks using three LLMs.Comment: Accepted to the Findings of EMNLP 202
Transformers with Learnable Activation Functions
Activation functions can have a significant impact on reducing the
topological complexity of input data and therefore improve the performance of
the model. Selecting a suitable activation function is an essential step in
neural model design. However, the choice of activation function is seldom
discussed or explored in Transformer-based language models. Their activation
functions are chosen beforehand and then remain fixed from pre-training to
fine-tuning. As a result, the inductive biases they imposed on models cannot be
adjusted during this long life cycle. Moreover, subsequently developed models
(e.g., RoBERTa, BART, and GPT-3) often follow up prior work (e.g., BERT) to use
the same activation function without justification. In this paper, we
investigate the effectiveness of using Rational Activation Function (RAF), a
learnable activation function, in the Transformer architecture. In contrast to
conventional, predefined activation functions, RAFs can adaptively learn
optimal activation functions during training according to input data. Our
experiments show the RAF-based Transformer (RAFT) achieves a lower validation
perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT
on downstream tasks in low- and full-data settings. Our results show that RAFT
outperforms the counterpart model across the majority of tasks and settings.
For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71
points on average in low-data scenario (where 100 training examples are
available) and by 2.05 points on SQuAD in full-data setting. Analysis of the
shapes of learned RAFs further unveils that they substantially vary between
different layers of the pre-trained model and mostly look very different from
conventional activation functions. RAFT opens a new research direction for
analyzing and interpreting pre-trained models according to the learned
activation functions.Comment: Accepted by EACL2023 finding
Tree Prompting: Efficient Task Adaptation without Fine-Tuning
Prompting language models (LMs) is the main interface for applying them to
new tasks. However, for smaller LMs, prompting provides low accuracy compared
to gradient-based finetuning. Tree Prompting is an approach to prompting which
builds a decision tree of prompts, linking multiple LM calls together to solve
a task. At inference time, each call to the LM is determined by efficiently
routing the outcome of the previous call using the tree. Experiments on
classification datasets show that Tree Prompting improves accuracy over
competing methods and is competitive with fine-tuning. We also show that
variants of Tree Prompting allow inspection of a model's decision-making
process.Comment: Both first authors contributed equally; accepted to EMNLP 202
Can Large Language Models Infer Causation from Correlation?
Causal inference is one of the hallmarks of human intelligence. While the
field of CausalNLP has attracted much interest in the recent years, existing
causal inference datasets in NLP primarily rely on discovering causality from
empirical knowledge (e.g., commonsense knowledge). In this work, we propose the
first benchmark dataset to test the pure causal inference skills of large
language models (LLMs). Specifically, we formulate a novel task Corr2Cause,
which takes a set of correlational statements and determines the causal
relationship between the variables. We curate a large-scale dataset of more
than 400K samples, on which we evaluate seventeen existing LLMs. Through our
experiments, we identify a key shortcoming of LLMs in terms of their causal
inference skills, and show that these models achieve almost close to random
performance on the task. This shortcoming is somewhat mitigated when we try to
re-purpose LLMs for this skill via finetuning, but we find that these models
still fail to generalize -- they can only perform causal inference in
in-distribution settings when variable names and textual expressions used in
the queries are similar to those in the training set, but fail in
out-of-distribution settings generated by perturbing these queries. Corr2Cause
is a challenging task for LLMs, and would be helpful in guiding future research
on improving LLMs' pure reasoning skills and generalizability. Our data is at
https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at
https://github.com/causalNLP/corr2cause