19 research outputs found
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Fine-tuning pre-trained transformer-based language models such as BERT has
become a common practice dominating leaderboards across various NLP benchmarks.
Despite the strong empirical performance of fine-tuned models, fine-tuning is
an unstable process: training the same model with multiple random seeds can
result in large variance in task performance. Previous literature (Devlin
et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential
reasons for the observed instability: catastrophic forgetting and the small
size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to
explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT,
fine-tuned on three commonly used datasets from the GLUE benchmark, and show
that the observed instability is caused by optimization difficulties that lead
to vanishing gradients. Additionally, we show that the remaining variance of
the downstream task performance can be attributed to differences in
generalization: fine-tuned models with the same training loss exhibit
noticeably different test performance. Based on our analysis, we present a
simple but strong baseline that makes fine-tuning BERT-based models
significantly more stable than the previously proposed approaches. Code to
reproduce our results is available online:
https://github.com/uds-lsv/bert-stable-fine-tuning
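
As an illustration of the instability described above, the following minimal
PyTorch/transformers sketch fine-tunes the same BERT checkpoint under several random
seeds on a toy sentiment task, reports the spread of the resulting accuracies, and logs
the gradient norm of the first encoder layer, the quantity whose collapse toward zero
signals the vanishing-gradient issue the authors identify. The toy texts, step count,
and monitored layer are illustrative assumptions, not the paper's actual setup or its
proposed baseline.

import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any BERT-family checkpoint works here
texts = ["a great movie", "a dull movie", "loved it", "hated it"]  # toy placeholders
labels = torch.tensor([1, 0, 1, 0])

def finetune(seed, steps=20, lr=2e-5):
    torch.manual_seed(seed)
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    opt = AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        loss = model(**batch, labels=labels).loss
        loss.backward()
        # Gradient norm of the first encoder layer; values collapsing toward zero
        # over training are the vanishing-gradient symptom discussed in the paper.
        g = sum(p.grad.norm().item() for n, p in model.named_parameters()
                if "encoder.layer.0." in n and p.grad is not None)
        print(f"seed={seed} step={step} loss={loss.item():.4f} grad_norm_layer0={g:.6f}")
        opt.step()
        opt.zero_grad()
    model.eval()
    with torch.no_grad():
        preds = model(**batch).logits.argmax(-1)
    return (preds == labels).float().mean().item()

accs = [finetune(seed) for seed in range(5)]
print("per-seed accuracy:", accs, "std dev:", torch.tensor(accs).std().item())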
Weaker Than You Think: A Critical Look at Weakly Supervised Learning
Weakly supervised learning is a popular approach for training machine
learning models in low-resource settings. Instead of requesting high-quality
yet costly human annotations, it allows training models with noisy annotations
obtained from various weak sources. Recently, many sophisticated approaches
have been proposed for robust training under label noise, reporting impressive
results. In this paper, we revisit the setup of these approaches and find that
the benefits brought by these approaches are significantly overestimated.
Specifically, we find that the success of existing weakly supervised learning
approaches heavily relies on the availability of clean validation samples,
which, as we show, can be leveraged much more efficiently by simply training on
them. After using these clean labels in training, the advantages of using these
sophisticated approaches are mostly wiped out. This remains true even when
reducing the size of the available clean data to just five samples per class,
making these approaches impractical. To understand the true value of weakly
supervised learning, we thoroughly analyse diverse NLP datasets and tasks to
ascertain when and why weakly supervised approaches work, and provide
recommendations for future research.
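
The comparison the abstract argues for, training directly on the small clean (validation)
set rather than on large amounts of weak labels, can be sketched as follows. The
20 Newsgroups data, the simulated 30% label noise, and the TF-IDF plus logistic-regression
pipeline are illustrative stand-ins for the NLP datasets and models used in the paper;
the sketch only shows the experimental contrast, not the paper's results.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer(max_features=20000)
Xtr, Xte = vec.fit_transform(train.data), vec.transform(test.data)
y_clean, y_te = np.array(train.target), np.array(test.target)

# Simulate weak supervision: flip 30% of the training labels at random.
noisy = y_clean.copy()
flip = rng.random(len(noisy)) < 0.3
noisy[flip] = 1 - noisy[flip]

# (a) Train on all noisy (weak) labels.
acc_noisy = accuracy_score(
    y_te, LogisticRegression(max_iter=1000).fit(Xtr, noisy).predict(Xte))

# (b) Train on just 5 clean samples per class, i.e. the "validation" data that
# weakly supervised methods typically assume is available for model selection.
idx = np.concatenate([rng.choice(np.where(y_clean == c)[0], 5, replace=False)
                      for c in (0, 1)])
acc_clean5 = accuracy_score(
    y_te, LogisticRegression(max_iter=1000).fit(Xtr[idx], y_clean[idx]).predict(Xte))

print(f"noisy-label training: {acc_noisy:.3f}   5 clean samples/class: {acc_clean5:.3f}")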
Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation
Few-shot fine-tuning and in-context learning are two alternative strategies
for task adaptation of pre-trained language models. Recently, in-context
learning has gained popularity over fine-tuning due to its simplicity and
improved out-of-domain generalization, and due to extensive evidence that
fine-tuned models pick up on spurious correlations. Unfortunately,
previous comparisons of the two approaches were done using models of different
sizes. This raises the question of whether the observed weaker out-of-domain
generalization of fine-tuned models is an inherent property of fine-tuning or a
limitation of the experimental setup. In this paper, we compare the
generalization of few-shot fine-tuning and in-context learning to challenge
datasets, while controlling for the models used, the number of examples, and
the number of parameters, ranging from 125M to 30B. Our results show that
fine-tuned language models can in fact generalize well out-of-domain. We find
that both approaches generalize similarly; they exhibit large variation and
depend on properties such as model size and the number of examples,
highlighting that robust task adaptation remains a challenge.
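
A minimal sketch of the two adaptation strategies being compared, using GPT-2 as an
illustrative stand-in for the 125M to 30B parameter models in the study: (a) in-context
learning places the k demonstrations in the prompt and makes no weight updates, while
(b) few-shot fine-tuning updates the weights on those same k examples; both are evaluated
by scoring the candidate label words. The prompt format, examples, and hyperparameters
are assumptions for illustration only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
demos = [("loved every minute of it", "positive"),
         ("a complete waste of time", "negative")]
query = "the plot was dull and predictable"
label_words = ["positive", "negative"]

def label_score(prefix, label):
    """Log-probability the model assigns to `label` right after `prefix`."""
    full = tok(prefix + " " + label, return_tensors="pt").input_ids
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(-1)
    # Sum log-probs of the label's subword tokens; position i is predicted at i-1.
    return sum(logprobs[0, i - 1, full[0, i]].item()
               for i in range(n_prefix, full.shape[1]))

def predict(prefix):
    return max(label_words, key=lambda l: label_score(prefix, l))

# (a) In-context learning: demonstrations live in the prompt, no weight updates.
icl_prompt = "".join(f"Review: {t}\nSentiment: {l}\n\n" for t, l in demos)
icl_prompt += f"Review: {query}\nSentiment:"
print("ICL prediction:", predict(icl_prompt))

# (b) Few-shot fine-tuning: update the weights on the same k examples instead.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(10):
    for text, lab in demos:
        ids = tok(f"Review: {text}\nSentiment: {lab}", return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
model.eval()
print("Fine-tuned prediction:", predict(f"Review: {query}\nSentiment:"))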
On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers
This contribution seeks to provide a rational probabilistic explanation for the intelligibility
of words in a genetically related language that is unknown to the reader, a phenomenon
referred to as intercomprehension. In this research domain, linguistic distance, among
other factors, has been shown to correlate well with the mutual intelligibility of individual words.
However, the role of sentence context in the intelligibility of target words has been the
subject of only a few studies. To address this, we analyze data from web-based experiments in
which Czech (CS) respondents were asked to translate highly predictable target words at
the final position of Polish sentences. We compare correlations of target word intelligibility
with data from 3-gram language models (LMs) to their correlations with data obtained from
context-aware LMs. More specifically, we evaluate two context-aware LM architectures:
Long Short-Term Memory (LSTM) networks, which can in theory take arbitrarily long-distance
dependencies into account, and Transformer-based LMs, which can access the whole
input sequence at once. We investigate how their use of context affects surprisal
and its correlation with intelligibility.
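
A sketch of how such a correlation analysis can be set up with a Transformer LM: compute the
surprisal of each sentence-final target word given its Polish sentence context and correlate
it with the per-item intelligibility scores. The "gpt2" checkpoint, the example sentences, and
the intelligibility values below are placeholders; the study used LMs trained on Polish data
and responses collected from Czech participants.

import math
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: substitute a causal LM trained on Polish
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def surprisal(context, target):
    """Surprisal (in bits) of `target` given the preceding sentence `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # Sum over the target's subword tokens; the token at position i is predicted at i-1.
    nats = -sum(logprobs[0, i - 1, full_ids[0, i]].item()
                for i in range(ctx_ids.shape[1], full_ids.shape[1]))
    return nats / math.log(2)

# Toy placeholders: (sentence context, final target word, intelligibility score in [0, 1]).
items = [
    ("Rano pije kawę z mlekiem i", "cukrem", 0.92),
    ("W zimie często pada", "śnieg", 0.85),
    ("Na obiad zjadł talerz gorącej", "zupy", 0.64),
]
surprisals = [surprisal(ctx, tgt) for ctx, tgt, _ in items]
scores = [s for _, _, s in items]
print("Pearson:", pearsonr(surprisals, scores))
print("Spearman:", spearmanr(surprisals, scores))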