Efficient Hierarchical Domain Adaptation for Pretrained Language Models
The remarkable success of large language models has been driven by dense
models trained on massive unlabeled, unstructured corpora. These corpora
typically contain text from diverse, heterogeneous sources, but information
about the source of the text is rarely used during training. Transferring their
knowledge to a target domain is typically done by continuing training
in-domain. In this paper, we introduce a method to permit domain adaptation to
many diverse domains using a computationally efficient adapter approach. Our
method is based on the observation that textual domains are partially
overlapping, and we represent domains as a hierarchical tree structure where
each node in the tree is associated with a set of adapter weights. When
combined with a frozen pretrained language model, this approach enables
parameter sharing among related domains, while avoiding negative interference
between unrelated ones. Experimental results with GPT-2 and a large fraction of
the 100 most represented websites in C4 show across-the-board improvements
in-domain. We additionally provide an inference time algorithm for a held-out
domain and show that averaging over multiple paths through the tree enables
further gains in generalization, while adding only a marginal cost to
inference.
Comment: NAACL 2022 accepted paper, camera-ready version
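The tree structure described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy tree, the domain names, and the helper names (`PARENT`, `path_to_root`, `shared_nodes`, `average_over_paths`) are all hypothetical, and uniform averaging of output distributions is an assumption, since the abstract only says that averaging over multiple paths helps on a held-out domain.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: child -> parent (None marks the root).
# Each node would own one set of adapter weights.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "news": "root",
    "science": "root",
    "nytimes.com": "news",
    "bbc.com": "news",
    "nature.com": "science",
}


def path_to_root(node: str) -> List[str]:
    """Return the nodes from the root down to `node`.

    A domain at `node` would activate the adapters of every node on this path.
    """
    path: List[str] = []
    cur: Optional[str] = node
    while cur is not None:
        path.append(cur)
        cur = PARENT[cur]
    return path[::-1]


def shared_nodes(a: str, b: str) -> List[str]:
    """Nodes (and hence adapters) that two domains have in common."""
    path_b = set(path_to_root(b))
    return [n for n in path_to_root(a) if n in path_b]


def average_over_paths(probs_per_path: List[List[float]]) -> List[float]:
    """Uniformly average next-token distributions produced by several paths.

    Assumption: uniform weights; the abstract does not specify the weighting.
    """
    n = len(probs_per_path)
    return [sum(col) / n for col in zip(*probs_per_path)]


print(path_to_root("bbc.com"))
print(shared_nodes("nytimes.com", "bbc.com"))
print(shared_nodes("bbc.com", "nature.com"))
```

In this toy tree the two news sites share the `root` and `news` adapters, while a news site and a science site share only `root`, which is the kind of parameter sharing among related domains the abstract describes.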
Stubborn Lexical Bias in Data and Models
In NLP, recent work has seen increased focus on spurious correlations between
various features and labels in training data, and how these influence model
behavior. However, the presence and effect of such correlations are typically
examined feature by feature. We investigate the cumulative impact on a model of
many such intersecting features. Using a new statistical method, we examine
whether such spurious patterns in data appear in models trained on the data. We
select two tasks -- natural language inference and duplicate-question detection
-- for which any unigram feature on its own should ideally be uninformative,
which gives us a large pool of automatically extracted features with which to
experiment. The large size of this pool allows us to investigate the
intersection of features spuriously associated with (potentially different)
labels. We then apply an optimization approach to *reweight* the training data,
reducing thousands of spurious correlations, and examine how doing so affects
models trained on the reweighted data. Surprisingly, though this method can
successfully reduce lexical biases in the training data, we still find strong
evidence of corresponding bias in the trained models, including worsened bias
for slightly more complex features (bigrams). We close with discussion about
the implications of our results on what it means to "debias" training data, and
how issues of data quality can affect model bias.
Comment: ACL Findings 202
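To make the idea of a unigram that is spuriously associated with a label concrete, here is a minimal sketch. It is not the statistical method or the optimization procedure from the paper: the function `label_rates`, the toy examples, and the hand-picked weights are hypothetical, and serve only to show how per-example weights change the label distribution observed for a single word.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def label_rates(
    examples: List[Tuple[str, str]],
    weights: Optional[List[float]] = None,
) -> Dict[str, Dict[str, float]]:
    """For each word, return the weighted share of each label among
    the examples that contain that word."""
    if weights is None:
        weights = [1.0] * len(examples)
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for (text, label), w in zip(examples, weights):
        # Count each word once per example.
        for word in set(text.lower().split()):
            totals[word][label] += w
    return {
        word: {lab: c / sum(counts.values()) for lab, c in counts.items()}
        for word, counts in totals.items()
    }


# Hypothetical toy data: "not" co-occurs mostly with "contradiction".
examples = [
    ("the man is not sleeping", "contradiction"),
    ("the dog is not outside", "contradiction"),
    ("she is not alone", "entailment"),
    ("a woman is outside", "entailment"),
]

unweighted = label_rates(examples)
# Hand-picked weights (an assumption, not an optimized solution):
# up-weighting the third example balances the labels seen with "not".
reweighted = label_rates(examples, weights=[1.0, 1.0, 2.0, 1.0])

print(unweighted["not"])
print(reweighted["not"])
```

With uniform weights, two thirds of the examples containing "not" are contradictions; with the hand-picked weights the split is even. The paper's finding is that removing such correlations from the data does not by itself remove the corresponding bias from models trained on it.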
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
Pretrained language models (PLMs) are trained on massive corpora, but often
need to specialize to specific domains. A parameter-efficient adaptation method
suggests training an adapter for each domain on the task of language modeling.
This leads to good in-domain scores but can be impractical for domain- or
resource-restricted settings. A solution is to use a related-domain adapter for
the novel domain at test time. In this paper, we introduce AdapterSoup, an
approach that performs weight-space averaging of adapters trained on different
domains. Our approach is embarrassingly parallel: first, we train a set of
domain-specific adapters; then, for each novel domain, we determine which
adapters should be averaged at test time. We present extensive experiments
showing that AdapterSoup consistently improves performance on new domains
without extra training. We also explore weight averaging of adapters trained on
the same domain with different hyper-parameters, and show that it preserves the
performance of a PLM on new domains while obtaining strong in-domain results.
We explore various approaches for choosing which adapters to combine, such as
text clustering and semantic similarity. We find that using clustering leads to
the most competitive results on novel domains.
Comment: Accepted at EACL 2023; camera-ready version
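Weight-space averaging of adapters, as described in this abstract, can be sketched as follows. This is an illustration under stated assumptions rather than the AdapterSoup code: adapters are represented as plain dictionaries mapping hypothetical parameter names to flat lists of floats, the average is uniform, and the step that selects which adapters to average (for example by clustering) is not shown.

```python
from typing import Dict, List

Adapter = Dict[str, List[float]]


def average_adapters(adapters: List[Adapter]) -> Adapter:
    """Return the element-wise mean of several adapters.

    All adapters must have the same parameter names and shapes.
    """
    if not adapters:
        raise ValueError("need at least one adapter")
    keys = set(adapters[0])
    for a in adapters[1:]:
        if set(a) != keys:
            raise ValueError("adapters must share the same parameter names")
    n = len(adapters)
    return {
        k: [sum(vals) / n for vals in zip(*(a[k] for a in adapters))]
        for k in adapters[0]
    }


# Hypothetical adapters trained on two related domains.
reviews_adapter = {"down_proj": [1.0, 3.0], "up_proj": [0.0, 2.0]}
forums_adapter = {"down_proj": [3.0, 5.0], "up_proj": [2.0, 2.0]}

soup = average_adapters([reviews_adapter, forums_adapter])
print(soup)
```

Because the averaged adapter has the same shape as a single adapter, inference with the result costs the same as inference with one adapter, and no further training is involved.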
Language Models Hallucinate, but May Excel at Fact Verification
Recent progress in natural language processing (NLP) owes much to remarkable
advances in large language models (LLMs). Nevertheless, LLMs frequently
"hallucinate," resulting in non-factual outputs. Our carefully designed human
evaluation substantiates the serious hallucination issue, revealing that even
GPT-3.5 produces factual outputs less than 25% of the time. This underscores
the importance of fact verifiers in order to measure and incentivize progress.
Our systematic investigation affirms that LLMs can be repurposed as effective
fact verifiers with strong correlations with human judgments, at least in the
Wikipedia domain. Surprisingly, FLAN-T5-11B, the least factual generator in our
study, performs the best as a fact verifier, even outperforming more capable
LLMs like GPT-3.5 and ChatGPT. Delving deeper, we analyze the reliance of these
LLMs on high-quality evidence, as well as their deficiencies in robustness and
generalization ability. Our study presents insights for developing trustworthy
generation models.
Comment: 9 pages