Fine-Tuning Language Models via Epistemic Neural Networks
Large language models are now part of a powerful new paradigm in machine
learning. These models learn a wide range of capabilities from training on
large unsupervised text corpora. In many applications, these capabilities are
then fine-tuned through additional training on specialized data to improve
performance in that setting. In this paper, we augment these models with an
epinet: a small additional network architecture that helps to estimate model
uncertainty and form an epistemic neural network (ENN). ENNs are neural
networks that can know what they don't know. We show that, using an epinet to
prioritize uncertain data, we can fine-tune BERT on GLUE tasks to the same
performance while using 2x less data. We also investigate performance in
synthetic neural network generative models designed to build understanding. In
each setting, using an epinet outperforms heuristic active learning schemes.
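As a rough illustration of the idea, here is a minimal sketch of an epinet head and uncertainty-based data prioritization. The names (EpinetHead, prioritize_by_variance), the layer sizes, and the use of logit variance across sampled indices as the uncertainty readout are all illustrative assumptions, not the paper's implementation; in the actual setup the features would come from a frozen BERT encoder.

```python
# Sketch: epistemic neural network (ENN) = base network + small epinet head.
# The ENN output is base_logits + epinet(features, z) for a random index z;
# variation over sampled z is read as epistemic uncertainty.
import torch
import torch.nn as nn


class EpinetHead(nn.Module):
    """Small additive network conditioned on a random index z (illustrative)."""

    def __init__(self, feature_dim: int, index_dim: int, num_classes: int, hidden: int = 50):
        super().__init__()
        self.index_dim = index_dim
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + index_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, features: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Treat base features as fixed inputs to the epinet (stop-gradient),
        # and broadcast one index vector z across the batch.
        z_batch = z.expand(features.shape[0], -1)
        return self.mlp(torch.cat([features.detach(), z_batch], dim=-1))


@torch.no_grad()
def prioritize_by_variance(base_logits, features, epinet, num_index_samples=10):
    """Score examples by variance of ENN logits across sampled indices z."""
    samples = []
    for _ in range(num_index_samples):
        z = torch.randn(1, epinet.index_dim)
        samples.append(base_logits + epinet(features, z))
    logits = torch.stack(samples)            # [samples, batch, classes]
    scores = logits.var(dim=0).mean(dim=-1)  # per-example uncertainty
    return scores.argsort(descending=True)   # most uncertain first


# Usage with dummy tensors standing in for a frozen encoder's outputs.
features = torch.randn(8, 768)
base_logits = torch.randn(8, 3)
epinet = EpinetHead(feature_dim=768, index_dim=8, num_classes=3)
order = prioritize_by_variance(base_logits, features, epinet)
print(order[:4])  # indices of the 4 most uncertain examples
```

Fine-tuning on the examples this scoring ranks highest is one simple way to realize the "prioritize uncertain data" step the abstract describes.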
Fine-tuning language models to find agreement among humans with diverse preferences
Recent work on large language models (LLMs) has used fine-tuning to align
outputs with the preferences of a prototypical user. This work assumes that
human preferences are static and homogeneous across individuals, so that
aligning to a single "generic" user will confer more general alignment. Here,
we embrace the heterogeneity of human preferences to consider a different
challenge: how might a machine help people with diverse views find agreement?
We fine-tune a 70 billion parameter LLM to generate statements that maximize
the expected approval for a group of people with potentially diverse opinions.
Human participants provide written opinions on thousands of questions touching
on moral and political issues (e.g., "should we raise taxes on the rich?"), and
rate the LLM's generated candidate consensus statements for agreement and
quality. A reward model is then trained to predict individual preferences,
enabling it to quantify and rank consensus statements in terms of their appeal
to the overall group, defined according to different aggregation (social
welfare) functions. The model produces consensus statements that are preferred
by human users over those from prompted LLMs (>70%) and significantly
outperforms a strong fine-tuned baseline that lacks the final ranking step.
Further, our best model's consensus statements are preferred over the best
human-generated opinions (>65%). We find that when we silently constructed
consensus statements from only a subset of group members, those who were
excluded were more likely to dissent, revealing the sensitivity of the
consensus to individual contributions. These results highlight the potential to
use LLMs to help groups of humans align their values with one another.
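To make the aggregation step concrete, here is a minimal sketch of ranking candidate consensus statements from per-member predicted rewards under different social welfare functions. The two functions shown (utilitarian mean, Rawlsian min) are standard choices used for illustration; the abstract does not specify which aggregation functions the paper evaluates, and the reward matrix below is hypothetical.

```python
# Sketch: rank candidate consensus statements by a social welfare function
# applied to a reward model's per-member scores.
import numpy as np


def rank_statements(rewards: np.ndarray, welfare: str = "utilitarian") -> np.ndarray:
    """rewards[i, j] = predicted reward of statement i for group member j.

    Returns statement indices sorted from most to least preferred by the group.
    """
    if welfare == "utilitarian":   # maximize average predicted approval
        scores = rewards.mean(axis=1)
    elif welfare == "rawlsian":    # maximize the worst-off member's approval
        scores = rewards.min(axis=1)
    else:
        raise ValueError(f"unknown welfare function: {welfare}")
    return np.argsort(-scores)


# Hypothetical reward matrix: 3 candidate statements, 4 group members.
rewards = np.array([
    [0.9, 0.2, 0.8, 0.1],  # polarizing: loved by some, rejected by others
    [0.6, 0.5, 0.6, 0.5],  # broadly acceptable
    [0.3, 0.3, 0.4, 0.2],
])
print(rank_statements(rewards, "utilitarian"))  # [1 0 2]
print(rank_statements(rewards, "rawlsian"))     # [1 2 0]
```

Note how the two welfare functions disagree on the polarizing statement: it ranks second under the utilitarian mean but last under the Rawlsian min, which is exactly the kind of sensitivity to excluded or dissenting members the abstract's subset experiment probes.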