Model Merging by Uncertainty-Based Gradient Matching
Models trained on different datasets can be merged by weighted averaging of
their parameters, but why does it work and when can it fail? Here, we connect
the inaccuracy of weighted averaging to mismatches in the gradients and propose
a new uncertainty-based scheme that improves performance by reducing the
mismatch. The connection also reveals implicit assumptions in other schemes
such as averaging, task arithmetic, and Fisher-weighted averaging. Our new
method gives consistent improvements for large language models and vision
transformers, both in terms of performance and robustness to hyperparameters.
Comment: Preprint. Under review.
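To make the baseline concrete, below is a minimal PyTorch sketch of the weighted averaging the abstract starts from; the function name and uniform-weight default are illustrative, and the paper's uncertainty-based reweighting (which would replace the scalar weights with per-parameter ones) is not reproduced here.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Merge models by weighted parameter averaging: theta = sum_i w_i * theta_i.

    `state_dicts` come from models trained on different datasets. Plain
    averaging uses uniform scalar weights; Fisher-weighted averaging and the
    paper's uncertainty-based scheme would instead weight each parameter
    individually.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged
```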
Opportunities and Challenges in Neural Dialog Tutoring
Designing dialog tutors has been challenging as it involves modeling the
diverse and complex pedagogical strategies employed by human tutors. Although
there have been significant recent advances in neural conversational systems
using large language models (LLMs) and growth in available dialog corpora,
dialog tutoring has largely remained unaffected by these advances. In this
paper, we rigorously analyze various generative language models on two dialog
tutoring datasets for language learning using automatic and human evaluations
to understand the new opportunities brought by these advances as well as the
challenges we must overcome to build models that would be usable in real
educational settings. We find that although current approaches can model
tutoring in constrained learning scenarios when the number of concepts to be
taught and possible teacher strategies are small, they perform poorly in less
constrained scenarios. Our human quality evaluation shows that both models and
ground-truth annotations exhibit low performance in terms of equitable
tutoring, which measures learning opportunities for students and how engaging
the dialog is. To understand the behavior of our models in a real tutoring
setting, we conduct a user study with expert annotators and find model
reasoning errors in 45% of conversations. Finally, we connect our findings to
outline future work.
Comment: EACL 2023 (main conference, camera-ready).
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems
While automatic dialogue tutors hold great potential in making education
personalized and more accessible, research on such systems has been hampered by
a lack of sufficiently large and high-quality datasets. Collecting such
datasets remains challenging, as recording tutoring sessions raises privacy
concerns and crowdsourcing leads to insufficient data quality. To address this,
we propose a framework to generate such dialogues by pairing human teachers
with a Large Language Model (LLM) prompted to represent common student errors.
We describe how we use this framework to collect MathDial, a dataset of 3k
one-to-one teacher-student tutoring dialogues grounded in multi-step math
reasoning problems. While models like GPT-3 are good problem solvers, they fail
at tutoring because they generate factually incorrect feedback or are prone to
revealing solutions to students too early. To overcome this, we let teachers
provide learning opportunities to students by guiding them using various
scaffolding questions according to a taxonomy of teacher moves. We demonstrate
MathDial and its extensive annotations can be used to finetune models to be
more effective tutors (and not just solvers). We confirm this by automatic and
human evaluation, notably in an interactive setting that measures the trade-off
between student solving success and telling solutions. The dataset is released
publicly.
Comment: Jakub Macina, Nico Daheim, and Sankalan Pal Chowdhury contributed
equally to this work. Accepted at EMNLP 2023 Findings. Code and dataset
available: https://github.com/eth-nlped/mathdial
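As a rough illustration of the collection framework described above, the sketch below pairs a human teacher with an LLM prompted to role-play a student who has made a specific error. The prompt wording, the `gpt-4o-mini` model choice, and the OpenAI-style chat client are assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch of the human-teacher / LLM-student collection loop.
from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the dataset's actual student persona prompt may differ.
STUDENT_SYSTEM_PROMPT = (
    "You are a student working on a multi-step math word problem. "
    "You made this incorrect attempt: {wrong_solution}. Defend your "
    "reasoning, and revise it only when the teacher's scaffolding "
    "questions expose the error."
)

def run_session(problem, wrong_solution, get_teacher_turn):
    """Alternate human-teacher turns with LLM-student turns until the teacher ends the dialog."""
    messages = [
        {"role": "system", "content": STUDENT_SYSTEM_PROMPT.format(wrong_solution=wrong_solution)},
        {"role": "user", "content": f"Problem: {problem}"},
    ]
    dialog = []
    while True:
        teacher_turn = get_teacher_turn(dialog)  # human teacher types a scaffolding move
        if teacher_turn is None:                 # teacher closes the session
            return dialog
        messages.append({"role": "user", "content": teacher_turn})
        student_turn = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": student_turn})
        dialog.append({"teacher": teacher_turn, "student": student_turn})
```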
Uncertainty in Natural Language Generation: From Theory to Applications
Recent advances of powerful Language Models have allowed Natural Language
Generation (NLG) to emerge as an important technology that can not only perform
traditional tasks like summarisation or translation, but also serve as a
natural language interface to a variety of applications. As such, it is crucial
that NLG systems are trustworthy and reliable, for example by indicating when
they are likely to be wrong, and by supporting multiple views, backgrounds, and
writing styles that reflect diverse human sub-populations. In this paper, we
argue that a principled treatment of uncertainty can assist in creating systems
and evaluation protocols better aligned with these goals. We first present the
fundamental theory, frameworks and vocabulary required to represent
uncertainty. We then characterise the main sources of uncertainty in NLG from a
linguistic perspective, and propose a two-dimensional taxonomy that is more
informative and faithful than the popular aleatoric/epistemic dichotomy.
Finally, we move from theory to applications and highlight exciting research
directions that exploit uncertainty to power decoding, controllable generation,
self-assessment, selective answering, active learning, and more.
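As one concrete instance of the selective-answering direction listed above, here is a minimal sketch that abstains when the mean token log-probability of a greedy generation falls below a threshold. The `gpt2` checkpoint, the threshold value, and the scoring rule are illustrative assumptions, not the paper's proposal.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_or_abstain(prompt, threshold=-2.5, max_new_tokens=30):
    """Generate greedily; abstain if the mean token log-probability is below `threshold`."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False,
        return_dict_in_generate=True, output_scores=True,
        pad_token_id=tok.eos_token_id,
    )
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    logprobs = [
        torch.log_softmax(step_logits[0], dim=-1)[tok_id]
        for step_logits, tok_id in zip(out.scores, gen_tokens)
    ]
    mean_lp = torch.stack(logprobs).mean().item()
    if mean_lp < threshold:
        return None  # abstain: the model is too uncertain to answer
    return tok.decode(gen_tokens, skip_special_tokens=True)
```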
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Evaluation in machine learning is usually informed by past choices, for example, which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
Peer reviewed.
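In the spirit of the title, loading one of the benchmark's datasets can plausibly be done in a single line, assuming the GEM datasets are mirrored on the Hugging Face Hub; the `GEM/xsum` identifier below is an assumption, so check the benchmark's documentation for the exact names.

```python
from datasets import load_dataset

# Single-line load of a GEM dataset (identifier assumed; see the GEM docs).
data = load_dataset("GEM/xsum", split="validation")
print(data.column_names)
```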