8 research outputs found
Targeted Data Generation: Finding and Fixing Model Weaknesses
Even when aggregate accuracy is high, state-of-the-art NLP models often fail
systematically on specific subgroups of data, resulting in unfair outcomes and
eroding user trust. Additional data collection may not help in addressing these
weaknesses, as such challenging subgroups may be unknown to users and
underrepresented in both existing and newly collected data. We propose Targeted Data
Generation (TDG), a framework that automatically identifies challenging
subgroups, and generates new data for those subgroups using large language
models (LLMs) with a human in the loop. TDG estimates the expected benefit and
potential harm of data augmentation for each subgroup, and selects the ones
most likely to improve within-group performance without hurting overall
performance. In our experiments, TDG significantly improves the accuracy on
challenging subgroups for state-of-the-art sentiment analysis and natural
language inference models, while also improving overall test accuracy.
Comment: Accepted to ACL 2023
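To make the selection step concrete, here is a minimal Python sketch of the loop the abstract describes: estimate each subgroup's expected benefit and potential harm, keep the subgroups where the trade-off is favorable, and have an LLM draft new data for a human to review. All names and the scoring rule (Subgroup, expected_benefit, potential_harm, generate_for) are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass

@dataclass
class Subgroup:
    name: str
    examples: list  # validation examples belonging to this subgroup

def expected_benefit(model, group) -> float:
    """Estimated in-group accuracy gain from augmenting this subgroup,
    e.g., measured after finetuning on a small pilot batch (assumption)."""
    ...

def potential_harm(model, group) -> float:
    """Estimated drop in overall accuracy caused by the same pilot
    augmentation (assumption)."""
    ...

def select_subgroups(model, subgroups):
    """Keep subgroups whose estimated gain outweighs estimated harm."""
    scored = [(expected_benefit(model, g) - potential_harm(model, g), g)
              for g in subgroups]
    return [g for score, g in sorted(scored, key=lambda t: -t[0]) if score > 0]

def generate_for(llm, group, n=100):
    """Ask an LLM for new labeled examples in the subgroup's style; a human
    reviews them before they enter the training set (human in the loop)."""
    seed = "\n".join(str(e) for e in group.examples[:5])
    return llm(f"Write {n} new labeled examples similar to:\n{seed}")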
Prompt Engineering a Prompt Engineer
Prompt engineering is a challenging yet crucial task for optimizing the
performance of large language models (LLMs). It requires complex reasoning to
examine the model's errors, hypothesize what is missing or misleading in the
current prompt, and communicate the task with clarity. While recent works
indicate that LLMs can be meta-prompted to perform automatic prompt
engineering, their potential may not be fully realized, as the meta-prompt
lacks sufficient guidance to elicit the LLM's complex reasoning capabilities.
In this work, we investigate the problem of "prompt engineering a
prompt engineer" -- constructing a meta-prompt that more effectively guides
LLMs to perform automatic prompt engineering. We introduce and analyze key
components, such as a step-by-step reasoning template and context
specification, which lead to improved performance. In addition, inspired by
common optimization concepts such as batch size, step size and momentum, we
introduce their verbalized counterparts to the meta-prompt and investigate
their effects. Our final method, named PE2, finds a prompt that outperforms
"let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the
GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction
Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world
industrial prompt. In these settings, PE2 achieves strong performance and
outperforms prior automatic prompt engineering baselines. Further, we show that
PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete
prompts, and demonstrates non-trivial counterfactual reasoning abilities.
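As a rough illustration, the meta-prompt components named above can be sketched as a template plus one refinement step. The wording, section layout, and the refine helper below are assumptions for illustration, not the actual PE2 template.

# Hypothetical PE2-style meta-prompt; the exact wording is an assumption.
META_PROMPT = """\
# Task context ("context specification")
{task_description}

# Current prompt
{current_prompt}

# A batch of {batch_size} failure cases (verbalized "batch size")
{error_batch}

# Step-by-step reasoning template
1. For each failure, explain how the current prompt misled the model.
2. Hypothesize what is missing or misleading in the prompt.
3. Propose an edited prompt, changing at most {step_size} phrases
   (verbalized "step size") and staying consistent with the edit
   history below (verbalized "momentum"):
{edit_history}
"""

def refine(llm, prompt, task, errors, history, batch_size=4, step_size=2):
    """One optimization step: show a batch of errors, request a bounded edit."""
    return llm(META_PROMPT.format(
        task_description=task,
        current_prompt=prompt,
        batch_size=batch_size,
        error_batch="\n".join(errors[:batch_size]),
        step_size=step_size,
        edit_history="\n".join(history),
    ))  # returns a proposed new prompt; loop until validation accuracy plateaus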
MaskTune: Mitigating Spurious Correlations by Forcing to Explore
A fundamental challenge of over-parameterized deep learning models is
learning meaningful data representations that yield good performance on a
downstream task without over-fitting spurious input features. This work
proposes MaskTune, a masking strategy that prevents over-reliance on spurious
(or a limited number of) features. MaskTune forces the trained model to explore
new features during a single epoch of finetuning by masking previously discovered
features. MaskTune, unlike earlier approaches for mitigating shortcut learning,
does not require any supervision, such as annotating spurious features or
labels for subgroup samples in a dataset. Our empirical results on Biased
MNIST, CelebA, Waterbirds, and ImageNet-9L datasets show that MaskTune is
effective on tasks that often suffer from the existence of spurious
correlations. Finally, we show that MaskTune outperforms or achieves similar
performance to the competing methods when applied to the selective
classification (classification with rejection option) task. Code for MaskTune
is available at https://github.com/aliasgharkhani/Masktune.
Comment: Accepted to NeurIPS 2022
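For intuition, a short PyTorch-style sketch of the recipe: find the input features the trained model currently relies on, mask them, and finetune for exactly one epoch on the masked inputs. The saliency estimate (raw input gradients), the masking quantile, and the optimizer settings are assumptions; the repository above has the authors' actual implementation.

import torch
import torch.nn.functional as F

def mask_discovered_features(model, x, y, quantile=0.9):
    """Zero out the features the model leans on most, estimated here by
    input-gradient magnitude (an assumption; any saliency map would do)."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    sal = x.grad.abs()
    thresh = torch.quantile(sal.flatten(1), quantile, dim=1)
    keep = (sal <= thresh.view(-1, *([1] * (sal.dim() - 1)))).float()
    return x.detach() * keep   # previously discovered features are masked out

def masktune(model, loader, lr=1e-4):
    """Single-epoch finetuning on masked inputs, forcing exploration."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:                      # exactly one pass over the data
        x_masked = mask_discovered_features(model, x, y)
        opt.zero_grad()                      # clear grads from the saliency pass
        F.cross_entropy(model(x_masked), y).backward()
        opt.step()
    return model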
Learning precise partial semantic mappings via linear algebra
Thesis: S.M. in Computer Science and Engineering, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. Cataloged from PDF version of thesis. Includes bibliographical references (pages 41-42).
In natural language interfaces, having high precision, i.e., abstaining when the system is unsure, is critical for a good user experience. However, most NLP systems are trained to maximize accuracy, with precision as an afterthought. In this thesis, we put precision first and ask: can we learn to map parts of a sentence to logical predicates with absolute certainty? To tackle this question, we model semantic mappings from words to predicates as matrices, which allows us to reason efficiently over the entire space of semantic mappings consistent with the training data. We prove that our method obtains 100% precision. Empirically, we demonstrate the effectiveness of our approach on the GeoQuery dataset.
by Fereshte Khani. S.M. in Computer Science and Engineering.
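The matrix formulation admits a compact NumPy illustration: if each semantic mapping is a matrix M with X @ M = Y on the training data, the consistent mappings form an affine set, and a new input x receives the same prediction from all of them exactly when x is orthogonal to the null space of X. This is a hedged sketch of that linear-algebra idea, not the thesis's actual algorithm; names and tolerances are assumptions.

import numpy as np

def consistent_mappings(X, Y):
    """One mapping M0 with X @ M0 = Y, plus a basis for null(X); every
    consistent mapping is M0 shifted along the null space (this assumes
    the training system is exactly solvable)."""
    M0, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, s, Vt = np.linalg.svd(X)
    rank = int((s > 1e-10).sum())
    return M0, Vt[rank:]                 # rows of Vt[rank:] span null(X)

def predict_or_abstain(M0, null_basis, x, tol=1e-8):
    """100% precision rule: answer only when all consistent mappings agree,
    i.e., when x has no component in the null space of X."""
    if null_basis.size and np.max(np.abs(null_basis @ x)) > tol:
        return None                      # unsure: abstain
    return x @ M0                        # unanimous across all consistent M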
Unanimous Prediction for 100% Precision with Application to Learning Semantic Mappings
© 2016 Association for Computational Linguistics.
Can we train a system that, on any new input, either says "don't know" or makes a prediction that is guaranteed to be correct? We answer the question in the affirmative, provided our model family is well-specified. Specifically, we introduce the unanimity principle: only predict when all models consistent with the training data predict the same output. We operationalize this principle for semantic parsing, the task of mapping utterances to logical forms. We develop a simple, efficient method that reasons over the infinite set of all consistent models by only checking two of the models. We prove that our method obtains 100% precision even with a modest amount of training data from a possibly adversarial distribution. Empirically, we demonstrate the effectiveness of our approach on the standard GeoQuery dataset.
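The "check only two models" idea can be sketched in the same linear setting as the thesis above: take one consistent model and a second one perturbed along a random null-space direction of the training data; if the two agree on a new input, then for a generic perturbation all consistent models agree. This construction is an illustrative reconstruction under that linear assumption, not the paper's exact method.

import numpy as np

def two_consistent_models(X, Y, seed=0):
    """Two models fitting X @ M = Y whose agreement on an input certifies,
    for a generic random direction, agreement of ALL consistent models."""
    M0, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, s, Vt = np.linalg.svd(X)
    null_basis = Vt[int((s > 1e-10).sum()):]       # rows span null(X)
    if null_basis.size == 0:
        return M0, M0                              # the model is unique
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(null_basis.shape[0]) @ null_basis  # d in null(X)
    M1 = M0 + np.outer(d, np.ones(M0.shape[1]))    # X @ M1 = X @ M0 = Y
    return M0, M1

def unanimous_predict(M0, M1, x, tol=1e-8):
    """Predict only when both consistent models agree on x; else abstain."""
    p0, p1 = x @ M0, x @ M1
    return p0 if np.max(np.abs(p0 - p1)) <= tol else None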