Prompt Engineering a Prompt Engineer
Prompt engineering is a challenging yet crucial task for optimizing the
performance of large language models (LLMs). It requires complex reasoning to
examine the model's errors, hypothesize what is missing or misleading in the
current prompt, and communicate the task with clarity. While recent works
indicate that LLMs can be meta-prompted to perform automatic prompt
engineering, their full potential may remain untapped because the meta-prompt
provides insufficient guidance to elicit complex reasoning from the LLM. In this
work, we investigate the problem of "prompt engineering a
prompt engineer" -- constructing a meta-prompt that more effectively guides
LLMs to perform automatic prompt engineering. We introduce and analyze key
components, such as a step-by-step reasoning template and context
specification, which lead to improved performance. In addition, inspired by
common optimization concepts such as batch size, step size and momentum, we
introduce their verbalized counterparts to the meta-prompt and investigate
their effects. Our final method, named PE2, finds a prompt that outperforms
"let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the
GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction
Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world
industrial prompt. In these settings, PE2 achieves strong performance and
outperforms prior automatic prompt engineering baselines. Further, we show that
PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete
prompts, and exhibits non-trivial counterfactual reasoning abilities.
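
To make the verbalized optimizer idea above concrete, the following is a minimal, hypothetical sketch of a meta-prompt template that exposes batch size, step size, and momentum as natural-language instructions. The wording, variable names, and defaults are illustrative assumptions, not the actual PE2 meta-prompt.

# Hypothetical sketch of a meta-prompt with verbalized optimization concepts.
# The template wording below is illustrative only, not the actual PE2 meta-prompt.

META_PROMPT = """You are optimizing a task prompt.

Current prompt:
{current_prompt}

Failed examples (batch size = {batch_size} examples per step):
{failed_examples}

Follow this step-by-step reasoning template:
1. For each failed example, explain why the current prompt led to the error.
2. Hypothesize what is missing or misleading in the current prompt.
3. Propose an edit that changes at most {step_size} words (step size).

Momentum: these are the edits from previous steps; stay consistent with them
unless they are clearly wrong:
{edit_history}

Return only the revised prompt."""


def build_meta_prompt(current_prompt, failed_examples, edit_history,
                      batch_size=4, step_size=10):
    """Fill the template with a batch of failures and the edit history."""
    return META_PROMPT.format(
        current_prompt=current_prompt,
        batch_size=batch_size,
        failed_examples="\n".join(failed_examples[:batch_size]),
        step_size=step_size,
        edit_history="\n".join(edit_history) or "(none)",
    )
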
Measuring Innovation
This thesis examines the innovation premium metric to determine how well it
measures the innovation potential of companies, as determined by investor
sentiment. The innovation premium is the proportion of a company’s market
capitalization that exceeds the net present value of the company’s cash flows from
its current products in its current markets. Through the use of the annual Forbes
lists of the World’s Most Innovative Companies, a stock analysis is conducted to
test the validity of the innovation premium measure. High innovation premium
values indicate an increased likelihood of innovation occurring and a higher
probability of success, but even for companies with the highest innovation
premiums, there remains a large risk of failure. This thesis investigates the
innovation premium values of both innovative companies and a control group in
order to draw conclusions about the validity of the innovation premium metric.
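
As a worked illustration of the definition above, here is a minimal sketch that computes an innovation premium as the share of market capitalization not explained by the net present value of cash flows from the current business. The cash-flow figures and the flat discount rate are made-up assumptions for the example only.

# Illustrative computation of an innovation premium as described above.
# The figures and the flat discount rate are assumptions for the example only.

def npv(cash_flows, discount_rate):
    """Net present value of a list of yearly cash flows (year 1, 2, ...)."""
    return sum(cf / (1 + discount_rate) ** t
               for t, cf in enumerate(cash_flows, start=1))


def innovation_premium(market_cap, existing_cash_flows, discount_rate):
    """Share of market capitalization exceeding the NPV of current-business cash flows."""
    npv_existing = npv(existing_cash_flows, discount_rate)
    return (market_cap - npv_existing) / market_cap


if __name__ == "__main__":
    # Hypothetical company: $100B market cap, $8B yearly cash flow for 10 years.
    premium = innovation_premium(
        market_cap=100e9,
        existing_cash_flows=[8e9] * 10,
        discount_rate=0.08,
    )
    print(f"Innovation premium: {premium:.1%}")  # ~46% under these assumptions
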
In-Context Demonstration Selection with Cross Entropy Difference
Large language models (LLMs) can use in-context demonstrations to improve
performance on zero-shot tasks. However, selecting the best in-context examples
is challenging because model performance can vary widely depending on the
selected examples. We present a cross-entropy difference (CED) method for
selecting in-context demonstrations. Our method is based on the observation
that the effectiveness of in-context demonstrations negatively correlates with
the perplexity of the test example under a language model that was finetuned on
that demonstration. We utilize parameter-efficient finetuning to train small
models on training data that are used for computing the cross-entropy
difference between a test example and every candidate in-context demonstration.
This metric is used to rank and select in-context demonstrations independently
for each test input. We evaluate our method on a mixed-domain dataset that
combines 8 benchmarks, representing 4 text generation tasks, showing that CED
for in-context demonstration selection can improve performance for a variety of
LLMs.
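
The following is a rough sketch of the ranking step described above, assuming one small model finetuned on each candidate demonstration and scoring candidates by the difference in test-example loss relative to the base model. The model interface, sign convention, and helper names are assumptions for illustration, not the paper's exact implementation.

# Illustrative sketch of cross-entropy-difference style ranking of demonstrations.
# Model choice, tokenization details, and the sign convention are assumptions;
# the actual CED method uses parameter-efficient finetuned models per demonstration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def nll(model, tokenizer, text, device="cpu"):
    """Average negative log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()


def rank_demonstrations(test_input, finetuned_models, base_model, tokenizer):
    """Rank candidate demonstrations by cross-entropy difference on the test input.

    `finetuned_models` maps each candidate demonstration to a small model
    finetuned on that demonstration (e.g., with parameter-efficient adapters).
    """
    base_loss = nll(base_model, tokenizer, test_input)
    scores = {
        demo: nll(model, tokenizer, test_input) - base_loss
        for demo, model in finetuned_models.items()
    }
    # A lower (more negative) difference means the demonstration made the test
    # input more likely, so demonstrations with the smallest scores rank first.
    return sorted(scores, key=scores.get)
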
FAST: Improving Controllability for Text Generation with Feedback Aware Self-Training
Controllable text generation systems often leverage control codes to direct
various properties of the output like style and length. Inspired by recent work
on causal inference for NLP, this paper reveals a previously overlooked flaw in
these control code-based conditional text generation algorithms. Spurious
correlations in the training data can lead models to incorrectly rely on parts
of the input other than the control code for attribute selection, significantly
undermining downstream generation quality and controllability. We demonstrate
the severity of this issue with a series of case studies and then propose two
simple techniques to reduce these correlations in training sets. The first
technique is based on resampling the data according to an example's propensity
towards each linguistic attribute (IPS). The second produces multiple
counterfactual versions of each example and then uses an additional feedback
mechanism to remove noisy examples (feedback aware self-training, FAST). We
evaluate on 3 tasks -- news headline, meta review, and search ads generation --
and demonstrate that FAST can significantly improve the controllability and
language quality of generated outputs when compared to state-of-the-art
controllable text generation approaches.
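
As an illustration of the first technique above, here is a minimal sketch of inverse-propensity-score (IPS) resampling that downweights examples whose non-control-code features already predict the target attribute. The propensity model and feature representation are assumptions for the example; the feedback-aware self-training loop (FAST) is not shown.

# Minimal sketch of inverse-propensity-score (IPS) resampling as described above.
# The propensity model and feature choice are illustrative assumptions, not the
# paper's exact setup; the FAST self-training loop is not shown here.

import numpy as np
from sklearn.linear_model import LogisticRegression


def ips_resample(features, control_codes, rng=None):
    """Resample training examples to break spurious feature/control-code links.

    features:      array of shape (n_examples, n_features) describing each input
                   (excluding the control code itself).
    control_codes: array of shape (n_examples,) with the target attribute label.
    Returns indices of a resampled training set of the same size.
    """
    rng = rng or np.random.default_rng(0)
    # Estimate each example's propensity toward its observed control code.
    clf = LogisticRegression(max_iter=1000).fit(features, control_codes)
    proba = clf.predict_proba(features)
    propensity = proba[np.arange(len(control_codes)),
                       clf.classes_.searchsorted(control_codes)]
    # Sample inversely to propensity so over-represented pairings are downweighted.
    weights = 1.0 / np.clip(propensity, 1e-3, None)
    weights /= weights.sum()
    return rng.choice(len(control_codes), size=len(control_codes),
                      replace=True, p=weights)
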
Automatically Neutralizing Subjective Bias in Text
Texts like news, encyclopedias, and some social media strive for objectivity.
Yet bias in the form of inappropriate subjectivity - introducing attitudes via
framing, presupposing truth, and casting doubt - remains ubiquitous. This kind
of bias erodes our collective trust and fuels social conflict. To address this
issue, we introduce a novel testbed for natural language generation:
automatically bringing inappropriately subjective text into a neutral point of
view ("neutralizing" biased text). We also offer the first parallel corpus of
biased language. The corpus contains 180,000 sentence pairs and originates from
Wikipedia edits that removed various framings, presuppositions, and attitudes
from biased sentences. Last, we propose two strong encoder-decoder baselines
for the task. A straightforward yet opaque CONCURRENT system uses a BERT
encoder to identify subjective words as part of the generation process. An
interpretable and controllable MODULAR algorithm separates these steps, using
(1) a BERT-based classifier to identify problematic words and (2) a novel join
embedding through which the classifier can edit the hidden states of the
encoder. Large-scale human evaluation across four domains (encyclopedias, news
headlines, books, and political speeches) suggests that these algorithms are a
first step towards the automatic identification and reduction of bias.
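
A highly simplified sketch of the detect-then-edit idea behind the MODULAR algorithm: a token-level classifier flags problematic words, and a learned join vector is added to the encoder hidden states of flagged tokens before decoding. The hidden size, threshold, and the exact way the join embedding modifies the hidden states are assumptions, not the paper's architecture.

# Simplified sketch of a detect-then-edit pipeline in the spirit of MODULAR.
# The threshold, hidden size, and the way the join embedding modifies hidden
# states are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn


class JoinEdit(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        # Token-level detector: probability that a token is inappropriately subjective.
        self.detector = nn.Linear(hidden_size, 1)
        # Learned vector injected into the hidden states of flagged tokens.
        self.join_embedding = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, encoder_hidden, threshold=0.5):
        """encoder_hidden: (batch, seq_len, hidden_size) from a BERT-style encoder."""
        bias_prob = torch.sigmoid(self.detector(encoder_hidden))  # (B, T, 1)
        flagged = (bias_prob > threshold).float()                 # (B, T, 1)
        # Edit only the flagged positions; a decoder would then attend to the
        # modified states to generate a neutralized sentence.
        edited = encoder_hidden + flagged * self.join_embedding
        return edited, flagged.squeeze(-1)
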
APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning
Logical reasoning over text is an important ability that requires understanding
the information present in the text and its interconnections, and then reasoning
over them to infer new conclusions. Prior works on improving the logical
reasoning ability of language models require complex processing of training
data (e.g., aligning symbolic knowledge to text), yielding task-specific data
augmentation solutions that restrict the learning of general logical reasoning
skills. In this work, we propose APOLLO, an adaptively pretrained language
model that has improved logical reasoning abilities. We select a subset of
Wikipedia, based on a set of logical inference keywords, for continued
pretraining of a language model. We use two self-supervised loss functions: a
modified masked language modeling loss in which only words with specific parts
of speech, which likely require more reasoning than basic language
understanding, are masked, and a sentence-level classification loss that
teaches the model to distinguish between entailment and contradiction types of
sentences. The proposed training paradigm is both simple and independent of
task formats. We demonstrate the effectiveness of APOLLO by comparing it with
prior baselines on two logical reasoning datasets. APOLLO performs comparably
on ReClor and outperforms baselines on LogiQA. The code base has been made
publicly available.
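
As a toy illustration of the selective masking objective described above, the sketch below masks only words with certain parts of speech, assuming spaCy tags and a BERT-style [MASK] token. The chosen tag set and masking rate are assumptions, not APOLLO's exact selection rule.

# Toy sketch of selective masking by part of speech, in the spirit of APOLLO's
# modified MLM objective. The POS tag set and masking rate are assumptions.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical choice: mask content words that tend to carry reasoning load.
MASKABLE_POS = {"VERB", "ADJ", "ADV", "NOUN"}


def selective_mask(sentence, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Mask only tokens whose part of speech is in MASKABLE_POS."""
    rng = random.Random(seed)
    tokens = []
    for tok in nlp(sentence):
        if tok.pos_ in MASKABLE_POS and rng.random() < mask_prob:
            tokens.append(mask_token)
        else:
            tokens.append(tok.text)
    return " ".join(tokens)


print(selective_mask("If it rains, the ground probably becomes wet."))
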
The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions
Recent progress in Large Language Models (LLMs) has produced models that
exhibit remarkable performance across a variety of NLP tasks. However, it
remains unclear whether the existing focus of NLP research accurately captures
the genuine requirements of human users. This paper provides a comprehensive
analysis of the divergence between current NLP research and the needs of
real-world NLP applications, based on a large-scale collection of real user queries to GPT.
We compare these queries against existing NLP benchmark tasks and identify a
significant gap between the tasks that users frequently request from LLMs and
the tasks that are commonly studied in academic research. For example, we find
that tasks such as ``design'' and ``planning'' are prevalent in user
interactions but are largely neglected by, or differ substantially from, traditional NLP
benchmarks. We investigate these overlooked tasks, dissect the practical
challenges they pose, and provide insights toward a roadmap to make LLMs better
aligned with user needs.
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
A fundamental goal of scientific research is to learn about causal
relationships. However, despite its critical role in the life and social
sciences, causality has not had the same importance in Natural Language
Processing (NLP), which has traditionally placed more emphasis on predictive
tasks. This distinction is beginning to fade, with an emerging area of
interdisciplinary research at the convergence of causal inference and language
processing. Still, research on causality in NLP remains scattered across
domains without unified definitions, benchmark datasets and clear articulations
of the challenges and opportunities in the application of causal inference to
the textual domain, with its unique properties. In this survey, we consolidate
research across academic areas and situate it in the broader NLP landscape. We
introduce the statistical challenge of estimating causal effects with text,
encompassing settings where text is used as an outcome, treatment, or to
address confounding. In addition, we explore potential uses of causal inference
to improve the robustness, fairness, and interpretability of NLP models. We
thus provide a unified overview of causal inference for the NLP community.
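
To make one of the settings above concrete, here is a minimal sketch of controlling for text-based confounding when estimating a treatment effect. The TF-IDF representation and the linear outcome model are simplifying assumptions for illustration only.

# Illustrative sketch of adjusting for text-based confounding, one of the
# settings the survey covers. The embedding choice (TF-IDF) and the linear
# outcome model are simplifying assumptions for the example only.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression


def adjusted_treatment_effect(texts, treatment, outcome):
    """Estimate a treatment effect while controlling for text-derived confounders.

    texts:     list of documents that may confound treatment and outcome.
    treatment: binary array (n,) indicating treatment assignment.
    outcome:   float array (n,) of observed outcomes.
    """
    text_features = TfidfVectorizer(max_features=200).fit_transform(texts).toarray()
    X = np.column_stack([treatment, text_features])
    model = LinearRegression().fit(X, outcome)
    # Coefficient on the treatment column, holding the text features fixed.
    return model.coef_[0]
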