Automatic and Human-AI Interactive Text Generation
In this tutorial, we focus on text-to-text generation, a class of natural
language generation (NLG) tasks that take a piece of text as input and then
generate a revision that is improved according to some specific criteria
(e.g., readability or linguistic styles), while largely retaining the original
meaning and length of the text. This includes many useful applications,
such as text simplification, paraphrase generation, and style transfer. In
contrast to text summarization and open-ended text completion (e.g., story generation),
the text-to-text generation tasks we discuss in this tutorial are more
constrained in terms of semantic consistency and targeted language styles. This
level of control makes these tasks ideal testbeds for studying the ability of
models to generate text that is both semantically adequate and stylistically
appropriate. Moreover, these tasks are interesting from a technical standpoint,
as they require complex combinations of lexical and syntactic
transformations, stylistic control, and adherence to factual knowledge -- all
at once. With a special focus on text simplification and revision, this
tutorial aims to provide an overview of the state-of-the-art natural language
generation research from four major aspects -- Data, Models, Human-AI
Collaboration, and Evaluation -- and to discuss and showcase a few significant
and recent advances: (1) the use of non-autoregressive approaches; (2) the shift
from fine-tuning to prompting with large language models; (3) the development
of new learnable metrics and fine-grained human evaluation frameworks; (4) a
growing body of studies and datasets on non-English languages; (5) the rise of
HCI+NLP+Accessibility interdisciplinary research to create real-world writing
assistant systems.
Comment: To appear at ACL 2024, Tutorial
Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment
The interactive nature of Large Language Models (LLMs) theoretically allows
models to refine and improve their answers, yet systematic analysis of the
multi-turn behavior of LLMs remains limited. In this paper, we propose the
FlipFlop experiment: in the first round of the conversation, an LLM responds to
a prompt containing a classification task. In a second round, the LLM is
challenged with a follow-up phrase like "Are you sure?", offering the model an
opportunity to reflect on its initial answer and decide whether
to confirm or flip its answer. A systematic study of nine LLMs on seven
classification tasks reveals that models flip their answers on average 46% of
the time and that all models see a deterioration of accuracy between their
first and final prediction, with an average drop of 17%. The FlipFlop
experiment illustrates the universality of sycophantic behavior in LLMs and
provides a robust framework to analyze model behavior and evaluate potential
solutions.
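Below is a minimal sketch of the two-turn FlipFlop protocol described above, written against a generic `chat(messages)` callback standing in for any chat-completion API (a hypothetical name, not the paper's code); it records both answers and computes the flip rate and the accuracy drop between the first and final predictions.

```python
# Minimal sketch of the FlipFlop protocol. `chat` is a hypothetical stand-in
# for any chat-completion API; this is not the paper's implementation.
from typing import Callable, Dict, List

Message = Dict[str, str]

def flipflop_trial(chat: Callable[[List[Message]], str],
                   task_prompt: str,
                   challenge: str = "Are you sure?") -> Dict[str, str]:
    """Run one two-turn FlipFlop conversation and return both answers."""
    messages = [{"role": "user", "content": task_prompt}]
    first = chat(messages)                       # round 1: initial classification
    messages += [{"role": "assistant", "content": first},
                 {"role": "user", "content": challenge}]
    final = chat(messages)                       # round 2: confirm or flip
    return {"first": first, "final": final}

def flipflop_metrics(results: List[Dict[str, str]], gold: List[str]) -> Dict[str, float]:
    """Flip rate and accuracy drop across a set of trials."""
    flips = sum(r["first"] != r["final"] for r in results)
    acc_first = sum(r["first"] == g for r, g in zip(results, gold)) / len(gold)
    acc_final = sum(r["final"] == g for r, g in zip(results, gold)) / len(gold)
    return {"flip_rate": flips / len(results),
            "accuracy_drop": acc_first - acc_final}
```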
Salespeople vs SalesBot: Exploring the Role of Educational Value in Conversational Recommender Systems
Making big purchases requires consumers to do their own research or consult a salesperson
to gain domain expertise. However, existing conversational recommender systems
(CRS) often overlook users' lack of background knowledge, focusing solely on
gathering preferences. In this work, we define a new problem space for
conversational agents that aim to provide both product recommendations and
educational value through mixed-type mixed-initiative dialog. We introduce
SalesOps, a framework that facilitates the simulation and evaluation of such
systems by leveraging recent advancements in large language models (LLMs). We
build SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate
either side of the framework. A comprehensive human study compares SalesBot
against professional salespeople, revealing that although SalesBot approaches
professional performance in terms of fluency and informativeness, it lags
behind in recommendation quality. We emphasize the distinct limitations both
face in providing truthful information, highlighting the challenges of ensuring
faithfulness in the CRS context. We release our code and make all data
available.
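As an illustration of how two LLM-powered agents could simulate both sides of such a dialog, here is a small sketch; the `complete` callback, the persona prompts, and the `[END]` stop marker are assumptions, not the released SalesOps code.

```python
# Sketch of simulating a recommendation dialog between two LLM-backed agents.
# `complete` is a hypothetical text-completion call; not the SalesOps release.
from typing import Callable, List, Tuple

def simulate_dialog(complete: Callable[[str], str],
                    seller_persona: str,
                    shopper_persona: str,
                    max_turns: int = 10) -> List[Tuple[str, str]]:
    history: List[Tuple[str, str]] = []

    def render(persona: str) -> str:
        transcript = "\n".join(f"{who}: {utt}" for who, utt in history)
        return f"{persona}\n{transcript}\nYour next message:"

    for turn in range(max_turns):
        speaker, persona = (("SalesBot", seller_persona) if turn % 2 == 0
                            else ("ShopperBot", shopper_persona))
        utterance = complete(render(persona))
        history.append((speaker, utterance))
        if "[END]" in utterance:   # assumed stop marker emitted by an agent
            break
    return history
```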
Art or Artifice? Large Language Models and the False Promise of Creativity
Researchers have argued that large language models (LLMs) exhibit
high-quality writing capabilities across genres, from blogs to stories. However,
objectively evaluating the creativity of a piece of writing is challenging. Inspired by
the Torrance Test of Creative Thinking (TTCT), which measures creativity as a
process, we use the Consensual Assessment Technique [3] and propose the
Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of
Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative
writers and implement a TTCW-based human assessment of 48 stories written either
by professional authors or by LLMs. Our analysis shows that LLM-generated
stories pass 3-10x fewer TTCW tests than stories written by professionals. In
addition, we explore the use of LLMs as assessors to automate the TTCW
evaluation, revealing that none of the LLMs' assessments positively correlate
with those of the experts.
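A small sketch of how TTCW-style outcomes could be aggregated, assuming each story already comes with its 14 binary test results from assessors; the grouping of tests into the four dimensions below is illustrative, not the paper's exact layout.

```python
# Sketch of aggregating TTCW-style binary test outcomes per story.
# The 14-test split across the four named dimensions is illustrative only.
from statistics import mean
from typing import Dict, List, Sequence

DIMENSIONS: Dict[str, List[int]] = {   # illustrative grouping of the 14 tests
    "Fluency": [0, 1, 2, 3],
    "Flexibility": [4, 5, 6],
    "Originality": [7, 8, 9, 10],
    "Elaboration": [11, 12, 13],
}

def tests_passed(outcomes: Sequence[bool]) -> int:
    """Total number of passed tests for one story (14 booleans)."""
    return sum(outcomes)

def dimension_scores(outcomes: Sequence[bool]) -> Dict[str, float]:
    """Fraction of tests passed within each TTCW dimension."""
    return {dim: mean(outcomes[i] for i in idx) for dim, idx in DIMENSIONS.items()}

def pass_ratio(professional: List[Sequence[bool]], llm: List[Sequence[bool]]) -> float:
    """Ratio of average tests passed: professional stories vs. LLM stories."""
    prof = mean(tests_passed(o) for o in professional)
    gen = mean(tests_passed(o) for o in llm)
    return prof / gen if gen else float("inf")
```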
SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages
Text simplification research has mostly focused on sentence-level
simplification, even though many desirable edits -- such as adding relevant
background information or reordering content -- may require document-level
context. Prior work has also predominantly framed simplification as a
single-step, input-to-output task, only implicitly modeling the fine-grained,
span-level edits that elucidate the simplification process. To address both
gaps, we introduce the SWiPE dataset, which reconstructs the document-level
editing process from English Wikipedia (EW) articles to paired Simple Wikipedia
(SEW) articles. In contrast to prior work, SWiPE leverages the entire revision
history when pairing pages in order to better identify simplification edits. We
work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling
more than 40,000 edits with 19 proposed categories. To scale our efforts, we
propose several models to automatically label edits, achieving an F-1 score of
up to 70.6, indicating that this is a tractable but challenging NLU task.
Finally, we categorize the edits produced by several simplification models and
find that SWiPE-trained models generate more complex edits while reducing
unwanted edits.
Comment: ACL 2023, Long Paper
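The span-level labeling task suggests an evaluation along these lines: a sketch of micro-averaged F-1 over (span, category) edit pairs. The tuple representation and example labels are assumptions, not the SWiPE release format.

```python
# Sketch of micro-averaged F-1 for edit-category labeling, treating each edit
# as a (start, end, category) tuple. Format and labels are illustrative.
from collections import Counter
from typing import Iterable, Tuple

Edit = Tuple[int, int, str]   # (start offset, end offset, category)

def edit_f1(gold: Iterable[Edit], pred: Iterable[Edit]) -> float:
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = sum((gold_c & pred_c).values())          # exactly matching labeled spans
    precision = tp / max(sum(pred_c.values()), 1)
    recall = tp / max(sum(gold_c.values()), 1)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Usage: two of three predicted edits match the gold annotations -> F-1 = 0.67
gold = [(0, 12, "lexical"), (30, 48, "deletion"), (60, 75, "reorder")]
pred = [(0, 12, "lexical"), (30, 48, "deletion"), (60, 75, "elaboration")]
print(f"edit-label F-1: {edit_f1(gold, pred):.2f}")
```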
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
Large language models (LLMs) have shown impressive performance in following
natural language instructions to solve unseen tasks. However, it remains
unclear whether models truly understand task definitions and whether the
human-written definitions are optimal. In this paper, we systematically study
the role of task definitions in instruction learning. We first conduct an
ablation analysis informed by human annotations to understand which parts of a
task definition are most important, and find that model performance only drops
substantially when removing content describing the task output, in particular
label information. Next, we propose an automatic algorithm to compress task
definitions to a minimal supporting set of tokens, and find that 60% of tokens
can be removed while maintaining or even improving model performance. Based on
these results, we propose two strategies to help models better leverage task
instructions: (1) providing only key information for tasks in a common
structured format, and (2) adding a meta-tuning stage to help the model better
understand the definitions. With these two strategies, we achieve a 4.2 Rouge-L
improvement across 119 unseen test tasks.
Comment: ACL 2023, camera-ready; 10 pages
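A greedy sketch of the compression idea described above, assuming an `evaluate(definition)` callback (e.g., dev-set Rouge-L for a model prompted with that definition); this illustrates the general approach, not the paper's exact algorithm.

```python
# Greedy sketch of compressing a task definition to a smaller supporting token
# set. `evaluate` is an assumed scoring callback; not the paper's algorithm.
from typing import Callable, List

def compress_definition(tokens: List[str],
                        evaluate: Callable[[str], float],
                        tolerance: float = 0.0) -> List[str]:
    baseline = evaluate(" ".join(tokens))
    kept = list(tokens)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]       # try dropping token i
        if evaluate(" ".join(candidate)) >= baseline - tolerance:
            kept = candidate                       # removal did not hurt; drop it
        else:
            i += 1                                 # token is needed; keep it
    return kept
```

A single greedy pass like this is enough to show why a large fraction of definition tokens can be redundant, though the paper's own procedure may differ.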
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles
Previous research in multi-document news summarization has typically
concentrated on collating information that all sources agree upon. However, to
our knowledge, the summarization of diverse information dispersed across
multiple articles about an event has not been previously investigated. The
latter poses a different set of challenges for a summarization model. In this
paper, we propose a new task of summarizing diverse information encountered in
multiple news articles encompassing the same event. To facilitate this task, we
outlined a data collection schema for identifying diverse information and
curated a dataset named DiverseSumm. The dataset includes 245 news stories,
with each story comprising 10 news articles and paired with a human-validated
reference. Moreover, we conducted a comprehensive analysis to pinpoint the
position and verbosity biases when utilizing Large Language Model (LLM)-based
metrics for evaluating the coverage and faithfulness of the summaries, as well
as their correlation with human assessments. We applied our findings to study
how LLMs summarize multiple news articles by analyzing which type of diverse
information LLMs are capable of identifying. Our analyses suggest that despite
the extraordinary capabilities of LLMs in single-document summarization, the
proposed task remains a complex challenge for them mainly due to their limited
coverage, with GPT-4 able to cover less than 40% of the diverse
information on average.
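A minimal sketch of the coverage notion above: the fraction of human-validated information units from the reference that a generated summary contains, with the per-unit judgment delegated to an assumed `covers(summary, unit)` callback such as an NLI model or an LLM judge; this is not the paper's metric code.

```python
# Sketch of a coverage measure: fraction of reference information units that
# a summary contains. `covers` is an assumed per-unit judgment callback.
from typing import Callable, List

def coverage(summary: str,
             reference_units: List[str],
             covers: Callable[[str, str], bool]) -> float:
    if not reference_units:
        return 1.0
    hits = sum(covers(summary, unit) for unit in reference_units)
    return hits / len(reference_units)
```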
Next Steps for Human-Centered Generative AI: A Technical Perspective
Through iterative, cross-disciplinary discussions, we define and propose
next steps for Human-centered Generative AI (HGAI) from a technical
perspective. We contribute a roadmap that lays out future directions of
Generative AI spanning three levels: Aligning with human values; Accommodating
humans' expression of intent; and Augmenting humans' abilities in a
collaborative workflow. This roadmap is intended to draw interdisciplinary
research teams toward a comprehensive list of emergent ideas in HGAI, helping
them identify topics of interest while maintaining a coherent big picture of
the future work landscape.