Towards Inter-character Relationship-driven Story Generation
In this paper, we introduce the task of modeling interpersonal relationships
for story generation. To address this task, we propose Relationships as Latent
Variables for Story Generation (ReLiSt). ReLiSt generates stories sentence by
sentence and has two major components: a relationship selector and a story
continuer. The relationship selector uses a latent variable to pick the
relationship to exhibit in the next sentence, and the story continuer generates
the next sentence while expressing the selected relationship coherently. Our
automatic and human evaluations demonstrate that ReLiSt generates stories that
are more faithful to the desired relationships while maintaining content
quality. The relationship assignments made to sentences during inference make
ReLiSt interpretable.
Comment: EMNLP 2022
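As a rough illustration of the sentence-by-sentence loop described above, here is a minimal Python sketch of relationship-conditioned generation. The relationship inventory and the select_relationship / continue_story interfaces are hypothetical stand-ins, not the authors' actual models.

```python
# Hypothetical sketch of a ReLiSt-style generation loop: a selector picks a
# latent relationship for each sentence, and a continuer realizes it in text.
RELATIONSHIPS = ["friends", "rivals", "strangers", "family"]

def generate_story(prompt, select_relationship, continue_story, num_sentences=5):
    story = [prompt]
    assignments = []  # per-sentence latent choices, usable for interpretability
    for _ in range(num_sentences):
        context = " ".join(story)
        rel = select_relationship(context, RELATIONSHIPS)  # latent variable
        sentence = continue_story(context, rel)  # express rel coherently
        story.append(sentence)
        assignments.append(rel)
    return " ".join(story), assignments
```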
Creativity Support in the Age of Large Language Models: An Empirical Study Involving Emerging Writers
The development of large language models (LLMs) capable of following
instructions and engaging in conversational interactions has sparked increased
interest in their utilization across various support tools. We investigate the
utility of modern LLMs in assisting professional writers via an empirical user
study (n=30). The design of our collaborative writing interface is grounded in
the cognitive process model of writing that views writing as a goal-oriented
thinking process encompassing non-linear cognitive activities: planning,
translating, and reviewing. Participants are asked to submit a post-completion
survey to provide feedback on the potential and pitfalls of LLMs as writing
collaborators. Upon analyzing the writer-LLM interactions, we find that while
writers seek the LLM's help across all three types of cognitive activities,
they find LLMs more helpful in translating and reviewing. Our findings from
analyzing both the interactions and the survey responses highlight future
research directions in creative writing assistance using LLMs.
NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization
Narrative summarization aims to produce a distilled version of a narrative to
describe its most salient events and characters. Summarizing a narrative is
challenging as it requires an understanding of event causality and character
behaviors. To encourage research in this direction, we propose NarraSum, a
large-scale narrative summarization dataset. It contains 122K narrative
documents, which are collected from plot descriptions of movies and TV episodes
with diverse genres, and their corresponding abstractive summaries. Experiments
show that there is a large performance gap between humans and the
state-of-the-art summarization models on NarraSum. We hope that this dataset
will promote future research in summarization, as well as broader studies of
natural language understanding and generation. The dataset is available at
https://github.com/zhaochaocs/narrasum.
Comment: EMNLP Findings 2022
REV: Information-Theoretic Evaluation of Free-Text Rationales
Generating free-text rationales is a promising step towards explainable NLP,
yet evaluating such rationales remains a challenge. Existing metrics have
mostly focused on measuring the association between the rationale and a given
label. We argue that an ideal metric should focus on the new information
uniquely provided in the rationale that is otherwise not provided in the input
or the label. We investigate this research problem from an
information-theoretic perspective using conditional V-information (Hewitt et
al., 2021). More concretely, we propose a metric called REV (Rationale
Evaluation with conditional V-information) to quantify the amount of new,
label-relevant information in a rationale beyond the information already
available in the input or the label. Experiments across four benchmarks with
reasoning tasks, including chain-of-thought, demonstrate the effectiveness of
REV in evaluating rationale-label pairs, compared to existing metrics. We
further demonstrate REV is consistent with human judgments on rationale
evaluations and provides more sensitive measurements of new information in
free-text rationales. When used alongside traditional performance metrics, REV
provides deeper insights into models' reasoning and prediction processes.
Comment: ACL 2023
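For intuition, the pointwise quantity behind conditional V-information-style metrics like REV can be sketched as the gain in an evaluator's log-probability of the gold label once the rationale is added to the conditioning. This is an illustrative simplification; the paper's exact evaluator construction (e.g., its baseline rationale) may differ.

```python
import math

# Illustrative only: score a rationale by how much it raises an evaluator's
# log-probability of the gold label beyond conditioning on the input alone.
def rev_style_score(p_label_given_input, p_label_given_input_and_rationale):
    """Both arguments are the evaluator's probability of the gold label."""
    return (math.log(p_label_given_input_and_rationale)
            - math.log(p_label_given_input))

# A rationale that merely restates the input contributes ~0; one that adds
# genuinely new, label-relevant evidence yields a positive score.
print(rev_style_score(0.4, 0.9))  # > 0: the rationale added information
print(rev_style_score(0.4, 0.4))  # = 0: no new information
```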
STEER: Unified Style Transfer with Expert Reinforcement
While text style transfer has many applications across natural language
processing, the core premise of transferring from a single source style is
unrealistic in a real-world setting. In this work, we focus on arbitrary style
transfer: rewriting a text from an arbitrary, unknown style to a target style.
We propose STEER: Unified Style Transfer with Expert Reinforcement, a unified
framework developed to overcome the challenge of limited parallel data for
style transfer. STEER involves automatically generating a corpus of
style-transfer pairs using a product of experts during decoding. The generated
offline data is then used to pre-train an initial policy before switching to
online, off-policy reinforcement learning for further improvements via
fine-grained reward signals. STEER is unified and can transfer to multiple
target styles from an arbitrary, unknown source style, making it particularly
flexible and efficient.
Experimental results on a challenging dataset with text from a diverse set of
styles demonstrate state-of-the-art results compared to competitive baselines.
Remarkably, STEER outperforms the 175B-parameter instruction-tuned GPT-3 on
overall style transfer quality, despite being 226 times smaller. We also show
that STEER is robust, maintaining its style transfer capabilities on
out-of-domain data and surpassing nearly all baselines across various styles.
The success of our method highlights the potential of RL algorithms augmented
with controllable decoding to overcome the challenge of limited data
supervision.
Comment: for associated code, see https://github.com/shallinan1/STEERStyleTransfe
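The product-of-experts decoding step can be sketched as follows: next-token logits from a base LM are shifted toward a target-style expert and away from an anti-expert before sampling. This is a hedged approximation in the spirit of the abstract; the paper's exact expert parameterization may differ.

```python
import numpy as np

def product_of_experts_logits(base, expert, antiexpert, alpha=1.0):
    # Shift base next-token logits toward the style expert and away from
    # the anti-expert; alpha controls the strength of the steering.
    return base + alpha * (expert - antiexpert)

def sample_next_token(base, expert, antiexpert, rng, alpha=1.0):
    logits = product_of_experts_logits(base, expert, antiexpert, alpha)
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy usage with random logits over a 5-token vocabulary.
rng = np.random.default_rng(0)
v = 5
token = sample_next_token(rng.normal(size=v), rng.normal(size=v),
                          rng.normal(size=v), rng)
```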
Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models
Reinforcement Learning with Human Feedback (RLHF) is the most prominent
method for Language Model (LM) alignment. However, RLHF is an unstable and
data-hungry process that continually requires new high-quality LM-generated
data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new
class of offline policy gradient algorithms that enable RL training on any
pre-existing data. By treating the entire LM output sequence as a single
action, A-LoL allows incorporating sequence-level classifiers or human-designed
scoring functions as rewards. Subsequently, by using the LM's value estimate,
A-LoL trains only on positive-advantage (leftover) data points, making it resilient
to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable
LM training recipe.
We demonstrate the effectiveness of A-LoL and its variants with a set of four
different language generation tasks. We compare against both online RL (PPO)
and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL
baselines. On the commonly-used RLHF benchmark, Helpful and Harmless Assistant
(HHA), LMs trained with A-LoL methods achieve the highest diversity while also
being rated more safe and helpful than the baselines according to humans.
Additionally, in the remaining three tasks, A-LoL was able to optimize multiple
distinct reward functions even when using noisy or suboptimal training data.
We also release our experimental code at https://github.com/abaheti95/LoL-RL.
Comment: published at ICLR 2024
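The core training rule can be sketched compactly: treat the whole output sequence as one action, form a sequence-level advantage as reward minus the LM's value estimate, and keep only positive-advantage points. The sketch below omits details such as importance weighting and is illustrative, not the authors' exact recipe.

```python
import torch

def a_lol_style_loss(seq_logprob, reward, value_estimate):
    # Sequence-level advantage; non-positive points are discarded
    # ("leftover" data), which makes training resilient to noise.
    advantage = reward - value_estimate
    if advantage <= 0:
        return None
    return -advantage * seq_logprob  # advantage-weighted policy gradient

# Toy usage: log pi(y|x), summed over the tokens of one output sequence.
seq_logprob = torch.tensor(-12.3, requires_grad=True)
loss = a_lol_style_loss(seq_logprob, reward=0.8, value_estimate=0.5)
if loss is not None:
    loss.backward()
```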
Affective and Dynamic Beam Search for Story Generation
Storytelling's captivating potential makes it a fascinating research area,
with implications for entertainment, education, therapy, and cognitive studies.
In this paper, we propose Affective Story Generator (AffGen) for generating
interesting narratives. AffGen introduces "intriguing twists" in narratives by
employing two novel techniques: Dynamic Beam Sizing and Affective Reranking.
Dynamic Beam Sizing encourages less predictable, more captivating word choices
using a contextual multi-armed bandit model. Affective Reranking prioritizes
sentence candidates based on affect intensity. Our empirical evaluations, both
automatic and human, demonstrate AffGen's superior performance over existing
baselines in generating affectively charged and interesting narratives. Our
ablation study and analysis provide insights into the strengths and weaknesses
of AffGen.
Comment: Accepted at EMNLP Findings 2023
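Affective Reranking, as described above, can be approximated by re-scoring beam candidates with a weighted mix of fluency and affect intensity. The affect_intensity scorer below is a hypothetical stand-in for a lexicon- or classifier-based model, and the weighting scheme is illustrative.

```python
def affective_rerank(candidates, affect_intensity, weight=0.5):
    """Rerank (sentence, lm_logprob) candidates by a weighted mix of
    language-model fluency and affect intensity (higher is better)."""
    return sorted(
        candidates,
        key=lambda c: (1 - weight) * c[1] + weight * affect_intensity(c[0]),
        reverse=True,
    )

# Toy usage with a trivial exclamation-count affect proxy.
cands = [("They walked home.", -4.0), ("The house burst into flames!", -6.0)]
print(affective_rerank(cands, lambda s: s.count("!"), weight=0.7))
```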
Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization
How-to procedures, such as how to plant a garden, are now used by millions of
users, but sometimes need customizing to meet a user's specific needs, e.g.,
planting a garden without pesticides. Our goal is to measure and improve an
LLM's ability to perform such customization. Our approach is to test several
simple multi-LLM-agent architectures for customization, as well as an
end-to-end LLM, using a new evaluation set, called CustomPlans, of over 200
WikiHow procedures, each with a customization need. We find that a simple
architecture with two LLM agents used sequentially performs best: one edits a
generic how-to procedure, and the other verifies its executability. This
architecture significantly outperforms an end-to-end prompted LLM (by 10.5%
absolute), suggesting that LLMs can be configured reasonably effectively for
procedure customization. It also suggests that multi-agent editing
architectures may be worth exploring further for other customization
applications (e.g., coding, creative writing) in the future.
Comment: Camera-ready version accepted to Findings of ACL 2024
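The winning two-agent configuration can be sketched as a simple sequential pipeline: an edit agent customizes the generic procedure, then a verifier agent checks executability. call_llm below is a hypothetical stand-in for any chat-completion call, and the prompts are illustrative, not the paper's.

```python
def customize_procedure(generic_procedure, need, call_llm):
    # Agent 1: edit the generic how-to procedure to satisfy the need.
    edited = call_llm(
        "Edit this how-to procedure so it satisfies the user's need.\n"
        f"Need: {need}\nProcedure:\n{generic_procedure}")
    # Agent 2: verify that every step of the edited procedure is executable.
    verdict = call_llm(
        "Check that every step of this procedure is executable and "
        f"consistent with the need '{need}'. Answer YES or NO with reasons:\n"
        f"{edited}")
    return edited, verdict
```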
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
We introduce Lumos, a novel framework for training language agents that
employs a unified data format and a modular architecture based on open-source
large language models (LLMs). Lumos consists of three distinct modules:
planning, grounding, and execution. The planning module breaks down a task into
a series of high-level, tool-agnostic subgoals, which are then made specific by
the grounding module through a set of low-level actions. These actions are
subsequently executed by the execution module, utilizing a range of
off-the-shelf tools and APIs. In order to train these modules effectively,
high-quality annotations of subgoals and actions were collected and are made
available for fine-tuning open-source LLMs for various tasks such as complex
question answering, web tasks, and math problems. Leveraging this unified data
and modular design, Lumos not only achieves comparable or superior performance
to current state-of-the-art agents, but also exhibits several key advantages:
(1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and
web tasks, while matching the performance of significantly larger LLM agents
on math tasks; (2) Lumos outperforms open-source agents created through
conventional training methods and those trained with chain-of-thought
supervision; and (3) Lumos effectively generalizes to unseen interactive
tasks, outperforming larger LLM-based agents and even exceeding the
performance of specialized agents.
Comment: Project website: https://allenai.github.io/lumos
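The three-module design can be read as a simple pipeline, sketched below with hypothetical planner/grounder/executor interfaces (the real modules are fine-tuned open-source LLMs plus tool wrappers).

```python
def run_lumos_style_agent(task, planner, grounder, executor):
    """Illustrative skeleton of the planning/grounding/execution loop."""
    results = []
    # Planning module: decompose the task into tool-agnostic subgoals.
    for subgoal in planner(task):
        # Grounding module: translate the subgoal into low-level actions,
        # conditioned on what has been executed so far.
        for action in grounder(task, subgoal, results):
            # Execution module: run the action with off-the-shelf tools/APIs.
            results.append(executor(action))
    return results
```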
MacGyver: Are Large Language Models Creative Problem Solvers?
We explore the creative problem-solving capabilities of modern LLMs in a
novel constrained setting. To this end, we create MACGYVER, an automatically
generated dataset consisting of over 1,600 real-world problems deliberately
designed to trigger innovative usage of objects and necessitate out-of-the-box
thinking. We then present our collection to both LLMs and humans to compare and
contrast their problem-solving abilities. MACGYVER is challenging for both
groups, but in unique and complementary ways. For instance, humans excel in
tasks they are familiar with but struggle with domain-specific knowledge,
leading to higher variance. In contrast, LLMs, which have been exposed to a
wide variety of specialized knowledge, attempt broader problems but fail by
proposing physically infeasible actions. Finally, we provide a detailed error analysis of
LLMs, and demonstrate the potential of enhancing their problem-solving ability
with novel prompting techniques such as iterative step-wise reflection and
divergent-convergent thinking.
This work (1) introduces a fresh arena for intelligent agents focusing on
intricate aspects of physical reasoning, planning, and unconventional thinking,
which supplements the existing spectrum of machine intelligence; and (2)
provides insight into the constrained problem-solving capabilities of both
humans and AI.
Comment: NAACL 2024
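One of the prompting techniques mentioned, iterative step-wise reflection, might look roughly like the loop below. call_llm is a hypothetical stand-in, and the prompts are illustrative rather than the paper's.

```python
def solve_with_stepwise_reflection(problem, call_llm, max_steps=5):
    steps = []
    for i in range(max_steps):
        step = call_llm(f"Problem: {problem}\nSteps so far: {steps}\n"
                        f"Propose step {i + 1}, or reply DONE.")
        if "DONE" in step:
            break
        # Reflection: check the step's physical feasibility before committing.
        critique = call_llm("Is this step physically feasible with only the "
                            f"available objects? Step: {step}")
        if critique.strip().lower().startswith("no"):
            step = call_llm(f"Revise the step given this critique: {critique}")
        steps.append(step)
    return steps
```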