Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models
Despite recent success in large language model (LLM) reasoning, LLMs struggle
with hierarchical multi-step reasoning tasks like generating complex programs.
For these tasks, humans often start with a high-level algorithmic design and
implement each part gradually. We introduce Parsel, a framework enabling
automatic implementation and validation of complex algorithms with code LLMs,
taking hierarchical function descriptions in natural language as input. We show
that Parsel can be used across domains requiring hierarchical reasoning,
including program synthesis, robotic planning, and theorem proving. We show
that LLMs generating Parsel solve more competition-level problems in the APPS
dataset, resulting in pass rates that are over 75% higher than prior results
from directly sampling AlphaCode and Codex, while often using a smaller sample
budget. We also find that LLM-generated robotic plans using Parsel as an
intermediate language are more than twice as likely to be considered accurate
as directly generated plans. Lastly, we explore how Parsel addresses LLM
limitations and discuss how Parsel may be useful for human programmers.
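To make the decompositional idea concrete, here is a minimal sketch, not the
authors' implementation: a tree of natural-language function specs is
implemented leaf-first, with each candidate body validated against unit-test
constraints. All names here are hypothetical, `sample_code` stands in for any
code LLM, and the greedy depth-first search is a simplification of Parsel's
joint search over combinations of candidate implementations.

```python
# A greedy sketch of the (de-)compositional idea, not Parsel itself (which
# searches jointly over combinations of candidate implementations). All
# names are hypothetical; `sample_code` stands in for any code LLM.
from dataclasses import dataclass, field

@dataclass
class FunctionSpec:
    name: str                                     # function name
    description: str                              # natural-language spec
    tests: list                                   # (args, expected) constraints
    children: list = field(default_factory=list)  # sub-functions it may call

def sample_code(spec: FunctionSpec, n: int = 4) -> list:
    """Hypothetical stub: ask a code LLM for n candidate implementations of
    `spec`, given its description and its children's signatures."""
    raise NotImplementedError("plug in a code LLM here")

def implement(spec: FunctionSpec, namespace: dict) -> bool:
    """Depth-first: implement each child, then try candidate bodies for
    `spec` until its test constraints pass in the shared namespace."""
    for child in spec.children:
        if not implement(child, namespace):
            return False
    for candidate in sample_code(spec):
        try:
            exec(candidate, namespace)            # define spec.name
            fn = namespace[spec.name]
            if all(fn(*args) == expected for args, expected in spec.tests):
                return True                       # constraints satisfied
        except Exception:
            pass                                  # discard failing candidate
    return False
```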
Hypothesis Search: Inductive Reasoning with Language Models
Inductive reasoning is a core problem-solving capacity: humans can identify
underlying principles from a few examples, which can then be robustly
generalized to novel scenarios. Recent work has evaluated large language models
(LLMs) on inductive reasoning tasks by directly prompting them, a setup often
termed "in-context learning." This can work well for straightforward inductive tasks, but
performs very poorly on more complex tasks such as the Abstraction and
Reasoning Corpus (ARC). In this work, we propose to improve the inductive
reasoning ability of LLMs by generating explicit hypotheses at multiple levels
of abstraction: we prompt the LLM to propose multiple abstract hypotheses about
the problem, in natural language, then implement the natural language
hypotheses as concrete Python programs. These programs can be directly verified
by running on the observed examples and generalized to novel inputs. Because of
the prohibitive cost of generation with state-of-the-art LLMs, we add an
intermediate step to filter the set of hypotheses before they are implemented
as programs: we either ask the LLM to summarize them into a smaller set of hypotheses,
or ask human annotators to select a subset of the hypotheses. We verify our
pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its
variant 1D-ARC, and the string transformation dataset SyGuS. On a random 40-problem
subset of ARC, our automated pipeline using LLM summaries achieves 27.5%
accuracy, significantly outperforming the direct prompting baseline (accuracy
of 12.5%). With the minimal human input of selecting from LLM-generated
candidates, the performance is boosted to 37.5%. (And we argue this is a lower
bound on the performance of our approach without filtering.) Our ablation
studies show that abstract hypothesis generation and concrete program
representations are both beneficial for LLMs to perform inductive reasoning
tasks.
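A minimal sketch of the generate-filter-implement-verify loop described above
may help; it is not the paper's code. The three stubs below stand in for LLM
calls and are hypothetical; only the verification step, which runs each
candidate program on the observed examples, is concrete.

```python
# A sketch of the generate-filter-implement-verify pipeline, not the paper's
# code. The three stubs below stand in for LLM calls and are hypothetical.

def propose_hypotheses(examples, n=16):
    """Hypothetical stub: prompt an LLM for n natural-language hypotheses
    about the rule mapping inputs to outputs in `examples`."""
    raise NotImplementedError("plug in an LLM call here")

def summarize(hypotheses, k=4):
    """Hypothetical stub: ask the LLM to condense the candidates into k
    hypotheses -- the filtering step that keeps generation costs down."""
    raise NotImplementedError("plug in an LLM call here")

def implement(hypothesis):
    """Hypothetical stub: ask the LLM to turn one natural-language
    hypothesis into Python source that defines `transform(x)`."""
    raise NotImplementedError("plug in an LLM call here")

def search(train_examples, test_input):
    """Return the output of the first program that is consistent with
    every training example, applied to the unseen test input."""
    for hypothesis in summarize(propose_hypotheses(train_examples)):
        namespace = {}
        try:
            exec(implement(hypothesis), namespace)  # hypothesis -> program
            transform = namespace["transform"]
            if all(transform(x) == y for x, y in train_examples):
                return transform(test_input)        # verified, now generalize
        except Exception:
            continue                                # crashing or inconsistent
    return None                                     # nothing survived
```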
Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency
Developing an educational test can be expensive and time-consuming, as each
item must be written by experts and then evaluated by collecting hundreds of
student responses. Moreover, many tests require multiple distinct sets of
questions, known as parallel tests, administered throughout the school year to
closely monitor students' progress. In this study, we focus on tests of silent
sentence reading efficiency, used to assess students' reading ability over
time. To generate high-quality parallel tests, we propose to fine-tune large
language models (LLMs) to simulate how previous students would have responded
to unseen items. With these simulated responses, we can estimate each item's
difficulty and ambiguity. We first use GPT-4 to generate new test items
following a list of expert-developed rules and then apply a fine-tuned LLM to
filter the items based on criteria from psychological measurements. We also
propose an optimal-transport-inspired technique for generating parallel tests
and show the generated tests closely correspond to the original test's
difficulty and reliability based on crowdworker responses. Our evaluation of a
generated test with 234 students from grades 2 to 8 produces test scores highly
correlated (r=0.93) with those of a standard test form written by human experts
and evaluated across thousands of K-12 students.
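As a rough illustration of the pipeline, assuming simulated responses are
already available, the sketch below estimates each item's difficulty as the
simulated proportion-incorrect and assembles a parallel form by greedily
matching new items to original items by difficulty. The greedy matching is a
deliberately cheap stand-in for the paper's optimal-transport-inspired
technique, and all names and data are hypothetical.

```python
# A runnable sketch, not the paper's method. Difficulty is the simulated
# proportion-incorrect per item; greedy one-to-one matching stands in for
# the optimal-transport-inspired assembly step.

def item_difficulty(simulated_responses):
    """Fraction of simulated students answering the item incorrectly
    (responses are 1 for correct, 0 for incorrect)."""
    return round(1 - sum(simulated_responses) / len(simulated_responses), 3)

def assemble_parallel_test(original, candidates):
    """For each original item, pick the unused candidate whose estimated
    difficulty is closest -- a cheap stand-in for an OT matching."""
    pool = dict(candidates)              # item text -> estimated difficulty
    form = []
    for _, target in original:
        best = min(pool, key=lambda item: abs(pool[item] - target))
        form.append((best, pool.pop(best)))
    return form

# Toy usage with hypothetical items and simulated responses.
original = [("Dogs can bark.", 0.10), ("Glass is edible.", 0.45)]
candidates = [
    ("Fish can swim.", item_difficulty([1, 1, 1, 1, 0, 1, 1, 1, 1, 1])),
    ("Ice is warm.",   item_difficulty([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])),
]
print(assemble_parallel_test(original, candidates))
# [('Fish can swim.', 0.1), ('Ice is warm.', 0.4)]
```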
Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics
Few images on the Web receive alt-text descriptions that would make them
accessible to blind and low vision (BLV) users. Image-based NLG systems have
progressed to the point where they can begin to address this persistent
societal problem, but these systems will not be fully successful unless we
evaluate them on metrics that guide their development correctly. Here, we argue
against current referenceless metrics -- those that don't rely on
human-generated ground-truth descriptions -- on the grounds that they do not
align with the needs of BLV users. The fundamental shortcoming of these metrics
is that they cannot take context into account, whereas contextual information
is highly valued by BLV users. To substantiate these claims, we present a study
with BLV participants who rated descriptions along a variety of dimensions. An
in-depth analysis reveals that the lack of context-awareness makes current
referenceless metrics inadequate for advancing image accessibility, requiring a
rethinking of referenceless evaluation metrics for image-based NLG systems.
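To make the structural point concrete: a referenceless metric such as
CLIPScore (our choice of representative example; the abstract does not single
out any one metric) scores an image against a candidate description with no
input slot for the page or article the image appears in, so two descriptions
tailored to different contexts necessarily receive the same score. A sketch
using the Hugging Face CLIP implementation:

```python
# A sketch using a CLIPScore-style referenceless metric as a representative
# example (our assumption). The signature is the point: there is no
# parameter for the page or article context the image appears in.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, description: str) -> float:
    """CLIPScore-style score: 2.5 * max(cosine similarity, 0). Context-blind
    by construction, since only the image and the text are inputs."""
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    sim = torch.nn.functional.cosine_similarity(
        out.image_embeds, out.text_embeds).item()
    return 2.5 * max(sim, 0.0)
```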