73 research outputs found
Transferring Procedural Knowledge across Commonsense Tasks
Stories about everyday situations are an essential part of human
communication, motivating the need to develop AI agents that can reliably
understand these stories. Despite the long list of supervised methods for story
completion and procedural understanding, current AI has no mechanisms to
automatically track and explain procedures in unseen stories. To bridge this
gap, we study the ability of AI models to transfer procedural knowledge to
novel narrative tasks in a transparent manner. We design LEAP: a comprehensive
framework that integrates state-of-the-art modeling architectures, training
regimes, and augmentation strategies based on both natural and synthetic
stories. To address the lack of densely annotated training data, we devise a
robust automatic labeler based on few-shot prompting to enhance the augmented
data. Our experiments with in- and out-of-domain tasks reveal insights into the
interplay of different architectures, training regimes, and augmentation
strategies. LEAP's labeler has a clear positive impact on out-of-domain
datasets, while the resulting dense annotation provides native explainability.
BRAINTEASER: Lateral Thinking Puzzles for Large Language Models
The success of language models has inspired the NLP community to attend to
tasks that require implicit and complex reasoning, relying on human-like
commonsense mechanisms. While such vertical thinking tasks have been relatively
popular, lateral thinking puzzles have received little attention. To bridge
this gap, we devise BRAINTEASER: a multiple-choice Question Answering task
designed to test the model's ability to exhibit lateral thinking and defy
default commonsense associations. We design a three-step procedure for creating
the first lateral thinking benchmark, consisting of data collection, distractor
generation, and generation of adversarial examples, leading to 1,100 puzzles
with high-quality annotations. To assess the consistency of lateral reasoning
by models, we enrich BRAINTEASER based on a semantic and contextual
reconstruction of its questions. Our experiments with state-of-the-art
instruction- and commonsense language models reveal a significant gap between
human and model performance, which is further widened when consistency across
adversarial formats is considered. We make all of our code and data available
to stimulate work on developing and evaluating lateral thinking models.
Knowledge-enhanced Agents for Interactive Text Games
Communication via natural language is a key aspect of machine intelligence,
and it requires computational models to learn and reason about world concepts,
with varying levels of supervision. Significant progress has been made on
fully-supervised non-interactive tasks, such as question-answering and
procedural text understanding. Yet, various sequential interactive tasks, as in
text-based games, have revealed limitations of existing approaches in terms of
coherence, contextual awareness, and their ability to learn effectively from
the environment. In this paper, we propose a knowledge-injection framework for
improved functional grounding of agents in text-based games. Specifically, we
consider two forms of domain knowledge that we inject into learning-based
agents: memory of previous correct actions and affordances of relevant objects
in the environment. Our framework supports two representative model classes:
reinforcement learning agents and language model agents. Furthermore, we devise
multiple injection strategies for the above domain knowledge types and agent
architectures, including injection via knowledge graphs and augmentation of the
existing input encoding strategies. We experiment with four models on the 10
tasks in the ScienceWorld text-based game environment, to illustrate the impact
of knowledge injection on various model configurations and challenging task
settings. Our findings provide crucial insights into the interplay between task
properties, model architectures, and domain knowledge for interactive contexts.
Comment: Published at K-CAP '2
Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
Retrieval-augmented language models (RALMs) represent a substantial
advancement in the capabilities of large language models, notably in reducing
factual hallucination by leveraging external knowledge sources. However, the
reliability of the retrieved information is not always guaranteed. The
retrieval of irrelevant data can lead to misguided responses, potentially
causing the model to overlook its inherent knowledge, even when it possesses
adequate information to address the query. Moreover, standard RALMs often
struggle to assess whether they possess adequate knowledge, both intrinsic and
retrieved, to provide an accurate answer. In situations where knowledge is
lacking, these systems should ideally respond with "unknown" when the answer is
unattainable. In response to these challenges, we introduce Chain-of-Noting
(CoN), a novel approach aimed at improving the robustness of RALMs in facing
noisy, irrelevant documents and in handling unknown scenarios. The core idea of
CoN is to generate sequential reading notes for retrieved documents, enabling a
thorough evaluation of their relevance to the given question and integrating
this information to formulate the final answer. We employed ChatGPT to create
training data for CoN, which we subsequently used to train a LLaMa-2 7B model.
Our experiments across four open-domain QA benchmarks show that RALMs equipped
with CoN significantly outperform standard RALMs. Notably, CoN achieves an
average improvement of +7.9 in EM score given entirely noisy retrieved
documents and +10.5 in rejection rates for real-time questions that fall
outside the pre-training knowledge scope.
Comment: Preprin
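The two metrics quoted in the abstract above, EM score and rejection rate, can be illustrated with a minimal sketch. The abstract does not define either metric, so everything here is an assumption: the normalization follows the lowercase, strip-punctuation, drop-articles convention common in open-domain QA, a rejection is counted when the normalized output is exactly "unknown", and the function names are hypothetical rather than taken from the paper's code.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)


def em_score(predictions: List[str], references: List[List[str]]) -> float:
    """Percentage of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def rejection_rate(predictions: List[str]) -> float:
    """Percentage of predictions that are the abstention answer (assumed: "unknown")."""
    rejected = sum(normalize(p) == "unknown" for p in predictions)
    return 100.0 * rejected / len(predictions)
```

Under these assumptions, a reported gain such as +7.9 EM would be the difference between two `em_score` values computed on the same question set.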
Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering
Recent developments in pre-trained neural language modeling have led to leaps
in accuracy on commonsense question-answering benchmarks. However, there is
increasing concern that models overfit to specific tasks, without learning to
utilize external knowledge or perform general semantic reasoning. In contrast,
zero-shot evaluations have shown promise as a more robust measure of a model's
general reasoning abilities. In this paper, we propose a novel neuro-symbolic
framework for zero-shot question answering across commonsense tasks. Guided by
a set of hypotheses, the framework studies how to transform various
pre-existing knowledge resources into a form that is most effective for
pre-training models. We vary the set of language models, training regimes,
knowledge sources, and data generation strategies, and measure their impact
across tasks. Extending prior work, we devise and compare four constrained
distractor-sampling strategies. We provide empirical results across five
commonsense question-answering tasks with data generated from five external
knowledge resources. We show that, while an individual knowledge graph is
better suited for specific tasks, a global knowledge graph brings consistent
gains across different tasks. In addition, both preserving the structure of the
task and generating fair and informative questions help language models
learn more effectively.
Comment: AAAI 202
LASER: LLM Agent with State-Space Exploration for Web Navigation
Large language models (LLMs) have been successfully adapted for interactive
decision-making tasks like web navigation. While achieving decent performance,
previous methods implicitly assume a forward-only execution mode for the model,
where they only provide oracle trajectories as in-context examples to guide the
model on how to reason in the environment. Consequently, the model could not
handle more challenging scenarios not covered in the in-context examples, e.g.,
mistakes, leading to sub-optimal performance. To address this issue, we propose
to model the interactive task as state space exploration, where the LLM agent
transitions among a pre-defined set of states by performing actions to complete
the task. This formulation enables flexible backtracking, allowing the model to
recover from errors easily. We evaluate our proposed LLM Agent with State-Space
ExploRation (LASER) on both the WebShop task and amazon.com. Experimental
results show that LASER significantly outperforms previous methods and closes
the gap with human performance on the web navigation task.
Comment: 4 pages, 2 figure
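The state-space formulation described in the abstract above can be sketched as a small transition table in which some states expose a "back" action, so the agent can undo a step instead of following a forward-only trajectory. This is an illustrative sketch, not the authors' implementation: the state names, action names, and `toy_policy` (a scripted stand-in for the LLM) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical pre-defined state space: state -> {action: next_state}.
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "search": {"submit_query": "results"},
    "results": {"open_item": "item", "back": "search"},
    "item": {"buy": "done", "back": "results"},
}

Trace = List[Tuple[str, str]]
Policy = Callable[[str, List[str], Trace], str]


def explore(policy: Policy, start: str = "search", max_steps: int = 10) -> Tuple[str, Trace]:
    """Run the agent until it reaches "done" or exhausts its step budget."""
    state = start
    trace: Trace = []
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = sorted(TRANSITIONS[state])
        action = policy(state, allowed, trace)
        if action not in allowed:
            raise ValueError(f"action {action!r} is not allowed in state {state!r}")
        trace.append((state, action))
        state = TRANSITIONS[state][action]
    return state, trace


def toy_policy(state: str, allowed: List[str], trace: Trace) -> str:
    """Scripted stand-in for the LLM: rejects the first item once, then buys."""
    if state == "search":
        return "submit_query"
    if state == "results":
        return "open_item"
    # In the "item" state: backtrack once to simulate recovering from a mistake.
    if ("item", "back") not in trace:
        return "back"
    return "buy"


final_state, trace = explore(toy_policy)
```

Restricting each state to its own action set keeps the policy's choices valid by construction, and the explicit "back" transitions are what make recovery from a wrong step possible in this sketch.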