3 research outputs found
Evaluating the Ripple Effects of Knowledge Editing in Language Models
Modern language models capture a large body of factual knowledge. However,
some facts can be incorrectly induced or become obsolete over time, resulting
in factually incorrect generations. This has led to the development of various
editing methods that allow updating facts encoded by the model. Evaluation of
these methods has primarily focused on testing whether an individual fact has
been successfully injected, and whether predictions for other subjects remain
unchanged. Here we argue that such evaluation is limited, since injecting one
fact (e.g., "Jack Depp is the son of Johnny Depp") introduces a "ripple
effect" in the form of additional facts that the model needs to update (e.g.,
"Jack Depp is the sibling of Lily-Rose Depp"). To address this issue, we
propose a novel set of evaluation criteria that consider the implications of an
edit on related facts. Using these criteria, we then construct RippleEdits, a
diagnostic benchmark of 5K factual edits, capturing a variety of types of
ripple effects. We evaluate prominent editing methods on RippleEdits, showing
that current methods fail to introduce consistent changes in the model's
knowledge. In addition, we find that a simple in-context editing baseline
obtains the best scores on our benchmark, suggesting a promising research
direction for model editing.
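
The in-context editing baseline mentioned above is simple enough to sketch: rather than changing model weights, the edited fact is prepended to the prompt and the model is asked to answer consistently with it. A minimal Python illustration, where generate stands in for a hypothetical LM completion function and the prompt wording is an assumption, not the benchmark's exact template:

    # Minimal sketch of an in-context editing baseline. `generate` is a
    # hypothetical stand-in for an LM completion API; the prompt wording
    # is illustrative, not the paper's exact template.
    def in_context_edit(generate, edited_fact: str, question: str) -> str:
        prompt = (
            f"Imagine that {edited_fact}.\n"
            f"Answer the question consistently with this fact.\n"
            f"Q: {question}\nA:"
        )
        return generate(prompt)

    # A ripple-effect probe: after injecting one fact, logically implied
    # facts should change as well, e.g.
    # in_context_edit(generate,
    #     "Jack Depp is the son of Johnny Depp",
    #     "Who is Lily-Rose Depp's sibling?")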
Answering Questions by Meta-Reasoning over Multiple Chains of Thought
Modern systems for multi-hop question answering (QA) typically break
questions into a sequence of reasoning steps, termed chain-of-thought (CoT),
before arriving at a final answer. Often, multiple chains are sampled and
aggregated through a voting mechanism over the final answers, but the
intermediate steps themselves are discarded. While such approaches improve
performance, they do not consider the relations between intermediate steps
across chains and do not provide a unified explanation for the predicted
answer. We introduce Multi-Chain Reasoning (MCR), an approach which prompts
large language models to meta-reason over multiple chains of thought, rather
than aggregating their answers. MCR examines different reasoning chains, mixes
information between them and selects the most relevant facts in generating an
explanation and predicting the answer. MCR outperforms strong baselines on 7
multi-hop QA datasets. Moreover, our analysis reveals that MCR explanations
exhibit high quality, enabling humans to verify MCR's answers.
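
The contrast between answer-level voting and meta-reasoning can be made concrete with a short sketch: sample several chains, then either vote over their final answers (the baseline) or feed all intermediate steps back to the model in a second prompt (the MCR-style step). Here generate is again a hypothetical LM completion function and the prompt templates are assumptions:

    from collections import Counter

    # `generate(prompt, temperature)` is a hypothetical LM completion
    # function; the prompt templates are illustrative assumptions.

    def sample_chains(generate, question: str, n: int = 5) -> list[str]:
        prompt = f"Q: {question}\nLet's think step by step:"
        return [generate(prompt, temperature=0.7) for _ in range(n)]

    def vote_over_answers(chains: list[str]) -> str:
        # Baseline: discard intermediate steps, vote over final answers.
        finals = [chain.strip().splitlines()[-1] for chain in chains]
        return Counter(finals).most_common(1)[0][0]

    def meta_reason(generate, question: str, chains: list[str]) -> str:
        # MCR-style: show the model all chains so it can mix facts
        # across them into one explanation and answer.
        context = "\n\n".join(
            f"Chain {i + 1}:\n{chain}" for i, chain in enumerate(chains)
        )
        prompt = (
            f"{context}\n\n"
            f"Based on the relevant facts in the chains above, explain "
            f"and answer:\nQ: {question}\nA:"
        )
        return generate(prompt, temperature=0.0)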
QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs
Existing benchmarks for open-domain question answering (ODQA) typically focus
on questions whose answers can be extracted from a single paragraph. By
contrast, many natural questions, such as "What players were drafted by the
Brooklyn Nets?" have a list of answers. Answering such questions requires
retrieving and reading many passages from a large corpus. We introduce
QAMPARI, an ODQA benchmark in which each question's answer is a list of
entities spread across many paragraphs. We created QAMPARI by (a) generating questions
with multiple answers from Wikipedia's knowledge graph and tables, (b)
automatically pairing answers with supporting evidence in Wikipedia paragraphs,
and (c) manually paraphrasing questions and validating each answer. We train
ODQA models from the retrieve-and-read family and find that QAMPARI is
challenging in terms of both passage retrieval and answer generation, reaching
an F1 score of 26.6 at best. Our results highlight the need for developing ODQA
models that handle a broad range of question types, including single-answer and
multi-answer questions.
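
For questions with list-valued answers, the reported F1 is naturally computed over answer sets rather than single spans. A plausible sketch of such a scorer; exact-match comparison after lowercasing is a simplifying assumption, and the benchmark's official scorer may normalize answers differently:

    # Set-level F1 for multi-answer questions: precision and recall over
    # predicted vs. gold answer sets. Exact matching after lowercasing
    # is a simplifying assumption.
    def list_answer_f1(predicted: list[str], gold: list[str]) -> float:
        pred = {a.strip().lower() for a in predicted}
        ref = {a.strip().lower() for a in gold}
        hits = len(pred & ref)
        if hits == 0:
            return 0.0
        precision = hits / len(pred)
        recall = hits / len(ref)
        return 2 * precision * recall / (precision + recall)

    # list_answer_f1(["Kyrie Irving", "Derrick Favors"],
    #                ["Derrick Favors", "MarShon Brooks"])  # -> 0.5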