FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
Theory of mind (ToM) evaluations currently focus on testing models using
passive narratives that inherently lack interactivity. We introduce FANToM, a
new benchmark designed to stress-test ToM within information-asymmetric
conversational contexts via question answering. Our benchmark draws upon
important theoretical requisites from psychology and necessary empirical
considerations when evaluating large language models (LLMs). In particular, we
formulate multiple types of questions that demand the same underlying
reasoning, in order to identify an illusory or false sense of ToM capability in
LLMs. We show that
FANToM is challenging for state-of-the-art LLMs, which perform significantly
worse than humans even with chain-of-thought reasoning or fine-tuning.
Comment: EMNLP 2023. Code and dataset can be found here:
https://hyunw.kim/fanto
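Below is a minimal Python sketch of the consistency idea the abstract describes: a model is credited with ToM for a conversation only if it answers every linked question variant correctly, so a lucky hit on one phrasing does not count. The data layout, grading helper, and `predict` callable are illustrative assumptions, not the benchmark's released evaluation code.

```python
# Hypothetical sketch of consistency-based ToM scoring; not FANToM's own code.

def grade(prediction: str, gold: str) -> bool:
    """Toy correctness check; the benchmark's real grading is richer."""
    return prediction.strip().lower() == gold.strip().lower()

def consistent_tom_score(question_sets, predict):
    """question_sets: list of lists of {"question": ..., "answer": ...} dicts,
    where each inner list holds question variants probing the same underlying
    belief. `predict` is any callable wrapping an LLM."""
    credited = 0
    for variants in question_sets:
        # Credit the set only if every rephrased variant is answered correctly.
        if all(grade(predict(v["question"]), v["answer"]) for v in variants):
            credited += 1
    return credited / len(question_sets)
```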
ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind
Theory of Mind (ToM), the capacity to comprehend the mental states of
distinct individuals, is essential for numerous practical applications. With
the development of large language models, there is a heated debate about
whether they are able to perform ToM tasks. Previous studies have used
different tasks and prompts to test the ToM on large language models and the
results are inconsistent: some studies assert that these models are capable of
exhibiting ToM, while others suggest the opposite. In this study, we present
ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on
Sally-Anne and Smarties tests. We created 30 variations of each test (e.g.,
changing the person's name, location, and items). For each variation, we test
the model's understanding of different aspects: reality, belief, 1st order
belief, and 2nd order belief. We adapt our data for various tasks by creating
unique prompts tailored for each task category: Fill-in-the-Blank, Multiple
Choice, True/False, Chain-of-Thought True/False, Question Answering, and Text
Completion. A model with robust ToM should achieve good performance across
different prompts and tests. We evaluated two
GPT-3.5 models, text-davinci-003 and gpt-3.5-turbo-0301, with our datasets. Our
results indicate that consistent performance in ToM tasks remains a challenge.
Comment: work in progress
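As a rough illustration of the variation-generation procedure described above, the sketch below builds Sally-Anne-style stories by swapping names, items, and locations, and tags each with the false-belief and reality answers. The template text, word lists, and prompt wrappers are placeholders, not the released ToMChallenges data.

```python
import itertools
import random

# Placeholder template and word lists; the released dataset uses its own wording.
TEMPLATE = (
    "{a} puts the {item} in the {loc1} and leaves the room. "
    "While {a} is away, {b} moves the {item} to the {loc2}. "
    "Where will {a} look for the {item} on returning?"
)
NAME_PAIRS = [("Sally", "Anne"), ("Tom", "Maria"), ("Ravi", "Lena"), ("Mia", "Noah")]
ITEMS = ["marble", "key", "cookie"]
LOCATION_PAIRS = [("basket", "box"), ("drawer", "shelf"), ("bag", "jar")]

def make_variations(n=30, seed=0):
    combos = list(itertools.product(NAME_PAIRS, ITEMS, LOCATION_PAIRS))
    random.Random(seed).shuffle(combos)
    variations = []
    for (a, b), item, (loc1, loc2) in combos[:n]:
        variations.append({
            "protagonist": a,
            "story": TEMPLATE.format(a=a, b=b, item=item, loc1=loc1, loc2=loc2),
            "belief_answer": loc1,   # where the protagonist falsely believes the item is
            "reality_answer": loc2,  # where the item actually is
        })
    return variations

def to_prompts(v):
    """Wrap one variation in two of the task formats named in the abstract."""
    return {
        "true_false": f"{v['story']}\nTrue or False: {v['protagonist']} will "
                      f"look in the {v['belief_answer']}.",
        "question_answering": f"{v['story']}\nAnswer with a single word.",
    }
```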
How to Evaluate your Question Answering System Every Day and Still Get Real Work Done
In this paper, we report on Qaviar, an experimental automated evaluation
system for question answering applications. The goal of our research was to
find an automatically calculated measure that correlates well with human
judges' assessment of answer correctness in the context of question answering
tasks. Qaviar judges the response by computing recall against the stemmed
content words in the human-generated answer key. It counts the answer as
correct if it exceeds a given recall threshold. We determined that the answer
correctness predicted by Qaviar agreed with the human judges 93% to 95% of the
time. Forty-one question-answering systems were ranked by both Qaviar and
human assessors,
and these rankings correlated with a Kendall's Tau measure of 0.920, compared
to a correlation of 0.956 between human assessors on the same data.
Comment: 6 pages, 3 figures, to appear in Proceedings of the Second
International Conference on Language Resources and Evaluation (LREC 2000)
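The recall-based judgement is simple enough to sketch. Below is a hedged Python approximation: the stemmer, stopword list, and default threshold are stand-ins for the paper's actual preprocessing, which the abstract does not specify.

```python
import re

# Crude stand-ins; Qaviar's actual stemming and stopword handling may differ.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "was", "it"}

def crude_stem(word: str) -> str:
    """Very rough suffix stripping standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def content_stems(text: str) -> set:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}

def judge(response: str, answer_key: str, threshold: float = 0.5) -> bool:
    """Count the response correct if its recall of the answer key's stemmed
    content words meets the given threshold."""
    key = content_stems(answer_key)
    if not key:
        return False
    recall = len(key & content_stems(response)) / len(key)
    return recall >= threshold
```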
Designing Effective Questions for Classroom Response System Teaching
Classroom response systems (CRSs) can be potent tools for teaching physics.
Their efficacy, however, depends strongly on the quality of the questions used.
Creating effective questions is difficult, and differs from creating exam and
homework problems. Every CRS question should have an explicit pedagogic purpose
consisting of a content goal, a process goal, and a metacognitive goal.
Questions can be engineered to fulfil their purpose through four complementary
mechanisms: directing students' attention, stimulating specific cognitive
processes, communicating information to instructor and students via
CRS-tabulated answer counts, and facilitating the articulation and
confrontation of ideas. We identify several tactics that help in the design of
potent questions, and present four "makeovers" showing how these tactics can be
used to convert traditional physics questions into more powerful CRS questions.
Comment: 11 pages, including 6 figures and 2 tables. Submitted (and mostly
approved) to the American Journal of Physics. Based on invited talk BL05 at
the 2005 Winter Meeting of the American Association of Physics Teachers
(Albuquerque, NM)